Expecting junk crawlers to respect the robots.txt protocol is not realistic, and most of them can quickly bring a small website to its knees. The only remaining option is to block them manually.
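For reference, well-behaved crawlers can be asked to stay away with entries like the minimal robots.txt sketch below (the bot names are taken from the table that follows); the crawlers listed here generally ignore such entries, which is exactly why the server-side blocks later in this post are needed.

```
User-agent: MJ12bot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /
```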
Examples of junk crawlers
| UA | Description |
|---|---|
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) | Majestic-12, a distributed crawler run for commercial SEO-analysis purposes. |
| Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) | Crawler operated by SEMrush, a platform that sells SEO data such as keyword research and backlinks. |
| Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/) | Crawler operated by Ahrefs, a marketing/SEO tool used mainly as a backlink checker. |
| Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 | Yisou, acquired by UC; brings no traffic yet crawls very aggressively and seriously hurts performance. |
| Scrapy/1.7.3 (+https://scrapy.org) | An open-source crawling framework. |
| Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | A crawler that harvests e-commerce sites: product names, price data, and descriptions. |
| Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot) | SMTBot from SimilarTech, another SEO-data platform; crawls very aggressively. |
Under NGINX you can block these crawlers with a User-Agent signature rule like the following:
```nginx
if ($http_user_agent ~ "Bytespider|SemrushBot|crapy|FeedDemon|DotBot|SMTBot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|python-requests|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}
```
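If the list keeps growing, a map-based variant keeps the signatures in one place instead of repeating a long if() condition in every server block. The sketch below is only an illustration: the variable name $block_ua and the shortened bot list are placeholders you would replace with the full list above.

```nginx
# In the http {} context: map the User-Agent to a flag (variable name is arbitrary)
map $http_user_agent $block_ua {
    default 0;
    "~*(MJ12bot|SemrushBot|AhrefsBot|DotBot|SMTBot|YisouSpider|crapy)" 1;
    "" 1;   # also block requests with an empty User-Agent
}

# Inside the relevant server {} or location {} block
if ($block_ua) {
    return 403;
}
```

After reloading nginx, a quick check such as `curl -A "MJ12bot" https://your-site.example/` (placeholder hostname) should come back with a 403.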
If you are using something like the BT (BaoTa) panel with its built-in firewall, you can edit the User-Agent filter directly and add the junk-crawler signatures there, and keep updating the rules over time based on what shows up in your access logs.
With Apache you can add the filter in .htaccess; the example below redirects matching user agents away from the site:
```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ".*EmbeddedWB.*" [OR]
RewriteCond %{HTTP_USER_AGENT} ".*QunarBot.*" [OR]
RewriteCond %{HTTP_USER_AGENT} ".*Windows 98.*" [OR]
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4.0$"
RewriteRule ^(.*)$ http://google.com/
```
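If you would rather reject these requests outright than redirect them, a minimal .htaccess sketch along the same lines (the bot names are just examples taken from the table above) can return 403 instead:

```apache
RewriteEngine On
# Case-insensitive match against a few of the crawlers listed earlier
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|SemrushBot|AhrefsBot|DotBot|SMTBot|YisouSpider) [NC]
# "-" means no substitution; [F] returns 403 Forbidden, [L] stops further rewriting
RewriteRule ^ - [F,L]
```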
Full filter rule for the BT panel:
```
(Go-http-client|crapy|Abonti|aggregator|AhrefsBot|asterias|BDCbot|BLEXBot|BuiltBotTough|Bullseye|BunnySlippers|ca-crawler|CCBot|Cegbfeieh|CheeseBot|CherryPicker|CopyRightCheck|cosmos|Crescent|discobot|DittoSpyder|DotBot|Download Ninja|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Exabot|ExtractorPro|Fasterfox|FeedBooster|Foobot|Genieo|grub-client|Harvest|hloader|httplib|humanlinks|ieautodiscovery|InfoNaviRobot|IstellaBot|Java1.|JennyBot|k2spider|Kenjin Spider|Keyword Density0.9|larbin|LexiBot|libWeb|libwww|LinkextractorPro|linko|LinkScan8.1a Unix|LinkWalker|LNSpiderguy|lwp-trivial|magpie|Mata Hari|MaxPointCrawler|MegaIndex|Microsoft URL Control|MIIxpc|Mippin|Missigua Locator|Mister PiX|moget|MSIECrawler|NetAnts|NICErsPRO|Niki-Bot|NPBot|Nutch|Offline Explorer|Openfind|panscient.com|ProPowerBot2.14|ProWebWalker|Python-urllib|QueryN Metasearch|RepoMonkey|RMA|SISTRIX|sitecheck.Internetseer.com|SiteSnagger|SnapPreviewBot|SpankBot|spanner|spbot|Spinn3r|suzuran|Szukacz1.4|Teleport|Telesoft|The Intraformant|TheNomad|TightTwatBot|Titan|True_Robot|turingos|TurnitinBot|UbiCrawler|UnisterBot|URLy Warning|VCI|WBSearchBot|Web Downloader6.9|WebAuto|WebBandit|WebCopier|WebEnhancer|WebmasterWorldForumBot|WebReaper|WebSauger|Website Quester|Webster Pro|WebStripper|WebZip|Wotbox|wsr-agent|WWW-Collector-E|Xenu|Zao|Zeus|ZyBORG|Incutio|lmspider|memoryBot|serf|Unknown|uptime files|HTTrack|harvest|audit|dirbuster|pangolin|nmap|sqln|hydra|Parser|libwww|BBBike|sqlmap|w3af|owasp|Nikto|fimap|havij|zmeu|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BabyKrokodil|netsparker|httperf|Baispider|SemrushBot|facebookexternalhit)
```
One more domestic junk crawler, newly added to the list; its User-Agent as captured in the access log:
Mozilla/5.0+(compatible;+Baispider/2.0;++http://www.bai.com/search/spider.html)
It has also tried to submit dangerous function calls via POST; the relevant log entries: