imagesift bot

Author: admi...
Last edited: 2024-09-09 14:00

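This crawler does not obey the robots protocol at all. So what robots.txt text does their own website say will block it?

When you block Google's crawler, you can block ImageSift Bot along with it. Judging from the domain registration information, however, this crawler data company, founded in 2023, has no connection to Google.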


User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
Disallow: /private/

Does ImagesiftBot follow robots.txt rules?

Standard directives in robots.txt that target ImagesiftBot are respected. For example, the following will allow ImagesiftBot to crawl all pages, except those under /private/:

User-Agent: ImagesiftBot
Allow: /
Disallow: /private/
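
As a rough illustration of how the Allow/Disallow pair above could be evaluated, here is a minimal Python sketch. It assumes longest-prefix-match precedence, as described in RFC 9309; ImageSift does not say which matching algorithm ImagesiftBot uses, so treat that as an assumption. The function name `is_allowed` and the sample paths are hypothetical.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be fetched.

    rules: list of (directive, path_prefix) tuples, e.g. ("allow", "/").
    Assumption: the longest matching prefix wins, and Allow wins a tie.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_for_allow:
                best_len = len(prefix)
                allowed = directive == "allow"
    return allowed


# The two rules from the ImagesiftBot group above.
rules = [("allow", "/"), ("disallow", "/private/")]
print(is_allowed(rules, "/blog/post.html"))       # True
print(is_allowed(rules, "/private/report.html"))  # False
```

Parsers that apply the first matching rule instead of the longest one would treat `/private/` differently here, because `Allow: /` is listed first.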

ImagesiftBot also supports the crawl-delay directive in robots.txt files. It interprets the value as the minimum duration, in seconds, between the start of consecutive requests. For example, assume you have specified the following in your robots.txt file:

User-Agent: ImagesiftBot
Crawl-delay: 5

ImagesiftBot will split each day into 5-second intervals and issue at most one request to your domain within each interval.
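
The interval behavior described above can be sketched in Python as follows. This is only one reading of the description, not ImageSift's actual implementation: it assumes the intervals start at midnight, and the names `interval_index` and `IntervalLimiter` are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def interval_index(seconds_since_midnight, crawl_delay):
    """Index of the crawl-delay interval containing the given time of day."""
    return int(seconds_since_midnight // crawl_delay)


class IntervalLimiter:
    """Permit at most one request per crawl-delay interval (assumed model)."""

    def __init__(self, crawl_delay):
        self.crawl_delay = crawl_delay
        self.used = set()  # interval indexes that already carried a request

    def try_request(self, seconds_since_midnight):
        idx = interval_index(seconds_since_midnight, self.crawl_delay)
        if idx in self.used:
            return False  # this interval has already been used
        self.used.add(idx)
        return True


limiter = IntervalLimiter(crawl_delay=5)
results = [limiter.try_request(t) for t in (0.0, 3.0, 5.0, 9.9, 10.0)]
print(results)               # [True, False, True, False, True]
print(SECONDS_PER_DAY // 5)  # 17280 intervals per day
```

With `Crawl-delay: 5`, this model caps the crawler at 17280 requests per day for a domain. Under this bucketing model, two requests that straddle an interval boundary can be less than 5 seconds apart.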

If there is no rule targeting ImagesiftBot, but there is a rule targeting Googlebot, then ImagesiftBot will follow the Googlebot directives. For example, given the following robots.txt, ImagesiftBot will fetch all pages except those under /private/:

User-Agent: *
Disallow: /
User-Agent: Googlebot 
Allow: / 
Disallow: /private/
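
The fallback described above can be sketched in Python as a group lookup. This is a simplified illustration rather than ImageSift's code: `parse_groups` and `select_group` are hypothetical names, the parser handles only the User-Agent, Allow, and Disallow fields, and the final fallback to the `*` group is an assumption based on common robots.txt practice, since the quoted text only describes the Googlebot fallback.

```python
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
Disallow: /private/
"""


def parse_groups(robots_txt):
    """Return {lowercased user-agent: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
        else:
            last_was_agent = False  # other fields are ignored in this sketch
    return groups


def select_group(groups):
    """Prefer the ImagesiftBot group, then Googlebot, then '*' (assumed)."""
    for agent in ("imagesiftbot", "googlebot", "*"):
        if agent in groups:
            return agent, groups[agent]
    return None, []


agent, rules = select_group(parse_groups(ROBOTS_TXT))
print(agent)  # googlebot
print(rules)  # [('allow', '/'), ('disallow', '/private/')]
```

If this lookup order matches the bot's behavior, a site that wants to block ImagesiftBot while still admitting Googlebot needs an explicit `User-Agent: ImagesiftBot` group, because the `*` group is never consulted while a Googlebot group exists.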

What information does ImagesiftBot save?

Along with images, ImagesiftBot saves the following information:

  • Host URL and text on the page

  • Alt text associated with the image

How do we use this information?

Once images and text are downloaded from a webpage, ImageSift analyzes this data from the page and stores the information in an index. Our web intelligence products use this index to enable search and retrieval of similar images.


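Content notice:
If you repost this article, please retain the following information as a sign of support and respect for the author:

imagesift bot, source: https://www.magentola.com/news-read-340.html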