Robots.txt hits analysis

A brief study of how search engine crawlers hit the robots.txt file over a short period of time.

Published: 26 May 2007
Author: Serban Ghita

From April 1st to May 26th (plus a few hours of May 27th) I ran a test on the robots.txt file of one of my websites, counting all the external hits.

Before we go any further you must take the following data into account:

- in April the site had 100,000+ unique links reported in Google's index and about 2,000+ in Yahoo's index
- in May the site had 40,000 unique links reported in Google's index and about 58,000 in Yahoo's index
- some significant changes were made to the site: crawler access to pages generated by filters was restricted (see my previous article - Search engines and dynamic content issues), the navigation menu changed at the end of the period, and for 2 or 3 days near the end of the period the robots.txt file was blank or showed errors
- the Google Webmaster Tools interface reported 240,000+ pages restricted by robots.txt, and the Crawl Speed was automatically changed from Faster to Normal (the Faster option is no longer available to me)

If we look at the complete User-Agent list of crawlers that hit the robots.txt file, the top 5 places are taken by Slurp, Googlebot, Gigabot, msnbot and Ask Jeeves/Teoma. The complete surprise is Gigabot in 3rd position, because the site in question gets barely any visits from gigablast.com. Gigablast.com also returns mostly year-old pages when I search for the site URL in their index (about 19,000 pages).
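The article does not say how the hits were recorded. As a minimal sketch of one possible approach, assuming an access log in Apache combined format, a per-crawler tally of robots.txt requests could be built as follows. The function name count_robots_hits and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption; adjust for your server):
# host ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_robots_hits(lines):
    """Count requests for /robots.txt per User-Agent string."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            hits[match.group("agent")] += 1
    return hits

# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [01/Apr/2007:00:10:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Gigabot/2.0"',
    '192.0.2.2 - - [01/Apr/2007:00:12:00 +0000] "GET /index.html HTTP/1.1" 200 900 "-" "Gigabot/2.0"',
    '192.0.2.1 - - [02/Apr/2007:03:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Gigabot/2.0"',
    '192.0.2.3 - - [02/Apr/2007:04:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Gigabot/1.0"',
]

for agent, count in count_robots_hits(sample).most_common():
    print(count, agent)
```

Sorting with most_common() gives the kind of ranking discussed above, with the most active crawler first.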

The Gigabot spider has 5 different User-Agent signatures:

  • Gigabot/1.0
  • Gigabot/2.0
  • Gigabot/2.0 (http://www.gigablast.com/spider.html)
  • Gigabot/2.0/gigablast.com/spider.html
  • Gigabot/2.0att
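Because one crawler can show up under several signatures, a ranking by crawler needs the signatures grouped first. A minimal sketch, assuming it is enough to take the token before the first "/" or space (the helper name crawler_family is hypothetical):

```python
from collections import Counter

# The five Gigabot signatures listed in the article.
SIGNATURES = [
    "Gigabot/1.0",
    "Gigabot/2.0",
    "Gigabot/2.0 (http://www.gigablast.com/spider.html)",
    "Gigabot/2.0/gigablast.com/spider.html",
    "Gigabot/2.0att",
]

def crawler_family(user_agent):
    """Return the product token before the first '/' or space, lowercased."""
    token = user_agent.split("/", 1)[0].split(" ", 1)[0]
    return token.lower()

families = Counter(crawler_family(ua) for ua in SIGNATURES)
print(families)  # Counter({'gigabot': 5})
```

This simple rule would not work for crawlers that identify themselves inside a browser-style string starting with "Mozilla/"; those would need their own matching rule.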

The uncontested winner of this study is Yahoo's Slurp crawler, which hit the robots.txt file 3405 times in 57 days (counting May 27th, a day that had only just started), an astonishing average of 59.7 robots.txt hits per day. As you can see from the graph, there were days when Slurp passed 250 hits. Those were the days when I heavily edited the robots.txt file (you can also see Google reacting later). I noticed that when you make significant changes to your robots.txt file, the crawlers tend to revisit it after a short time, for an obvious reason: it saves bandwidth for them and for us (e.g. if you have just restricted some areas of the site, the sooner the crawlers find out, the better for your server's bandwidth).
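The average can be checked with a few lines of arithmetic. This is a small sketch using only the figures quoted above; the year 2007 comes from the publication date, and the variable names are arbitrary.

```python
from datetime import date

# Figures taken from the article.
slurp_hits = 3405
start = date(2007, 4, 1)
end = date(2007, 5, 27)  # the partial last day is counted as a full day

days = (end - start).days + 1  # inclusive of both endpoints
average = slurp_hits / days

print(days)               # 57
print(round(average, 1))  # 59.7
```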

At the end of the period (as you can see from the graph), I messed up the robots.txt a little: it was blank or showed code errors for 2-3 days. This happened because I had made some changes to the site's code, and my robots.txt is dynamic (it relies on some classes). Yahoo's Slurp crawler reacted to this by increasing its hit rate.

In this case Googlebot keeps a fairly constant rate of about 24-25 hits per day.

The YahooFeedSeeker bot makes a constant 1 hit per day. Yahoo also runs a test bot for the next version (YahooFeedSeeker Testing/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide)), which produced a couple of hits.

I couldn't stop the msnbot-media bot from indexing my site, even though I had the following rule in robots.txt:

User-agent: MSNBot-Media
Disallow: /
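One way to see how a parser reads this rule is Python's standard urllib.robotparser module. This is only a sketch of how that one parser interprets the record; it says nothing about how the real msnbot-media crawler behaves, and the image path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# The rule exactly as quoted in the article.
rules = [
    "User-agent: MSNBot-Media",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical URL path, used only for the check.
media_allowed = parser.can_fetch("msnbot-media/1.0", "/images/photo.jpg")
others_allowed = parser.can_fetch("Googlebot", "/images/photo.jpg")

print(media_allowed)   # False
print(others_allowed)  # True
```

Python's parser compares the name before the "/" case-insensitively, so for this parser the rule already covers msnbot-media/1.0. A crawler that ignores the rule is either matching differently or not honoring it at all.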

I remember taking this from the site of a huge corporation, which had these lines in its robots.txt. In the future I will try User-agent: msnbot-media/1.0 instead. By the way, do not copy and paste code from other sites; they may have the standard wrong. Take the New York Times, for example:

User-agent: Mediapartners-Google*
Disallow:

As far as I know, the robots.txt standard does not allow the * character as a wildcard inside a User-agent: [...] value (on its own, * means all robots). Also, a Disallow: line with no value actually allows the bot to index all the pages.
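The effect of an empty Disallow: line can be checked with Python's standard urllib.robotparser. In this minimal sketch the trailing * is left out of the User-agent value so that only the Disallow behavior is tested; the helper name allowed and the page path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, agent, path):
    """Parse robots.txt lines and ask whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, path)

empty_disallow = ["User-agent: Mediapartners-Google", "Disallow:"]
full_disallow = ["User-agent: Mediapartners-Google", "Disallow: /"]

print(allowed(empty_disallow, "Mediapartners-Google", "/any/page.html"))  # True
print(allowed(full_disallow, "Mediapartners-Google", "/any/page.html"))   # False
```

An empty value restricts nothing, while a single "/" restricts the whole site, so the two records have opposite effects.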

Summary:

- robots.txt hits from April 1st - May 26 (+ few hours from 27)
- robots.txt hits graphic
- robots.txt standard
- about robots.txt on Wikipedia
