Search engines and dynamic content issues

Find out what links can harm your rankings, and how to use robots.txt

Published on March 24, '07
by Serban Ghita

The problem of taming search engines such as Googlebot, Slurp, MSNbot, Teoma, Gigabot and so on appears frequently when you have a large and dynamic site. Some sections can have approximately the same content, for example when you are filtering results in a grid. Many filters are generally built with forms, but if you have link-based filters they can easily be indexed by web crawlers and your rank could decrease. The main problem is that you will serve the robots a lot of pages with no relevance at all, and the parent pages that point to them will get little credit in the web results.

Bad examples

I'll skip the chit-chat and go straight to the examples.

Grid example with multiple link-based filters
Brand filter: Brand 1 / Brand 2 / Brand 3 / Brand 4 [ ... ] / Brand 10
Price filter: 0-50 / 50-150 / 150-300 / 300-550 / > 550

Item code (asc / desc) | Item name (asc / desc) | Item price (asc / desc) | Item warranty (asc / desc) | Action
Code #1  | Item #1  | Price item #1  | Warranty item #1  | Add to cart #1
Code #2  | Item #2  | Price item #2  | Warranty item #2  | Add to cart #2
Code #3  | Item #3  | Price item #3  | Warranty item #3  | Add to cart #3
[ ... ]
Code #20 | Item #20 | Price item #20 | Warranty item #20 | Add to cart #20
Page 1 2 3 .... 100

 

List with examples of URLs from the grid (with filters applied)
Assume that, to see this grid, you are in an item category section: http://www.site.com/category1
Brand URL: http://www.site.com/category1/brand1
Price filter URL: http://www.site.com/category1/filter/0-50
http://www.site.com/category1/filter/0-50/2
http://www.site.com/category1/brand1/filter/0-50
.. and other combinations
Sorting URL: http://www.site.com/category1/brand1/sort=item_code/desc
http://www.site.com/category1/sort=item_price/asc/2
http://www.site.com/category1/filter/0-50/sort=item_price/desc/2
... and other combinations

The above is an actual example of a grid of products (items), very often found in online e-commerce shops.

A major error for a webmaster would be to allow search engines (crawlers) access to all the sections of this grid: the price filter, price filter combinations (e.g. brand selected + price filter selected), sorting links or any other combination of the two. Why? Apart from my own test, which I will reveal here, let's take a look at what Google says:

"Sites with more content can have more opportunities to rank well in Google. It makes sense that having more pages of good content represent more chances to rank in search engine result pages (SERPs). Some SEOs however, do not focus on the user’s needs, but instead create pages solely for search engines. This approach is based on the false assumption that increasing the volume of web pages with random, irrelevant content is a good long-term strategy for a site." (Google Webmaster Central Blog post)

"Hmm, but my pages in the grid help user navigate on the site!" you might say. True, but they hurt your rankings and positioning if indexed:

  • Search engines spend a lot of time indexing these pages on your site rather than concentrating on the parent pages
  • The content is dynamic, and so are the filters. If you reduce the number of items, some of the 'Price filter' pages might disappear and return 404 or 301. A site whose index fluctuates heavily will never get appropriate ranks (mainly because this is how spam sites behave).
  • Using JavaScript (AJAX) is a solution, but not every visitor's browser has JavaScript support, so you need a JavaScript/HTML switch (e.g. href="REAL LINK" onClick="AJAX LINK"); see the sketch just below this list
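Here is a minimal sketch of such a JavaScript/HTML switch. The loadFilter() function is hypothetical (your own AJAX handler); the href is the real URL, so visitors without JavaScript still get a working link:

<a href="http://www.site.com/category1/filter/0-50"
   onclick="loadFilter('/category1/filter/0-50'); return false;">0-50</a>
<!-- with JavaScript: loadFilter() fetches the filtered grid via AJAX and 'return false' cancels the navigation -->
<!-- without JavaScript: the browser simply follows the href -->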

You see, the http://www.site.com/category1/brand1/sort=item_code/desc page might contain the same blocks of content as http://www.site.com/category1/brand1/3 (meaning page three of the items in Category1 & Brand1). It's better to focus the search engines on indexing the second URL rather than the first (or any of the URLs with a filter applied).

Feeding filtered content from your site to search engines is similar to giving them access to the latest searches performed on your site. Searching your site's database is just another form of filtering.

Lots of webmasters optimize the SEO of their site's search section just to get more pages into the index. This is a major mistake and often leads to your site not ranking the way it should, or even being banned from the results. An obvious example:

  • URL of the search page: http://www.site.com/search/[keyword]
  • A section of the site prints a list of the latest searches, with links pointing to URLs like the one above.
  • The search engines index these pages and might be stuck there for a while, because the search section can generate a lot of pages (just like the filters in the grid), e.g. http://www.site.com/search/keyword/filter/0-50, http://www.site.com/search/keyword/brand1, etc.
  • Basically, your site's search section can generate a ton of URLs, which is very bad (see the robots.txt sketch after this list)
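If your search pages live under a /search/ prefix like in the example above, a simple robots.txt rule will keep crawlers out of all of them (a sketch; adjust the prefix to your own URL scheme):

User-agent: *
Disallow: /search/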

Solutions

I've given you some bad examples of how to screw up your rankings; now let's see what options we have.

In the first example with the grid of items, you can do the following:

  • use JavaScript for the filters, but you will lose visitors that don't have JavaScript enabled. If you build the filters entirely with JavaScript and no HTML switch, you will also prevent the rest of the paginated pages from getting indexed.
  • use forms that POST the filtering parameters. You might run into trouble with the code here, because the pagination must obey the filters; the pagination is GET based, and it is recommended that it stays that way.
  • use links (hrefs) on all filters with rel="nofollow" on them. Google's official blog post says: "From now on, when Google sees the attribute (rel="nofollow") on hyperlinks, those links won't get any credit when we rank websites in our search results." So this only solves half of our problem! Those pages will not get credit in the ranking process, but they will still be accessed by the robots. Google's official blog post on robots exclusion procedures: "Using NOFOLLOW is generally not the best method to ensure content does not appear in our search results. Using the NOINDEX tag on individual pages or controlling access using robots.txt is the best way to achieve this." This means that if we put the rel="nofollow" attribute on the <a> tags in the grid, but someone posts a link on their website to a URL like http://www.site.com/category1/brand1/sort=item_code/desc, it will still get indexed. The same goes for that link appearing in a sitemap without a nofollow attribute. A sketch of the nofollow/NOINDEX combination follows this list.
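As a sketch of the combination mentioned in the last point: put rel="nofollow" on the filter/sort links in the grid, and a NOINDEX robots meta tag in the <head> of the filtered pages themselves, so they stay out of the index even if someone links to them from outside:

<!-- on the listing page: the sorting link passes no credit -->
<a href="http://www.site.com/category1/brand1/sort=item_code/desc" rel="nofollow">Item code (desc)</a>

<!-- in the <head> of the filtered/sorted page itself -->
<meta name="robots" content="noindex,follow">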

The best method is to use rel="nofollow" along with robots.txt directives:

User-agent: *
Disallow: /*filter
Disallow: /*sort=

Both Googlebot and Slurp claim to recognize this syntax. See Slurp robots.txt and Googlebot robots.txt info.
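Applied to the example URLs from the grid above, the two directives match like this:

Disallow: /*filter  blocks  /category1/filter/0-50
                            /category1/brand1/filter/0-50
Disallow: /*sort=   blocks  /category1/sort=item_price/asc/2
                            /category1/filter/0-50/sort=item_price/desc/2

while the category, brand and plain pagination URLs (http://www.site.com/category1, /category1/brand1, /category1/brand1/3) stay crawlable.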

Problems

Here are some issues with the web robots that I have encountered:

User-agent: *
Disallow: /*pdf$
Disallow: /*xls$

These two lines prevent the bots from accessing URLs like:

/accessories/apc/full/xls
/dell/1/ASC/DESC/NONE/xls
/memorii/mediaplayere/prestigio/1/pdf

Still, MSNbot and Teoma (Ask.com) are not obeying that expression (as of March 23, 2007). The Ask.com team says they are looking into it; I don't have an official answer yet. Note that if we remove the $ from Disallow: /*pdf, the line will block every URL that contains the word 'pdf' (e.g. /news/12/we_have_a_new_pdf_catalogue.aspx).

MSNbot ignores Disallow: /promo/*/similar_products and indexes http://www.site.com/promo/item_name~95025/similar_products/4. I don't see the bot indexing /promo/item_name~95025/similar_products, which is the root page (with pagination).

These problems may be caused by crawlers working from a cached copy of a frequently updated robots.txt. If we look at some statistics:

robots.txt hits, 11 March - 25 March
(website with 150,000+ unique links, 1 year old)

UA        | Hits | Last access
Slurp     | 1834 | 2007-03-25 01:46:13
GoogleBot | 1478 | 2007-03-25 02:25:36
MSNbot    |  794 | 2007-03-25 02:14:31
Gigabot   |  120 | 2007-03-25 01:59:19
Teoma     |   44 | 2007-03-25 01:08:27

We can see that MSNbot, Gigabot and Teoma request robots.txt far less often, so they might crawl some pages using an old cached copy (especially if you update the file frequently).

Conclusions

  • do not allow search engines to index generated pages that have no value to them
  • do not allow search engines to index filter pages
  • do not allow search engines to index search results
  • "Content is king."
  • watch your web server's access log closely (check for 404 errors and redirect them)
  • use tools to check your robots.txt, such as Google's robots.txt analysis tool (this tool does not apply globally to all search engines)
  • make sure you understand the robots.txt standards (a combined example follows below)
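Putting the directives from this article together, a combined robots.txt for the example site would look roughly like this (the paths are the illustrative ones used above, so adjust them to your own URL scheme; per the updates below, MSNbot only honors the extension form such as Disallow: /*.pdf$):

User-agent: *
Disallow: /search/
Disallow: /*filter
Disallow: /*sort=
Disallow: /*pdf$
Disallow: /*xls$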

Updates from the search engines

• MSNbot

[UPDATE 28 mar 07 #1] - Brent Hands (PM for MSNbot, Live Search's crawler) - "The problem here comes from the fact that we do not interpret the “*” and “$” operators as a matter of course. However, we do recognize the following pattern:

Disallow: /*.pdf$

So, by adding the “.”, you can effectively block all documents with a specific extension."
Maybe I've exaggerated a bit with mod_rewrite; take this example: http://www.itpromo.net/servers/1/pdf - a PDF version of the first page of the 'Servers Deals' section (http://www.itpromo.net/servers or http://www.itpromo.net/servers/1). URLs like this have no file extension, so the /*.pdf$ pattern does not match them.

[UPDATE 03 apr 07 #1] - Brent Hands with the official Microsoft response:

"Unfortunately, we currently only interpret ‘*’ when used to filter by file extension.  As such, the following will work:

User-agent: *
Disallow: /*.pdf$
Disallow: /*.xls$

But this is the only context in which MSNBot understands wildcards. [...] we will work [to] resolve this as quickly as possible."

• Ask.com / Teoma

[UPDATE 28 mar 07 #2] - Ask.com responded very quickly: "We have forwarded this situation to the appropriate technical team for investigation, and will get back to you as soon as we are able. Thank you for your patience."

[UPDATE 03 apr 07 #2] - Official Ask.com response: "We do support the '*' wildcard character, but unfortunately, not regular expressions." I'm still having problems blocking Teoma from crawling URLs like /pda/full/xls or /notebook/full/html.
