Why you should keep your robots.txt clean!
There is one thing I see happening over and over again: cutting off inbound link-juice by misusing robots.txt directives. What does that mean? Simply put, it means that you’re wasting (massive amounts of) juice flowing into and/or within your website. So stay with me and I’ll show you how to fix this in a bit.
First, let’s have a look at robots.txt in general. Probably the simplest “version” of a robots.txt file is typically just two lines: “User-agent: *” followed by an empty “Disallow:”.
What does it mean? It just means that for every user-agent accessing the domain there is no restriction in crawling. Be aware that “Disallow: /” would probably cause a disaster because you’d block all access to the entire domain. Not so good…
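To make the difference between the two variants tangible, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the minimal file is the usual two-line form with an empty `Disallow:`; the helper name `can_fetch` and the sample URL are only there for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, url, agent="*"):
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Assumed minimal file: every user-agent, nothing disallowed.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# The "disaster" variant: every user-agent, everything disallowed.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

URL = "http://www.domain.com/my-article-123.html"

if __name__ == "__main__":
    print("allow-all file:", can_fetch(ALLOW_ALL, URL))
    print("block-all file:", can_fetch(BLOCK_ALL, URL))
```

A single slash is all that separates "crawl everything" from "crawl nothing", which is why it pays to test a robots.txt change before deploying it.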
On a side note: please make sure that your web server serves the file as “text/plain” (the official specification says so), because a lot of boxes (mostly application servers) tend to serve everything as “text/html” unless told otherwise. And hopefully this is obvious, but the server needs to return an HTTP status code 200.
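Both points can be checked with a few lines of code. The following is a rough sketch, not a complete validator: the function names are made up for this example, and the fetch helper simply uses Python's standard `urllib`.

```python
import urllib.error
import urllib.request


def robots_response_problems(status, content_type):
    """Return a list of human-readable problems for a robots.txt response."""
    problems = []
    if status != 200:
        problems.append("expected HTTP status 200, got %d" % status)
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append("expected text/plain, got %r" % media_type)
    return problems


def fetch_status_and_type(url):
    """Fetch `url` and return (status code, Content-Type header value)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the error still carries the details.
        return error.code, error.headers.get("Content-Type", "")
```

Usage would be along the lines of `robots_response_problems(*fetch_status_and_type("http://www.domain.com/robots.txt"))`, where an empty list means both checks passed.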
All right, now that we’re set with the basics of serving the robots.txt file, let’s have a look at what happens if you’re using robots.txt directives but probably shouldn’t. Let’s say you have pages on your domain which are printable. The default version is available as www.domain.com/my-article-123.html and the printable version resides at www.domain.com/my-article-123.html?print=1 (you could find yourself in the same situation when providing filters, content paging or similar; it doesn’t really matter). Let’s further assume you have ~10,000 articles and, to prevent duplicate content issues, at some point you decided to keep crawlers from accessing and indexing those printable versions. One might think that a robots.txt Disallow rule covering the ?print=1 URLs would do the trick.
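The exact directive is not reproduced here. Purely as an assumption, a wildcard rule such as `Disallow: /*?print=1` (placed under `User-agent: *`) would be one way to write it. The sketch below uses a small hand-written matcher, with hypothetical names, to show which URLs such a rule would catch; it treats `*` as "any characters" and a trailing `$` as "end of URL".

```python
import re

# Hypothetical rule, not taken from the original article.
PRINT_RULE = "/*?print=1"


def rule_matches(pattern, path):
    """Check a robots.txt path pattern ('*' wildcard, optional trailing '$')
    against a URL path including its query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, like robots.txt prefix matching.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for path in ("/my-article-123.html", "/my-article-123.html?print=1"):
        print(path, "blocked" if rule_matches(PRINT_RULE, path) else "allowed")
```

Under that assumed rule, only the printable variant is blocked while the default article URL stays crawlable.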
You could do so, and your duplicate content issues might be solved. The bad news is that by doing so you would also kill all inbound link-juice to those pages. To make it clear: Google will still list pages that earn inbound links (mainly external ones) but are blocked by robots.txt (they’ll just appear in SERPs with their bare URL), but your domain doesn’t benefit from those links anymore, which really sucks. A big German bank (“Volksbanken Raiffeisenbanken”) is an example of how you really should NOT do it.
To prevent search engines from indexing those pages, always use a robots Meta tag and set it to “noindex,follow”, especially if those pages have already earned some links. It makes a huge difference because link-juice will still flow into your site and help other sections of the domain. Another way (especially to fix session-id issues) could be an appropriate implementation of a rel-canonical tag, which would consolidate all link-juice to that single URL (without the appended session-id, of course).
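In the page's head section, the tag in question is `<meta name="robots" content="noindex,follow">`. To verify that the printable pages really carry it, a small check along these lines could help. This is only a sketch built on Python's standard `html.parser`; the class and function names as well as the sample page are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of all <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # "noindex,follow" -> {"noindex", "follow"}
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the set of robots Meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up printable page for illustration.
PRINT_PAGE = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body>Printable article</body></html>"
)
```

A printable page is set up as described when `robots_directives(...)` contains both `noindex` and `follow`.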
The downside of the robots Meta tag is that it only works for HTML files, not for other file types. If we changed the above example to a PDF instead of a printable version, we couldn’t simply use a Meta tag but would have to rely on robots.txt to prevent the PDF from becoming a duplicate of the article’s page. Well… that’s not entirely true!
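A while ago, Google introduced support for the X-Robots-Tag HTTP header, which allows specifying all values of the robots Meta tag within the HTTP response header instead. Using PHP, this comes down to a single call such as header('X-Robots-Tag: noindex, follow'), made before any output is sent.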
However, if you don’t want to pass all your files through a PHP script, you could also do it directly within the web server’s .htaccess file (please note my post on potential performance implications), as Hamlet Batista showed in a blog post:
SetEnvIf Request_URI "\.pdf$" is_pdf=yes
Header add X-Robots-Tag "noindex, follow" env=is_pdf
Here is what Hamlet says about it: “The first line sets an environment variable if the file requested is a PDF file. […] The second line adds the header only if the environment variable is_pdf (you can name the variable anything you want) is set.”
This is a pretty smart approach, and again: if your PDF files do earn inbound links (and you have embedded a link back to your domain, which you should have anyway!), those links will pass juice back to your domain. The same method can be applied to any file type you want; no limitations here.
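The same idea works outside of PHP and .htaccess as well. As a minimal sketch, assuming a Python application served via WSGI, a small middleware could append the header for selected file extensions. The function name `add_x_robots_tag` and its parameters are hypothetical and only serve this example.

```python
def add_x_robots_tag(app, extensions=(".pdf",), value="noindex, follow"):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag."""
    suffixes = tuple(ext.lower() for ext in extensions)

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def patched_start_response(status, headers, exc_info=None):
            # Only touch responses for the configured file types.
            if path.endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

Wrapping an existing application would look like `application = add_x_robots_tag(application, extensions=(".pdf", ".doc"))`; all other responses pass through unchanged.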
Really make sure you don’t waste inbound link-juice (and maybe other positive factors). Building links is hard and a lot of work, so there’s no need to waste any! And if you do use robots.txt, please consider what side effects might appear.