Robots.txt and meta robots: two seemingly similar ways to achieve a similar task, but in effect they’re significantly different!
Robots.txt was my first introduction to blocking pages from Google. It took a relatively small change to implement and was very simple to explain if you were working with third-party developers. It’s all too easy to use badly, however, and robots.txt became known as the “lazy option”. The obvious knee-jerk reaction? Meta robots (noindex, follow) everything, something made A LOT easier by plugins like Yoast (for those WordPress users out there).
The result? Neither method gets selected for its effectiveness, but rather because it’s the easiest option. Nightmare!
The problem? Many people I’ve worked with never truly understood what problem they needed to solve, and therefore couldn’t select the correct solution.
But it can be done really, really easily.
Crawl or Index Issue?
When deciding how to block/limit Google (or other search engines) from your site, this is really the most crucial question for you to ask. Do you have an issue with crawl budget? Or is too much of your site being indexed?
Put even more simply, is Google seeing too little or too much of your site?
I’ve got to be careful not to oversimplify here: in fact, identifying and fixing these problems is a pretty tricky process.
The above is made worse by the fact it usually impacts very large sites. In an effort to rectify the situation you have to make changes which impact a lot of pages, meaning it takes longer to take effect. This is worse news if you do something wrong – putting it right again will take even longer!
You have been warned.
Is it a Crawl Budget Issue?
Much has been written about this over the last 12-18 months; Dawn Anderson and Barry Adams are good people to start with if you want a more in-depth introduction. In a nutshell, you have a big crawl budget issue when your site is 7-10x larger than what is being crawled daily. Easy enough to explain, but there are complications:
- Google has told us what crawl budget is and acknowledged it is indeed an important thing to be aware of, but is typically sparse on details
- The majority of us don’t understand this as well as we should
- Establishing your own crawl budget is very hard
- Selling the concept of crawl budget to a client is simple… until they start demanding specifics
- Once you identify it, you’re assuming Google crawls the site in a linear fashion from top to bottom. It doesn’t.
- Elements which create issues with crawl budget are sometimes needed: AMP links and hreflang can both eat major crawl budget, but blocking these off defeats the point of using them.
That said, they think you’re making it up as a tall story when you start talking about robots and crawlers. You’re just the daft geek then
— Dawn Anderson (@dawnieando) August 28, 2017
The good news? If you have under 1,000 pages you probably don’t have a problem – Google should be able to handle it.
Establishing Crawl Budget Issues
Spotting where you have a crawl budget issue is relatively straightforward – start a crawl using your favourite software and look at where it gets stuck. If you’ve ever used Screaming Frog or similar, you’ll see this most commonly with layered navigation, filters/sorting options and with some event/calendar functionality.
Log files are one of the most sure-fire ways to determine whether you have an issue or not. You just need a Log File Analyser, Deep Crawl/Botify or some grep skills. Going into the process is beyond the scope of this post (read here if you’re interested) but, as with the above, you need to monitor where Google spends its time on pages it doesn’t need to crawl.
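If you’d rather not reach for a dedicated tool straight away, the idea can be sketched in a few lines of Python. This is a minimal sketch, not a full log analysis: it assumes the common “combined” access log format, the sample lines, IP addresses and URL paths are made up for illustration, and it trusts the user-agent string (which can be spoofed, so a real audit should verify Googlebot separately). It simply counts Googlebot requests per path and splits them into clean and parameterised URLs.

```python
import re
from collections import Counter

# Made-up sample lines in combined log format (an assumption about your server).
SAMPLE_LOG = """\
66.249.66.1 - - [28/Aug/2017:10:00:01 +0000] "GET /shop?colour=red&sort=price HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [28/Aug/2017:10:00:02 +0000] "GET /shop?colour=blue&sort=price HTTP/1.1" 200 498 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [28/Aug/2017:10:00:03 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.9 - - [28/Aug/2017:10:00:04 +0000] "GET /shop HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0)"
"""

# Pulls the requested URL out of the quoted request part of each line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+"')


def googlebot_hits(log_text):
    """Count Googlebot requests per (path, kind), where kind is
    'with parameters' or 'clean'."""
    hits = Counter()
    for line in log_text.splitlines():
        # Naive user-agent check; user agents can be faked.
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path, _, query = match.group("url").partition("?")
        kind = "with parameters" if query else "clean"
        hits[(path, kind)] += 1
    return hits


hits = googlebot_hits(SAMPLE_LOG)
for (path, kind), count in hits.most_common():
    print(f"{count:>3}  {path}  ({kind})")
```

If most of the hits land on parameterised versions of the same path (filters, sorting and so on), that’s the sort of place where Google is spending time it doesn’t need to.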
For the “quick & dirty” way to understand if you need to be worrying about crawl budget, Joost has you covered:
- Look at GSC Crawl stats average pages crawled per day
- Take the number of pages on your site & divide it by the average crawled per day
- If the result is 10, your site has 10x as many pages as Google is crawling per day, which is a pretty big crawl problem
So the above example:
9,781 / 1,466 ≈ 6.7, so the site is nearly 7x the daily crawl.
As a rough guide – if your site is over 1,000 crawled pages and you’re scoring over x5, you’re going to want to investigate further.
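The same sum can be written as a small helper if you want to run it across several sites. This is only a sketch of the rule of thumb above: the function names are my own for illustration, and the thresholds (more than 1,000 pages, a ratio above 5) are the rough guide from this post rather than anything official.

```python
def crawl_ratio(total_pages, avg_crawled_per_day):
    """Site size divided by GSC's average pages crawled per day."""
    if avg_crawled_per_day <= 0:
        raise ValueError("average crawled per day must be positive")
    return total_pages / avg_crawled_per_day


def needs_investigation(total_pages, avg_crawled_per_day,
                        min_pages=1000, ratio_threshold=5):
    """Apply the rough guide: over 1,000 pages and a ratio above 5."""
    ratio = crawl_ratio(total_pages, avg_crawled_per_day)
    return total_pages > min_pages and ratio > ratio_threshold


ratio = crawl_ratio(9781, 1466)
print(f"Crawl ratio: {ratio:.1f}")                      # roughly 6.7
print("Investigate?", needs_investigation(9781, 1466))  # over 1,000 pages and above 5
```

Treat the result as a prompt to dig into crawls and log files, not as a diagnosis in itself.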
Do You Have Index Issues?
This one is altogether easier to diagnose: use search operators (site:, inurl:, intitle: etc.) to see where key, top-level duplication may be taking place.
Also most decent SEO audit tools can get you closer to this – when I’m feeling lazy I usually turn to Siteliner. The real key question here is, is your site bigger than it should be? If it’s a yes and Google’s index is more bloated than it should be, you’ve got an index problem!
Crawl/Index Problems Aren’t Mutually Exclusive
Having made this relatively easy so far, here’s the curve-ball – you can have crawl & index issues at the same time.
Index bloat at scale will undoubtedly cause crawl problems (Google’s crawling more than it should). Yet if you have significant crawl issues, repairing other site issues, like duplicate content, will take longer and the effects will be harder to judge.
Scale is the real killer here: the larger the website, the more one will drive the other. However, I will fix index issues (mostly duplicate/thin content) before crawl issues. The logic here being that I’d rather bring the perceived level of content quality up than make Google crawl worse-quality content quicker.
The above assumes you can’t fix both at once, which comes with a whole big list of “ifs” and “buts”. It also ignores other methods of handling these problems: canonical tags, nofollow, GSC parameter handling & 301 redirects being the most likely candidates.
Putting in the “Right Fix”
If you’ve just got a crawl issue, blocking the offending pages using robots.txt is the simple answer.
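As a rough illustration, here is what a couple of blocking rules might look like, checked with Python’s built-in robots.txt parser before they go live. The paths and the example.com URLs are hypothetical, picked to resemble layered navigation and calendar pages. One caveat: Python’s parser does simple prefix matching and doesn’t understand the wildcard patterns Google supports, so this sketch sticks to plain path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking layered navigation and calendar pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shop/filter/
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Made-up URLs to check against the rules above.
urls = [
    "https://example.com/shop/filter/colour-red",
    "https://example.com/calendar/2017/08",
    "https://example.com/shop/shoes",
]

for url in urls:
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict:>9}  {url}")
```

Running a list of important URLs through a check like this first is a cheap way to make sure you’re only blocking what you meant to block.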
Likewise, if you’ve got a significant index problem, “noindex, follow” using meta robots will be the best way to remove unwanted pages from Google without impacting PageRank flow.
In a practical situation, this means that if, for example, we wanted to fix layered navigation which was causing crawl issues AND index issues, I’d first use meta robots “noindex, follow” to ensure these pages were dropped. After these pages have been dropped, you could then block them off using robots.txt (but not before).
Don’t use robots.txt to block any pages on which you’re using meta robots noindex. If Google can’t crawl the page, it can’t see the meta robots directive and won’t take it into account.
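To make that conflict concrete, here’s a minimal sketch that flags a page carrying a meta robots noindex which is also blocked in robots.txt. The class and function names, the robots.txt rules and the URLs are all made up for illustration; it only looks at the meta tag in the HTML and ignores other ways of sending the directive, such as HTTP headers.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html):
    """True if the page's meta robots tag contains noindex."""
    meta = MetaRobotsParser()
    meta.feed(html)
    return "noindex" in meta.directives


def noindex_is_hidden(robots_txt, url, html, agent="Googlebot"):
    """True if the page says noindex but robots.txt stops the crawler seeing it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(html) and not rules.can_fetch(agent, url)


# Hypothetical page and rules.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
BLOCKING = "User-agent: *\nDisallow: /shop/filter/\n"
NOT_BLOCKING = "User-agent: *\nDisallow: /calendar/\n"
URL = "https://example.com/shop/filter/colour-red"

print(noindex_is_hidden(BLOCKING, URL, PAGE))      # noindex can't be seen
print(noindex_is_hidden(NOT_BLOCKING, URL, PAGE))  # noindex can be seen
```

The first case is the one to avoid: the noindex is on the page, but the crawler is never allowed to fetch it.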
A less frequently used method is to use both together from within robots.txt. Using “noindex:” and “disallow:” rules, in that order, should ensure that the pages are noindexed and then blocked from the crawl. Some believe this is less reliable; others disagree and have seen results.
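The question is the longevity of this method: John Mueller previously advised against it. Many tests have verified that it works (like this one), but I would test and use it in moderation, and always have a “plan B”! Google has since announced that it stopped supporting noindex in robots.txt as of 1 September 2019, so it should no longer be relied on for Google.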
There must be a better way
I’ve been working on a flow chart to help simplify the above even further. Download it and take a look. Start at “do you have a duplication problem?” and follow the path until you hit a circle, which gives the most relevant action for your circumstances.
We’ve done some pretty extensive testing on this. However, every site is different, so if you’ve got any feedback, please let me know!