Find and Fix Common Crawl Optimisation Issues

14th September 2015

When I analyse websites for technical SEO issues, the biggest factor for me is always crawl optimisation – i.e. ensuring that when a search engine like Google crawls the site, only the right pages are crawled and Googlebot doesn’t waste time crawling pages that won’t end up in the index anyway.

If a site has too many crawlable pages that aren’t being indexed, you are wasting Google’s crawl budget. Crawl budget, for those who don’t know, is the number of pages Google will crawl on your site – or, as some believe (myself included), the set amount of time Google will spend trying to crawl your site – before it gives up and goes away.

So if your site has a lot of crawl waste, there is a strong likelihood that not all pages on your site will be crawled by Google. And that means that when you change pages or add new pages to your site, Google might not be able to find them any time soon. The negative repercussions for your SEO efforts should be evident.

How do you find out if your site has crawl optimisation issues? Google’s Search Console won’t tell you much, but fortunately there are tools out there that can help. My preferred tool to identify crawl optimisation problems is DeepCrawl. With a DeepCrawl scan of your site, you can very quickly see if there are crawl efficiency issues:

DeepCrawl crawl report

The screenshot above is from DeepCrawl’s main report page for a site crawl. As is evident here, this site has a huge crawl optimisation issue: out of nearly 200k pages crawled, over 150k are not indexable for various reasons. But they’re still crawlable. That means Google will waste an awful lot of time crawling URLs on this site that will never end up in its index – a dangerous issue to have on your website.

Optimising crawl budget is especially important on larger websites, where more intricate technical SEO elements come into play, such as pagination, sorted lists, URL parameters, etc.

Today I’ll discuss a few common crawl optimisation issues, and show you how to handle them effectively in ways that hopefully won’t cause your web developers a lot of hassle.

Accurate XML Sitemaps

One of the things I like to do when analysing a site is take the site’s XML sitemap and run it through Screaming Frog. While the Search Console report on a sitemap can give you good information, nothing is quite as informative as actually crawling the sitemap with a tool and seeing what happens.

Recently when analysing a website, the Search Console report showed that only a small percentage of the submitted URLs were actually included in Google’s index. I wanted to find out why, so I downloaded the XML sitemap and ran a Screaming Frog crawl. This was the result:

XML Sitemap with 301 redirects

As it turns out, over 90% of the URLs in the XML sitemap resulted in a 301 redirect. With several thousand URLs in the sitemap, this presented quite a waste of crawl budget. Google uses the URLs in the sitemap to seed its crawlers, which then have to do double the work: request the original URL, receive a 301 redirect response, and then request the redirect’s destination URL. Multiply that by several thousand URLs and the waste is obvious.

Looking at the redirected URLs in Screaming Frog, the root cause quickly became clear: the sitemap contained URLs without a trailing slash, and the website was configured to redirect these to URLs with the trailing slash.

So a URL like http://www.example.com/category/product redirected to http://www.example.com/category/product/. The fix is simple: ensure that the XML sitemap contains only the trailing-slash versions of these URLs.
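If the sitemap is generated programmatically, the trailing slash can be enforced at that point. Here’s a minimal Python sketch of that idea, assuming this site’s URLs all resolve in their trailing-slash form (the URL is the hypothetical example from above):

def with_trailing_slash(url):
    # append a trailing slash unless the URL already ends in one
    return url if url.endswith("/") else url + "/"

print(with_trailing_slash("http://www.example.com/category/product"))
# http://www.example.com/category/product/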

The key lesson here is to make sure that your XML sitemaps contain only final destination URLs, and that there’s no waste of crawl budget with redirects or non-indexable pages in your sitemap.
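If you want to spot-check a sitemap without running a full Screaming Frog crawl, a short script can request every URL it contains and report anything that doesn’t return a 200. This is a rough sketch, assuming a standard <urlset> sitemap and the Python requests library; the sitemap URL is hypothetical:

import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://www.example.com/sitemap.xml"  # hypothetical
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL).content)

for loc in root.findall(".//sm:loc", NS):
    url = loc.text.strip()
    # don't follow redirects - we want the first response a crawler would get
    # (some servers refuse HEAD requests; switch to requests.get if needed)
    response = requests.head(url, allow_redirects=False)
    if response.status_code != 200:
        print(response.status_code, url, "->", response.headers.get("Location", ""))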

Paginated & Sorted Lists

A common issue on many ecommerce websites, as well as on news publishers with large amounts of content, is paginated listings. As users of the web, this is something we have become almost desensitised to: endless lists of products or articles which, in the end, don’t make finding what you’re looking for any easier; we end up using the site’s internal search function more often than not.

For SEO, paginated listings can cause an array of problems, especially when you combine them with different ways to sort the lists. For example, take an ecommerce website that in one of its main categories has 22 pages worth of products.

22 Pages

Now, this large list of products can be sorted in several different ways: by price, by size, by colour, by material, and by name. That gives us five ways to sort 22 pages of products. Each of these sort orders generates its own set of 22 pages of content, each with a slightly different URL.

Then add in the complication of additive filters – so-called faceted navigation, a very common feature on many ecommerce sites. Each filter generates its own list of anywhere from one to 22 additional pages, and each filtered list can also be sorted in five different ways. Because filters can be combined, a handful of them makes the number of crawlable pages grow exponentially. You can see how one product category can easily result in millions of URLs.
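A quick back-of-the-envelope sum shows the scale. The filter count below is an assumption, and in practice not every filter combination will produce a full 22 pages, but the order of magnitude is the point:

pages = 22          # paginated pages in this one category
sort_orders = 5     # price, size, colour, material, name
filter_values = 20  # assumed number of filter options across the facets

# every combination of filters can produce its own sortable, paginated list
filter_combinations = 2 ** filter_values
print(filter_combinations * sort_orders * pages)  # 115,343,360 crawlable URLs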

Obviously we’ll want to minimise the number of pages Google has to crawl to find all the products on the site. There are several approaches to do this effectively:

Increase the default number of products/articles per page. Few things grind my gears as much as a paginated list with only 10 products on a page. Put more products on a single page! Scrolling is easy – clicking on a new page is harder. Less clicking, more scrolling. Don’t be afraid to put 100 products on a single page.

Block the different sorted pages in robots.txt. In most cases, a different way to sort a list of products is expressed through a parameter in the URL, like ‘?order=price’ or similar. Prevent Google from ever crawling these URLs by blocking the unique parameter in robots.txt. A simple disallow rule will prevent millions of potential pages from ever being crawled:

User-agent: *
Disallow: /*order=price*

This way you can block all the unique parameters associated with specific ways to sort a list, thereby massively reducing the number of potentially crawlable pages in one fell swoop. Just be careful you don’t inadvertently block the wrong pages from being crawled – use Google’s robots.txt tester in the Search Console to double-check that the regular category pages, as well as your product pages, are not blocked from being crawled.
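As a quick sanity check alongside the Search Console tester, you can approximate the wildcard matching with a small script. This is only a rough approximation of how Google interprets * and $ (the function name is mine), so treat the robots.txt tester as the final word:

import re

def blocked(disallow_pattern, url_path):
    # translate robots.txt wildcards: * matches anything, $ anchors the end
    regex = re.escape(disallow_pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, url_path) is not None

print(blocked("/*order=price*", "/category/?order=price"))  # True - blocked
print(blocked("/*order=price*", "/category/product/"))      # False - still crawlable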

Excessive Canonicals

Ever since the advent of the rel=canonical tag, SEOs have used it enthusiastically to ensure that the right pages are included in Google’s index. Personally I like the canonical tag too, as it can solve many different issues and prevent other problems from arising.

But there’s a downside to rel=canonical: it’s almost too easy. Because a rel=canonical tag is so simple to implement, many SEOs reach for it as a blunt instrument without realising the true repercussions. It’s like using a hammer for all your DIY when sometimes you should actually use a screwdriver instead.

Take the pagination issue I described above. Many SEOs would not consider using a robots.txt block or increasing the number of items per page. Instead they’d just canonicalise these paginated listings back to a main category page and consider the problem solved.

And from a pure indexing perspective, it is solved; the canonical tag will ensure these millions of paginated pages will not appear in Google’s index. But the underlying issue – the massive waste of crawl budget – is entirely unaffected. Google still has to crawl these millions of pages, only then to be told by the rel=canonical tag that no, actually, you don’t need to index this page at all, see the canonical page instead, thanks for visiting, kthxbye.

DeepCrawl non-indexable pages

Before implementing a rel=canonical tag, you have to ask yourself whether it actually addresses the underlying issue, or whether it’s a slapdash fix that serves as a mere cosmetic cover-up for the problem. Canonical tags only work if Google crawls the page and sees the rel=canonical there, which means they will never address crawl optimisation issues. Canonical tags are for index issues, not crawl issues.
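To make that concrete: for Google to act on the canonical, it first has to fetch the page, by which time the crawl budget is already spent. A rough sketch of that sequence, using a hypothetical sorted-and-paginated URL and a simplified extraction of the tag:

import re
import requests

url = "http://www.example.com/category/?order=price&page=7"  # hypothetical
response = requests.get(url)  # the crawl budget is spent right here

# simplified extraction; assumes rel appears before href in the link element
match = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', response.text)
if match:
    # only now does the crawler learn it shouldn't index this URL
    print("Canonical target:", match.group(1))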

In my Three Pillars of SEO approach, the first Technology pillar aligns with Google’s crawl process, and the second Relevance pillar aligns with the search engine’s indexer process. For me, canonical tags help solve relevance issues, by ensuring identical content on different URLs does not compete with itself. Canonical tags are never a solution for crawl issues.

Crawl issues are addressed by ensuring Google has less work to do, whereas canonicals generate more work for Google; to properly react to a canonical tag, Google has to crawl the duplicate URL as well as the original URL.

The same goes for the noindex meta tag – Google has to see it, i.e. crawl it, before it can act on it. It is therefore never a fix for crawl efficiency issues.

In my view, crawl issues are only truly solved by ensuring Google requires less effort to crawl your website. This is accomplished by an optimised site architecture, effective robots.txt blocking, and minimal wastage from additional crawl sources like XML sitemaps.

Just The Start

The three issues above are by no means the only technical SEO elements that impact on crawl efficiency – there are many more relevant aspects, such as load speed, page weight, server responses, etc. If you’re attending Pubcon this year, be sure to catch the SEO Tech Masters session on Tuesday where I’ll be speaking about crawl optimisation alongside Dave Rohrer and Michael Gray.

I hope that was useful, and if you’ve any comments or questions about crawl optimisation, please do leave a comment or catch me on Twitter: @badams.


Written By
Barry Adams is the chief editor of State of Digital and is an award-winning SEO consultant delivering specialised SEO services to clients worldwide.