A Technical SEO Guide to Crawling, Indexing and Ranking
During the holiday weeks we will be showing you the 15 best-read posts of 2013. Except on Christmas Day and New Year’s Day, you can read one of the best articles again each day, going from number 15 back to number 1.
Now it’s time for number 7, originally posted on February 6 2013, a post by Paddy Moogan.
Technical SEO can often be brushed aside a bit too easily in favour of things like content creation, social media and link building. However, I’ve always believed that there are many opportunities for increasing traffic by looking inwards rather than outwards. One of the biggest areas for me is making sure that your website is as accessible as possible to the search engines.
It’s quite simple really: if the search engines can’t crawl your website efficiently, you’re unlikely to rank. Even links and social shares won’t solve severe accessibility issues, so the knock-on impact is that your link building will look ineffective. This is the last thing you want because link building can be hard anyway; you don’t want to cripple yourself before you’ve even started.
So in this post I’m going to talk through some of the key areas you need to think about when it comes to making your website accessible. An accessible website means that all target pages will be indexed and have the opportunity to rank for your target keywords.
To help keep things in a logical structure, I’ve divided the post into three main areas: crawling, indexing and ranking.
The first thing we need to look at is to make sure that all of our target pages can be crawled by the search engines. I say “target pages” because there will be occasions when you may want to actively stop certain pages being crawled, which I’ll cover shortly.
Before that, let’s look at how we make our website crawlable and how to look for potential problems.
Good site architecture
A good website architecture is not only good for search engines, it is good for users too. Put simply, you want to make sure that your most important pages are easy to find, ideally within a few clicks of the homepage. This works well for a couple of reasons:
- Usually, your homepage is the most linked to and therefore can flow a lot of PageRank throughout the rest of the site
- Users will be able to find your key pages quickly – increasing the likelihood of them finding what they want and converting into customers
In terms of what this actually looks like, a simple structure will be like this:
If you own an ecommerce website, the detail pages in this example would be your product pages. This is quite a logical structure and one that is recommended for small to medium size websites. The only other major consideration here is to make sure that you map keyword research to this structure which you can read about in this post by Stephanie Chang and this post by Richard Baxter.
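To make "within a few clicks of the homepage" measurable, here is a minimal Python sketch that works out the click depth of each page with a breadth-first search. The `site_links` map and its URLs are hypothetical; in practice you would build the map from a crawl of your own site, and the threshold of three clicks is only an assumption.

```python
from collections import deque


def click_depths(links, homepage):
    """Return the fewest clicks needed to reach each page from the homepage."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link map: homepage -> categories -> detail pages
site_links = {
    "/": ["/jackets/", "/boots/"],
    "/jackets/": ["/jackets/waterproof-jacket"],
    "/boots/": ["/boots/hiking-boot"],
}

depths = click_depths(site_links, "/")

# Pages more than three clicks deep (an assumed threshold) may deserve a closer look
deep_pages = [url for url, depth in depths.items() if depth > 3]
```

Pages that never appear in `depths` are not reachable through internal links at all, which is usually a bigger problem than being deep.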
But what if you have a website with millions of pages? Even with a good category structure, your key products may end up being quite far away from the homepage. In this case, you may want to consider implementing a faceted navigation. Faceted navigation adapts itself to what the user is looking for and eliminates a lot of the noise by letting them easily filter to find what they want. The best thing to do here is to show you an example of what I mean. Luckily, the guys at Madgex wrote this good article and created this example which visualises things very well:
As you can see, it is very easy for a user to filter a large set of results very quickly by clicking on the attributes they’re looking for. This is a great technique for ecommerce websites in particular because there are usually all sorts of product attributes like size, colour, brand etc that you may want to let the user filter by. An example of this in action with another type of website is Reed. When you go into a category, such as Accountancy, the filters on the left hand side are customised to that category so you can filter by Accounts Assistant or Credit Controller. You wouldn’t get these options if you visited the Marketing category. This is how large websites can make it easy for users and search engines to get to deep pages quickly.
In terms of crawling, there is one other thing to note here. Sometimes, you may not want the search engines to be able to crawl deeply and find pages that have too many attributes. For example, let’s say we sold outdoor clothing and were focusing on jackets. A jacket may have the following attributes:
Now we know from the AdWords keyword tool that a keyword such as “mens waterproof jackets” has a decent amount of search volume. Therefore we do want to have a page that the search engines can crawl, index and rank for this keyword. So we’d make sure that this is possible through our faceted navigation by making the links clean and easy to find.
On the other hand, a keyword such as “black mens large waterproof jacket under £100” does not have a lot of search volume. So we may want to stop the search engines from being able to crawl and index this page. But obviously we’d still want to make it available for users when they use our navigation.
Why worry about this? The concept of crawl budget or crawl allowance comes into play here and is what I’ll discuss in the next section. Below I’ll also discuss how we may stop certain pages like this from being crawled and indexed.
According to Matt Cutts, Google assign a crawl budget to each domain they crawl, and this budget is roughly determined by the amount of PageRank you have. Whilst Google want to find as much content as they can, they only have a certain level of resource to crawl an ever expanding web. So they need to prioritise and be a bit selective so that they at least make sure they crawl as much of “the good stuff” as possible. Matt points out that there isn’t really a cap on the number of pages they will index from a single domain though. I interpret his comments as saying Google will crawl and index as much of your website as they can, but if your PageRank is not that high, it may take them a while to get through everything and find the deeper pages on your website.
Controlling the crawl
We know that you can build more quality links into your website, which can help with your PageRank; this is something that we need to do anyway. But you can also work to optimise your crawl budget by taking a few steps to gently nudge Google in the right direction when they crawl:
- Add the rel="nofollow" attribute to links to pages you do not want crawled
- Block certain sets of pages in your robots.txt file to stop Google crawling them
The goal of all of this is not to control PageRank, but to try to control which pages your crawl budget gets used on. It is a waste if Google is using all of its crawl budget on pages that you don’t really care about. You’d much rather they spent time crawling the pages that you want to rank well and that may be updated more often. A by-product, though, should be that PageRank flows to your important pages and the ones that you want to rank well. However, this should be achieved anyway through having a good website architecture.
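Before relying on a robots.txt rule, it can help to test which URLs it actually blocks. The sketch below uses Python's standard `urllib.robotparser` for that. The rule, the `/jackets/filter/` path and the example.com URLs are assumptions made up for illustration, and this parser only does simple prefix matching, so it may not mirror every detail of how Google reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks deep faceted-navigation URLs
robots_txt = """\
User-agent: *
Disallow: /jackets/filter/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A page we want crawled, indexed and ranked
allowed = parser.can_fetch(
    "Googlebot", "https://www.example.com/jackets/mens-waterproof/"
)

# A low-volume combination of filters that we would rather keep out of the crawl
blocked = parser.can_fetch(
    "Googlebot", "https://www.example.com/jackets/filter/black-large-under-100/"
)
```

Here `allowed` comes back `True` and `blocked` comes back `False`, so only the filtered URL is kept away from crawlers.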
As well as rel=”nofollow” and robots.txt, you can use META tags to control how Google crawls your website. These are placed in the <head> section of your page and can do a number of things including:
- Tell Google not to index the page
- Tell Google not to follow any links on the page
- Tell Google not to index images on the page
- Tell Google not to use a snippet from the page in their search results
- Combinations of the above
Remember that these tags are page level and only affect the page itself. Another important thing to remember here is that the search engines need to be able to access the page itself in order to see this tag. So if you block a page in robots.txt, the search engines will probably never crawl the page and find the META tag that is on it.
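To audit which pages carry these tags, something along these lines could pull the directives out of a page's HTML. It is a minimal sketch using Python's built-in `html.parser`; the `RobotsMetaParser` class name and the sample page are hypothetical, and a real audit would fetch the HTML from your own site first.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


# Hypothetical page that should stay out of the index
page_html = """
<html><head>
<title>Black mens large waterproof jacket</title>
<meta name="robots" content="noindex, nofollow">
</head><body></body></html>
"""

meta_parser = RobotsMetaParser()
meta_parser.feed(page_html)
directives = meta_parser.directives
```

For the sample page, `directives` contains `noindex` and `nofollow`. A search engine can only act on them if the page is not also blocked in robots.txt.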
A quick recap and clarification here:
- The rel="nofollow" attribute, when used on an individual link, will only affect how Google treats that link; it won’t affect others on your website
- The META robots nofollow tag is page level and will affect all links on the page where it is placed
- The robots.txt file can affect either individual pages, sections of a website or the entire website
What about the rel=canonical tag? This allows website owners to specify the canonical version of a page and highlight duplicate or near duplicate content, giving the search engines a signal of what pages they may and may not want to crawl, index and rank. At this point you should note that this tag isn’t a directive, meaning that the search engines can choose how to treat the tag and have the right to ignore it if they choose.
This tag can help you ensure that duplicate content is not going to hurt your website and can make sure that the correct URL is shown to users in search results. In terms of crawling, it would also make sense for the tag to nudge the search engines away from crawling duplicate pages as often. But similar to the META robots tag, the search engines must be able to access a page before they can find the tag.
If you want to get a bit more advanced with finding out how the search engines are crawling your website and to find problems, you can take a deep dive into your server log files to get more detailed information. Your server log files will record when pages have been crawled by the search engines (and other crawlers) as well as recording visits from people too. You can then filter these log files to find exactly how Googlebot crawls your website, for example. This can give you great insight into which pages are being crawled the most and, importantly, which ones do not appear to be crawled at all.
This is probably one of the best indicators you can get of what is stopping a page from being indexed and ranking well. You can do all sorts of on-site analysis but ultimately, if you can clearly see from server logs that a page isn’t being crawled, you have your answer. Then you can start to work out where the problem may be and work on a solution.
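If you would rather start with a script than a dedicated tool, the filtering can be sketched roughly as below. It assumes your server writes the common "combined" access log format and it identifies Googlebot by user agent string alone, which can be spoofed. The log lines and target URLs are invented for illustration.

```python
import re
from collections import Counter

# Assumes the combined log format: IP, identity, user, [time], "request",
# status, size, "referrer", "user agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per path where the user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


# Invented sample lines; in practice, read these from your own access log
sample_log = [
    '66.249.66.1 - - [06/Feb/2013:10:00:00 +0000] "GET /jackets/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [06/Feb/2013:11:30:00 +0000] "GET /jackets/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [06/Feb/2013:12:00:00 +0000] "GET /jackets/mens-waterproof/ HTTP/1.1" 200 7300 "-" "Mozilla/5.0 (Windows NT 6.1) Chrome/24.0"',
]

# Hypothetical list of pages we want to rank
target_pages = ["/jackets/", "/jackets/mens-waterproof/"]

hits = googlebot_hits(sample_log)
never_crawled = [page for page in target_pages if hits[page] == 0]
```

In this sample, `/jackets/mens-waterproof/` was visited by a person but never by Googlebot, so it ends up in `never_crawled`.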
I like to use Splunk for log file analysis. It does take a bit of time to get used to, but it is certainly one of the better pieces of software that I’ve used. Dave Sottimano wrote a great tutorial that involved Splunk over on SEOmoz. If you prefer something else, try Craig Bradford’s post on Distilled on using the command line for log file analysis.
This image from Dave’s post gives us a perfect example of Google wasting crawl budget on pages we don’t care about:
Once you’re happy that the search engines are crawling your website correctly, it is time to monitor how your pages are actually being indexed and actively monitor for problems.
The easiest way to check that Google is indexing a page correctly is to check the cached version and compare it to the actual version. There are three ways you can do this quickly.
1. Run a Google search:
2. Click through from Google search results:
3. Use a bookmarklet
I use a simple bookmarklet in Chrome to check the cache of the page that I’m on. Create a new bookmark in your browser and add this as the location:
Nice and simple!
The goals of checking the page cache are:
- Check that a page is being cached regularly
- Check that the cache contains all your content
If these are ok, then you know that a certain page is being crawled and indexed fine.
Using XML sitemaps to monitor indexation has been written about a few times before, so I won’t repeat everything here; there is also this great case study on SEOmoz regarding the use of XML sitemaps. To summarise briefly though, the idea is that by creating several sitemaps for different parts of your website, you can monitor indexation using Google Webmaster Tools:
This is taken from my own account and shows three sitemaps that I’ve submitted to Google. Whilst these numbers are quite small, you can see how this can help you spot problems quite easily.
Another nice feature of Google Webmaster Tools is index status. This gives some insight on how Google is crawling and indexing your website as well as giving you an idea of how many pages Google are choosing not to index. Here is an example from my own account:
The spike in the green line shows when Google increased the number of URLs that it has classified as “not selected”. This means that Google has found the pages to be very similar to other pages, or that they redirect elsewhere. I spent a bit of time looking at this and believe that the cause was a faulty plugin that caused lots of duplicate URLs to be linked to, but I’ve been lazy as it is my own site and not fixed it yet!
If you’re constantly adding new pages to your website, seeing a steady and gradual increase in the pages indexed probably means that they are being crawled and indexed correctly. On the other hand, if you see a big drop that you weren’t expecting, it may indicate problems and that the search engines are not able to access your website correctly.
Now the final part of our work, arguably the bit that we all truly care about the most! Are our pages ranking as well as they could be? We are always working to get our pages ranking higher than they already are, so I wanted to focus on a very specific tactic instead.
Firstly you need to find out how many pages you are trying to get traffic to. This will probably be your homepage, categories, products and content pages. There are a few ways you can do this, depending on your website setup:
- Look at the number of URLs in your sitemap (relies on sitemaps being updated and accurate)
- Speak to your developers who should be able to give you a rough idea
- You could also crawl your website but this relies on all pages being accessible in the first place
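For the sitemap option, counting the URLs can be scripted. This is a minimal sketch that assumes a standard sitemaps.org `<urlset>` file (not a sitemap index); the sample sitemap and its example.com URLs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap; in practice, load the file from your own site
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/jackets/</loc></url>
  <url><loc>https://www.example.com/jackets/mens-waterproof/</loc></url>
</urlset>
"""

target_urls = sitemap_urls(sitemap_xml)
target_page_count = len(target_urls)
```

The count is only as reliable as the sitemap itself, so it is worth comparing it against the number your developers give you.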
Once you have this number, you need to check how many of these pages are getting organic traffic. You can do this using Google Analytics.
The rough and ready way to do this is to go to this report:
Make sure you filter by organic search only, choose a big date range (at least six months) and then scroll down to see how many pages received entrances:
If this number is significantly lower than the number of pages you actually have, then you’re probably losing out on a lot of potential traffic.
If you want a more accurate idea and actually want to see which pages are not getting visits, you can export the list of URLs from analytics into a CSV, then compare them to the list you have of all pages. A simple VLOOKUP will tell you which pages are and aren’t getting traffic.
Once you have a list of pages that are not getting traffic, you can take a closer look into why. There are a few ways you can do this, using what we looked at above:
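If spreadsheets aren't your thing, the same comparison can be scripted. The sketch below assumes the analytics export has columns named `Landing Page` and `Entrances`; those column names and the sample data are hypothetical, so adjust them to match your own export.

```python
import csv
import io


def pages_without_traffic(all_pages, analytics_csv):
    """Return pages from all_pages with no organic entrances in the export."""
    reader = csv.DictReader(analytics_csv)
    pages_with_entrances = {
        row["Landing Page"] for row in reader if int(row["Entrances"]) > 0
    }
    return [page for page in all_pages if page not in pages_with_entrances]


# Hypothetical full list of target pages (e.g. taken from your sitemap)
all_pages = ["/", "/jackets/", "/jackets/mens-waterproof/", "/boots/"]

# Hypothetical analytics export, filtered to organic search only
export = io.StringIO(
    "Landing Page,Entrances\n"
    "/,1200\n"
    "/jackets/,340\n"
    "/boots/,0\n"
)

no_traffic = pages_without_traffic(all_pages, export)
```

With this sample data, `no_traffic` holds `/jackets/mens-waterproof/` (missing from the export altogether) and `/boots/` (listed with zero entrances).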
- Create a dedicated sitemap containing just these URLs and measure how Google is indexing them
- Filter your server logs to include these URLs and see if they are being crawled
- Take samples and check the cache to see if they are being cached
That’s about it for this post. If you have any questions, feel free to put them in the comments and I’ll do my best to answer.