Search engines depend on search robots, or crawlers, to find and collect the information on the web that they present in their search results. These robots, however, have limited capacity and simply cannot find all the available information on the web all the time. Especially for large content websites it can be difficult to get the most important and latest content into the search engines' index. You want to influence the way robots crawl your website so they focus on the most important content. But how do you do this?
Crawling the web
First of all, you need to know how search robots crawl the web. Google's crawl process begins with a list of web page URLs generated from previous crawls. From there, the crawlers index these pages and follow the links on them.
Crawlers behave much like a web browser: they request a web page from the server, download it and send it to the Google index. They discover new pages by adding the links they find on each page to the list of pages to crawl. Links are the basis for all crawling and therefore key to directing crawlers.
Finding your website
So how do you direct crawlers to your important pages? First of all, you need to know where the crawlers will most likely enter your website. When entering your website they are following links as well: they enter your site on the pages with inbound links from external pages. Because most websites have the greater share of inbound links pointing to their homepage, crawlers will most likely start crawling your site on this page. The first important way to direct crawlers to your other important pages is therefore to point inbound links directly at those pages. Want a key page in the index? Link to it from an external source. Can't do that? At least link to it from your homepage.
Crawling your website
Now we know crawlers will most likely start crawling from your homepage. But which links will they follow first? Basically, the crawlers add the links they see first to the top of the list of pages to crawl.
In research on the Googlebot, Rolf Broer found some interesting factors influencing the likelihood of a link being crawled. One important factor he found evidence for was the length of the URL: the shorter the URL, the more important the crawler deems it to be. This also affirms Google's statement that parameters shouldn't be part of URLs.
Other findings from this research show that adding semantics like headings (<h1> – <h6>) around a link doesn't influence the likelihood of the link being followed. Links in breadcrumbs, however, tend to be mostly ignored during a crawl, possibly because the crawler assumes it already crawled those pages and prefers to find deeper ones.
Crawl frequency and depth
Crawl frequency and depth are determined by Google itself; there is no real way to instruct Google how often or how deep to crawl your website. However, there are factors that influence Google's crawl frequency and depth.
Quoting Matt Cutts: “The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.” (Source: Stonetemple.com). The previously mentioned research on the Googlebot confirmed this statement.
In that same research they also found that submitting a sitemap had a major influence on how many and which pages were crawled: “Googlebot placed the pages which were added to Google by sitemap on top of the crawl queue. [..] But what’s really remarkable is the extreme increase in crawl rate.”
The quality of the pages found also influences the number of pages crawled on a domain. Matt Cutts: “If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site.” Low quality in this case could mean pages with little or no content, pages consisting mainly of links, or duplicate content.
Instructions for crawling your website
Besides the algorithmic rules crawlers follow when crawling your website, there are a few ways to instruct them directly.
HTTP headers are the main way to communicate with crawlers. For every request a crawler makes, the server returns an HTTP status code. The status code tells the crawler something about the request: a 404 status says the page cannot be found, a 301 says the page has moved permanently to a different URL. It's a clear way to communicate how a crawler should handle the specific page requested. It's even possible to tell a crawler whether a page has been updated since its last visit, using the proper response to the If-Modified-Since HTTP header sent by the crawler. When the server returns a “304 Not Modified” status, the crawler can use its resources to index another page.
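As a sketch of the server side of this exchange, the function below compares the crawler's If-Modified-Since header against the page's last modification date and decides between a 200 and a 304 response. The function name and dates are illustrative, not tied to any particular framework:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime, format_datetime

def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the crawler's last
    visit, otherwise 200 so the crawler re-downloads the page."""
    if if_modified_since:
        try:
            client_time = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable header: just serve the full page
        if last_modified <= client_time:
            return 304  # Not Modified: crawler keeps its cached copy
    return 200

# Page last changed May 1st; the crawler last visited June 1st.
page_updated = datetime(2011, 5, 1, 12, 0, tzinfo=timezone.utc)
crawler_header = format_datetime(datetime(2011, 6, 1, tzinfo=timezone.utc))
print(conditional_status(crawler_header, page_updated))  # 304
print(conditional_status(None, page_updated))            # 200
```

Most web servers and frameworks handle this automatically for static files; for dynamically generated pages you typically have to implement the check yourself.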
Robots.txt and meta robots
Sometimes you don’t want certain pages to be crawled (or indexed). There are multiple ways to instruct crawlers which pages they may crawl and/or index. First of all there’s the robots.txt file. Robots.txt is the gatekeeper for crawler access to your website: in this file you can specify which folders or URLs the crawlers may not crawl. This is a great way to steer crawl capacity from unimportant pages to more important ones. Because crawlers will not open pages blocked by robots.txt, they will also not find pages linked from those pages.
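A quick way to sanity-check your rules is Python's built-in robots.txt parser. The blocked paths in this example are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block internal search results and print views
# for all crawlers. The paths are placeholders for illustration.
robots_txt = """\
User-agent: *
Disallow: /search/
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Googlebot", "/search/results"))   # False
print(parser.can_fetch("Googlebot", "/products/widget"))  # True
```

Checking rules this way before deploying them helps avoid accidentally blocking the important pages you are trying to steer crawl capacity towards.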
Another way to instruct robots is with the meta robots tag. With this tag you can specify for every page whether it should be indexed or its links followed. As these terms already suggest, it doesn't keep crawlers from crawling the page; they still need to crawl it occasionally to check whether the tag has changed. Setting the meta robots tag to nofollow, however, stops the crawler from following the links on that specific page. The same can be accomplished for individual links by adding the rel="nofollow" attribute to them in the content.
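In HTML, both variants look like this (the login link is just an example of a page you might not want to pass link value to):

```html
<!-- In the <head>: index this page, but don't follow its links -->
<meta name="robots" content="index, nofollow">

<!-- Or per link: crawlers won't follow this specific link -->
<a href="/login" rel="nofollow">Log in</a>
```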
An XML sitemap is a list of a site's URLs, meant to help search engine crawlers find the most important pages. Google uses Sitemaps “to learn about your site’s structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future.” There's no real evidence that Google uses the priorities defined in the sitemap to prioritize the pages to crawl. We already discussed the impact of Sitemaps on crawl speed and depth earlier.
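A minimal sitemap following the sitemaps.org protocol looks like this (the example.com URLs and values are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2011-06-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/products/</loc>
    <priority>0.8</priority>
  </url>
</urlset>
```

Only the <loc> element is required; <lastmod>, <changefreq> and <priority> are optional hints.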
Although not often mentioned, Google does use RSS/Atom feeds to discover new content. “Using feeds for discovery allows us to get these new pages into our index more quickly than traditional crawling methods. We may use many potential sources to access updates from feeds including Reader, notification services, or direct crawls of feeds.” (Source: Official Google Webmaster Central Blog)
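For discovery purposes, a bare-bones Atom feed is enough; each new entry points crawlers at a fresh URL (all URLs and dates here are placeholders):

```xml
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Site</title>
  <id>http://www.example.com/</id>
  <updated>2011-06-01T12:00:00Z</updated>
  <entry>
    <title>New article</title>
    <link href="http://www.example.com/new-article"/>
    <id>http://www.example.com/new-article</id>
    <updated>2011-06-01T12:00:00Z</updated>
  </entry>
</feed>
```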
Google Webmaster Tools
Google Webmaster Tools gives site owners the possibility to give some extra instructions to web crawlers. Here you can set the crawl rate, basically telling Google's crawlers that they may make more or fewer requests to your site's server, thereby increasing or decreasing the freshness and number of pages crawled, and the load on your server. It doesn't affect how often a site is crawled.