URLs, Crawling, and PageRank; Fundamentals of SEO

10th April 2018

Those who know me will not be surprised to hear I have a lot of pet peeves; I’m easily annoyed. One of those pet peeves is URLs. Specifically, the lack of respect that URLs are accorded by developers and marketers alike.

URLs are not some peripheral aspect of websites that you don’t have to worry about. Quite the opposite: URLs form the foundation of the web.

Proper appreciation of URLs and their role in making the web work prevents a lot of potential problems with usability and SEO. By understanding what URLs are, and how technology on the web depends on URLs, you’re much more likely to build and optimise websites in a way that enables online success.

So let’s dig into URLs for a bit and explain why these strings in your browser’s address bar are so important, and what you can do to utilise their power for maximum effect.

What Are URLs?

URL stands for Uniform Resource Locator. Basically, a URL is the location of a given resource on the internet. Usually it refers to a webpage, but it can refer to anything published online: documents, images, JavaScript files, etc.

Sometimes you’ll see people refer to URIs instead: Uniform Resource Identifiers. While there is a technical difference between the two, for the purpose of this article we’ll keep things simple and refer to URLs only. We’ll also limit ourselves to URLs on the web, which means we focus exclusively on web addresses that start with http(s).

For example, the URL of this page is:

https://www.stateofdigital.com/urls-crawling-pagerank-fundamentals/

It starts with the protocol (https in this case), followed by the subdomain (www), the domain name (stateofdigital) and the top-level domain (.com). The rest of the URL refers to the location of the webpage on the domain.
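If you want to see those components programmatically, Python’s standard library can take a URL apart for you. This is just a quick illustration using urllib.parse; the naive hostname split at the end assumes a simple www.domain.tld pattern.

from urllib.parse import urlparse

url = "https://www.stateofdigital.com/urls-crawling-pagerank-fundamentals/"
parts = urlparse(url)

print(parts.scheme)  # 'https' - the protocol
print(parts.netloc)  # 'www.stateofdigital.com' - subdomain, domain and TLD
print(parts.path)    # '/urls-crawling-pagerank-fundamentals/' - the location on the domain

# Naive split of the hostname; a robust parser would consult a public suffix list.
subdomain, domain, tld = parts.netloc.split(".")
print(subdomain, domain, tld)  # www stateofdigital com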

For many web developers and marketers, that last bit of the URL is incidental. They don’t really care what it says, as long as the website serves the right content. Some developers even believe a website doesn’t need any URLs at all, and can instead serve all its content from a single URL.

And this is wrong. Why? Because URLs are what make the web work. If we forget that, we end up making huge mistakes in how we design and build websites, and how we can get those websites to attract traffic.

URLs & Content Hierarchy

Let’s start with how URLs should reflect a site’s content and structure. I’ve written before that URLs should be optimised for usability, so I won’t repeat myself too much here.

In summary, URLs should be human-readable. Moreover, they should be hierarchical and reflect a page’s place in the site’s overall structure. When you look at a URL like this, you see exactly where it fits in the overall structure of the site:

https://www.website.com/safety-boots/caterpillar/size12/

This page is the child of the Caterpillar page, which itself is the child of the Safety Boots page. Ideally, you should be able to strip away the last segment of the URL and end up on that page’s parent, like so (there’s a small code sketch after this list):

  • https://www.website.com/safety-boots/caterpillar/ should result in the Caterpillar Safety Boots page.
  • https://www.website.com/safety-boots/ should give you the overall Safety Boots page.
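To make that parent-child idea concrete, here’s a minimal sketch of deriving a parent URL by stripping the last path segment. It’s purely illustrative; the parent_url helper is something I’ve made up for this example, not a feature of any CMS or framework.

from urllib.parse import urlparse, urlunparse

def parent_url(url):
    """Strip the last path segment to get the parent page's URL."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    parent_segments = segments[:-1]
    parent_path = "/" + "/".join(parent_segments) + "/" if parent_segments else "/"
    return urlunparse(parts._replace(path=parent_path))

url = "https://www.website.com/safety-boots/caterpillar/size12/"
print(parent_url(url))              # https://www.website.com/safety-boots/caterpillar/
print(parent_url(parent_url(url)))  # https://www.website.com/safety-boots/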

Too often, URLs and content hierarchy are divorced from one another, and there is no parent-child relationship evident in a page’s URL. This is a mistake.

Good content hierarchy that is reflected in a page’s URL is not only great for usability, it also sends strong semantic signals to search engines. If you have a URL without any hierarchy – or with the wrong hierarchy – it can confuse search engines and send the wrong relevancy signals (or worse, send no signals at all).

When you build a website’s content in a clear, navigable structure, your URLs should follow suit and reflect that structure. By creating strong parent-child relationships through URLs, you are sending powerful signals to search engines which can help your content rank better in organic search results.

URLs and Search Engines

For me, a more interesting aspect of URLs is how search engines engage with them beyond just relevancy signals. The crawling and indexing of content on the web by search engines like Google happens through URLs.

Everything web-based search engines do is built on that fundamental aspect of the web: URLs linking to other URLs.

And the better you understand that aspect of search engines, the more effective you can be at SEO. Let’s start with the very basics of search engine technology: crawling.

URLs and Crawling

A search engine crawler like Googlebot is focused purely on URLs. Almost all it does is retrieve URLs and find new URLs to crawl. The entire crawling process of a web search engine like Google is URL-based.

In screenshots where Google explains the processes that make up its search engine, we see URLs getting their very own box. This is not by accident – the web is made of URLs, so almost everything Googlebot does centres around URLs.

Google Search Engine Pipeline

The various processes that go into Google’s crawler are aimed at optimising the efficiency with which Googlebot crawls the web. There’s a scheduler system that prioritises URLs to be (re-)crawled, and a de-duping system that prevents Googlebot from crawling URLs it believes have the same content as already crawled URLs.

URL Scheduling

One common misconception about Googlebot is that it will try to crawl your entire site one page at a time until it’s done, and then it’ll start over again. This is a very inaccurate picture of how Google crawls your site.

Instead, Googlebot will focus its crawling of your site on the URLs it believes are most important. Important URLs will be crawled and re-crawled more often, while unimportant URLs will be crawled very infrequently.

The importance of a URL depends on many different factors. One of the biggest factors is the URL’s PageRank. The higher the PageRank, the more often Googlebot will crawl the URL.
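To illustrate the idea (and only the idea; this is a toy sketch, not how Google’s scheduler actually works), imagine a crawl queue where a URL’s assumed importance determines how soon it comes up for re-crawling:

import heapq
import time

def build_crawl_queue(urls_with_importance):
    """Toy scheduler: more important URLs get shorter re-crawl intervals."""
    now = time.time()
    queue = []
    for url, importance in urls_with_importance:
        recrawl_interval = 86_400 / max(importance, 0.01)  # seconds until the next crawl
        heapq.heappush(queue, (now + recrawl_interval, url))
    return queue

queue = build_crawl_queue([
    ("https://www.website.com/", 8.0),           # well-linked homepage
    ("https://www.website.com/old-post/", 0.5),  # rarely linked deep page
])
next_due, next_url = heapq.heappop(queue)
print(next_url)  # the homepage comes up for crawling first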

Another factor that impacts crawling is load speed. If a website can serve a lot of URLs in a short amount of time, Googlebot can crawl the site much faster. Conversely, if a website loads each page slowly, Googlebot has to crawl the site at a much slower rate.

This is very clearly demonstrated by this screenshot from Google Search Console. You can see exactly when this particular website was upgraded to a much faster platform that could serve many more pages per second:

Google Search Console crawl stats

When the updated site launched, the crawl rate went from around 50,000 URLs per day to an average of over 200,000 a day. This coincided with a load speed improvement from around 2 seconds per page to just half a second per page.
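Those numbers line up neatly with a back-of-the-envelope calculation. Assuming a single crawl connection working around the clock (a simplification: Googlebot uses multiple parallel connections and adapts its crawl rate), the maths looks like this:

SECONDS_PER_DAY = 86_400

def urls_per_day(seconds_per_page, parallel_connections=1):
    """Rough crawl capacity for a given average response time."""
    return int(SECONDS_PER_DAY / seconds_per_page * parallel_connections)

print(urls_per_day(2.0))  # 43,200 - in the ballpark of the ~50,000 URLs/day before the upgrade
print(urls_per_day(0.5))  # 172,800 - in the ballpark of the ~200,000 URLs/day afterwards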

From a technical SEO perspective, this is one of the main reasons load speed is so important. The improvement to your site’s crawl rate is immensely valuable for SEO, especially on large sites with a lot of URLs that frequently change their content.

URL De-Duplication

As part of the crawling process, Googlebot will try to avoid crawling URLs that it doesn’t believe add value. It’ll try not to crawl URLs that it believes have duplicate content, and has systems in place to de-dupe URLs before they are crawled.

So Google doesn’t always have to look at a page’s content to recognise duplicate content. In a recent Webmaster Hangout, Googler John Mueller described it as follows:

“What sometimes happens is we kind of proactively recognize that something is probably a duplicate, even before crawling it. So this happens when we see that the difference, for example, is within the URL somewhere in a place where we’ve generally noticed that the content shown in this part of the URL is not so relevant to the content that’s shown on the page.

So that could be something like you have a language parameter that you can set to any kind of term, and we might’ve gone through and tried something like “language=English,” “language=French,” “language=German,” … if we find that all these pages show the English content, except for maybe “language=Spanish” that chose the Spanish version, then we might assume that this language parameter is actually irrelevant to this page, and then we might miss that one page that actually has unique content.”

In a nutshell, Googlebot identifies specific URL patterns that indicate duplicate content, and will try not to crawl those URLs.
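A heavily simplified sketch of what such pattern-based de-duplication could look like is below. The set of ‘irrelevant’ parameters is entirely hypothetical and hard-coded here, whereas Google learns these patterns from what it observes while crawling; the point is only that URL variants can collapse to one canonical crawl target without their content ever being fetched.

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical set of parameters the crawler has learned don't change the content.
IRRELEVANT_PARAMS = {"language", "sessionid", "utm_source"}

def canonicalise(url):
    """Drop query parameters believed to be irrelevant to the page's content."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IRRELEVANT_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

a = canonicalise("https://www.website.com/page?language=French&id=42")
b = canonicalise("https://www.website.com/page?language=German&id=42")
print(a == b)  # True - both collapse to the same URL, so only one of them gets crawled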

It’s good to be aware of this aspect of the crawler, so you can prevent accidental de-duplication of URLs by Googlebot when there is genuinely different content.

HTML Parser

One last thing about URLs and crawling I want to discuss is the HTML parsing process of the web crawler. As we know, when a search engine crawls a page it doesn’t do much with the content of the page. When it comes to understanding a page’s content, the heavy lifting is done by the indexing process.

What the crawler does do is extract URLs from a page to add to its crawl queue. Basically, the crawler has an HTML parser that finds and extracts URLs, primarily in anchor tags (i.e. links).
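As a minimal illustration of what ‘finding and extracting URLs from anchor tags’ means in practice (real crawlers are vastly more sophisticated than this), here’s a small parser built on Python’s standard library:

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.found_urls.append(urljoin(self.base_url, value))

page_html = '<p>Browse the <a href="/safety-boots/">Safety Boots</a> category.</p>'
parser = LinkExtractor("https://www.website.com/")
parser.feed(page_html)
print(parser.found_urls)  # ['https://www.website.com/safety-boots/']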

This is why websites that rely heavily on client-side rendering are so problematic for Googlebot and other crawlers.

If there are no URLs to find in a page’s pre-rendered HTML source code, then the crawler has to wait for the indexer to render the page and find URLs. This makes the process of crawling client-side JavaScript websites incredibly cumbersome and inefficient. I’ve written more about that in my article on JavaScript & SEO.

JavaScript HTML source code
Example of a webpage relying on client-side JavaScript. The unrendered HTML source code contains no links for the crawler to extract.

URLs and Indexing

There’s already a lot written about how search engines index pages, so I won’t go into a lot of detail here. Nonetheless there are some under-appreciated aspects of Google’s indexing system that I want to highlight, specifically the PageRanker. This module in Google’s indexing system calculates each URL’s PageRank (PR) based on the quality and quantity of incoming links.

While Google has stopped publicly showing a page’s PageRank, and it’s no longer the pivotal ranking factor it used to be, PageRank still has a very big role to play in Google’s overall search engine processes.

For starters, a page’s PageRank has a big impact on its perceived importance. As I said above, more important URLs are crawled more often, so a good way to get Google to crawl a URL more frequently is to improve its PageRank. Good internal linking and/or getting links from external sites to a URL is a great way to improve the rate at which it is re-crawled.

PageRank isn’t calculated in real-time. It’s a system that periodically runs, looking at the entire link graph of the web and each URL’s place within that graph.
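For a sense of what such a periodic, graph-wide calculation involves, here’s a toy batch PageRank computation over a made-up four-page link graph, using the formula and the d = 0.85 damping factor from the original paper. It’s a sketch of the principle only; the real system obviously works at an entirely different scale and with many refinements.

def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the graph."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = [src for src in pages if page in links[src]]
            new_pr[page] = (1 - d) + d * sum(pr[src] / len(links[src]) for src in incoming)
        pr = new_pr
    return pr

links = {
    "home":        ["boots", "about"],
    "boots":       ["home", "caterpillar"],
    "caterpillar": ["home"],
    "about":       ["home"],
}
print(pagerank(links))  # home, which every other page links to, ends up with the highest PR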

PageRank & JavaScript

Again, this is where client-side JavaScript websites run into a key problem. Because of the difficulty the crawler has in finding all links within a JavaScript website (see above), it’s very common for the PageRanker to work with incomplete link graphs when it comes to JavaScript websites.

Especially on large JavaScript sites, this can be a severe problem. I’ve seen instances of JS-based websites where Googlebot focuses all its crawl effort on a narrow set of pages and ignores most deeper pages.

I believe this is because all the site’s PageRank is concentrated in those URLs that Googlebot already knows exist and has rendered with its Web Rendering Service.

Because of the lag between crawling and indexing on JavaScript websites, by the time the crawler finally gets around to crawling the URLs of deeper pages that have been discovered by the indexer, the crawl scheduler would rather go back to crawling the already known pages. Their PR has already been calculated by the PageRanker, so they have a higher URL importance than the newly discovered URLs.

The end result is a very low rate of indexing on the site. Googlebot does its best, but its own URL scheduling systems don’t allow it to spend crawl effort on deeper URLs that it doesn’t see as having any value.

Only when PageRank is recalculated with the newly discovered links taken into account is there a chance that those deeper URLs are finally seen as having some measure of PR passed through from the site’s higher-level URLs, and that Google will spend effort on crawling and indexing them.

This, for me, lies at the root of the SEO issues that many client-side JavaScript websites are experiencing. The necessity for Google to render each page means they are crawled very inefficiently, and PageRank is not flowing easily through the site. This is evident from a whole range of symptoms such as low indexing levels and poor rankings in search results.

PageRank & XML Sitemaps

XML sitemaps are a funny thing. They exist outside of the web’s link graph, and are an artificial way to prioritise URLs for crawling. I suspect that the way XML sitemaps are handled is actually in line with how Googlebot handles everything it crawls on the web.

We know that including a URL in an XML sitemap is no guarantee the URL will be crawled and indexed. Yet, including a URL in a sitemap does mean it’s seen as more important, increasing the likelihood it’ll be crawled and, hopefully, indexed. Again, it’s that word: important.

My theory (and it’s purely a theory) is that including a URL in an XML sitemap is treated in much the same way as a link to that URL from a ‘neutral’ low PageRank website. By submitting URLs through an XML sitemap, Google will accord those URLs a slightly higher PageRank, and thus prioritise them a bit more for crawling and indexing. It’s a signal with a low volume (hence why often not all URLs in sitemaps are indexed), but it helps Google prioritise the right URLs on a site.

Having said that, nothing beats discovery of URLs through a normal web crawl. If a URL cannot be found by crawling the web, including it in an XML sitemap will not help. The amount of PageRank flowing to that URL from XML sitemaps is minuscule compared to the PageRank flowing through links on the web. If you really want a URL to be crawled and indexed, you need links pointing to it from other URLs – either within your site or from external sites.

Reasonable Surfer

There’s another aspect of PageRank and how it distributes to URLs on the web that I feel is under-appreciated. Many of us will have heard about the Reasonable Surfer patent, which has been around since 2004. In this patent, Google explains how to give more weight to links that are more likely to be clicked on, versus links that are unlikely to be clicked on by a person surfing the web.

Basically, if a link is prominent on a webpage and a user is likely to click on that link, then Google gives that link more weight.

But what does Google mean when they say ‘weight’? This is one of the areas of SEO that’s very misunderstood, and I have my own perspective on this.

In the original PageRank paper published by Larry Page and Sergey Brin in 1998, they described an important aspect of how PageRank is calculated as it flows through the web from link to link. This aspect is the PageRank Damping Factor:

PageRank Damping Factor

“We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85.”
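For reference, the formula from that paper, where C(T) is the number of links going out of page T:

PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )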

What this means is that when Page A links to Page B, only a percentage of Page A’s link value is passed on to Page B. In the original paper, the Google founders set this value to 0.85, i.e. 85%.

Then when Page B links to Page C, the 0.85 factor is applied again, which brings the link value passed on to Page C down to 72.25%. And so on, until all link value has diluted.

This PageRank Damping factor is what I believe Google refers to when they talk about the ‘weight’ of a link. I believe that factor is one of the parameters Google plays with to give more or less weight to links in specific contexts.

So a link that is very prominent and has a high probability of being clicked on will be damped less. For example, such a link might have a damping factor of 0.90, meaning only 10% of Page A’s link value is lost.

And links that have a low probability of being clicked on, such as links in the footer of a page, may be damped much more heavily, losing a lot more link value.

[On a side note, this is why I believe subdirectories are better than subdomains: I think internal links within a single domain are damped less than cross-domain links. Because a subdomain is essentially a separate site, links from a subdomain to the main site are damped more heavily than normal internal links within the same domain.]

PageRank and Redirects

Lastly, I want to discuss how PageRank flows through redirects. On several occasions, Googlers have stated that redirects function in the same way as links. Quoting Matt Cutts a few years ago:

“The amount of PageRank that dissipates through a 301 is currently identical to the amount of PageRank that dissipates through a link.”

Essentially, this means the PageRank damping factor applies to a redirect in the same way as it applies to a link. Many SEOs interpreted this as a reason to stop worrying about redirects, but they were wrong: a link loses PageRank, and so does a redirect.

This is why it’s so important to make sure your internal links don’t result in redirects, because you’re causing extra PageRank to be lost on your site. A direct internal link would be Page A > Page B, and lose only 15% of its link value. When you have a redirect in the middle, you end up with Page A > Redirect > Page B, and you lose roughly twice that amount of PageRank in the process: the original link loses 15% of its value, and then the redirect loses another 15% of what’s left, so only 0.85 × 0.85 ≈ 72% of the original value arrives at Page B.
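Under that interpretation, the loss compounds with every extra hop in a chain; a quick sketch:

def value_passed(hops, damping=0.85):
    """Fraction of link value that survives a chain of links and redirects."""
    return damping ** hops

print(value_passed(1))  # 0.85 - a direct link: Page A > Page B
print(value_passed(2))  # ~0.72 - a link plus one redirect: Page A > redirect > Page B
print(value_passed(3))  # ~0.61 - a link plus a chain of two redirects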

In recent times, Gary Illyes has confused matters by stating that redirects no longer lose any PageRank.

I think Gary is mistaken in this simplistic statement (or at least somewhat inaccurate), mostly because we see that fixing redirects to point directly to final destination URLs tends to result in improved crawling and ranking for those URLs.

Now I admit I could be entirely wrong about this, but when reproducible observations clash with ‘official’ statements, I tend to go with my observations.

It’s All About The URL

While we are starting to see Google rank results without URLs (especially for knowledge graph entries), for the time being URLs will remain the cornerstone of the web. A webpage is essentially a URL with its associated content.

Web search engines like Google are founded on URLs. This is why, when it comes to non-web internet technologies such as smartphone apps, Google still wants to crawl them using URLs. Quoting from their App Indexing documentation:

“To get your app’s content indexed by Google, use the same URLs in your app that you use on your website and verify that you own both your app and your website. Google Search crawls the links on your website and serves them in Search results. Then, users who’ve installed your app on their devices go directly to the content in your app when they click on a link.”

Here, too, it’s all about the URL. When you truly understand that aspect of the web, and how search engines are fundamentally URL processing machines, then everything else will fall into place.

Written By
Barry Adams is the chief editor of State of Digital and is an award-winning SEO consultant delivering specialised SEO services to clients worldwide.