Using PageRank for internal link optimisation

Most marketeers know about PageRank because Google uses it as one of the core concepts for determining the value of external (inbound) links. For many years, this value (a number between 0 and 10) was used to compare the importance of a website and the link quality and ranking potential of a domain or individual page.

Over the years, marketeers could retrieve these PageRank values via APIs or Google’s own Toolbar, but from the beginning of 2013 Google gave sharing this data less priority. In December 2013, the last public update of the “Toolbar PageRank” servers took place.

“PageRank is something that we haven’t updated for over a year now, and we’re probably not going to be updating it again going forward, at least the Toolbar version.”

[Image: search before Google]

PageRank is dead!? A bit of history is what we need…

SEOs declared PageRank dead, just as marketeers tried to do with SEO in general. To understand why I still use PageRank as a metric in my SEO work, we have to go back to basics. PageRank (Wikipedia) is an algorithm developed in 1999 by Larry Page, not coincidentally one of Google’s founders. The purpose of the algorithm is to measure relative importance within a set of linked documents, such as the World Wide Web.

The WWW is just one such set, but the algorithm can be applied to basically any collection of documents that reference each other. The research leading to PageRank was heavily influenced by earlier work on citation analysis. Think of scientific papers, which build on previous research and all link to each other. By applying PageRank to a citation graph, we can calculate which paper or author is the most influential.
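That calculation can be sketched in a few lines. Below is a minimal, pure-Python power-iteration version of the PageRank formula PR(p) = (1-d)/N + d · Σ PR(q)/outdegree(q); the four-paper citation graph is a made-up example, not from the article:

```python
# Minimal PageRank sketch (pure Python), damping factor d = 0.85.
def pagerank(links, d=0.85, iterations=50):
    """links: {doc: [docs it references]}. Returns {doc: score}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start with an even distribution
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each referencing document q passes on pr[q] / outdegree(q).
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# Hypothetical four-paper citation set: paper_d is referenced most often.
citations = {
    "paper_a": ["paper_d"],
    "paper_b": ["paper_a", "paper_d"],
    "paper_c": ["paper_d"],
    "paper_d": ["paper_a"],
}
ranks = pagerank(citations)
# paper_d, cited by three of the four papers, ends up with the top score.
```

The same function works unchanged when the keys are URLs instead of papers, which is exactly the internal PageRank distribution this article is about.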

Twitter uses PageRank to build its lists of personalized follow recommendations. Every reference counts as a vote, just as links count as votes for a certain domain or page. Google publicly used it to score domains and pages, not only on the whole web as a set of documents, but also on smaller datasets, like the set of URLs from one specific domain, resulting in an internal PageRank distribution.

In the early years, the influence of external and internal link value was quite limited, but from 2003-2004 on this became more relevant and a real quick win when dealt with correctly. Once SEOs discovered this, “PageRank sculpting” was born. By using nofollow attributes, tricking Google into not counting specific links, webmasters optimized their internal link value distribution. By adding a nofollow to specific sitewide links like contact forms, login pages and disclaimers, you could prevent those pages from receiving any PageRank. I’ve never seen any clear proof this worked, from a link value perspective.

Guiding Google’s crawlers effectively through your website and optimizing crawl budget does create better results, because Google will not follow the nofollowed links. In 2008, Google updated the original PageRank algorithm to prevent PageRank sculpting:

So what happens when you have a page with “ten PageRank points” and ten outgoing links, and five of those links are nofollowed? “Originally, the five links without nofollow would have flowed two points of PageRank each (in essence, the nofollowed links didn’t count toward the denominator when dividing PageRank by the outdegree of the page). More than a year ago, Google changed how the PageRank flows so that the five links without nofollow would flow one point of PageRank each.”
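The arithmetic in that quote is easy to verify. A tiny sketch using the numbers from the example above (ten PageRank points, ten links, five nofollowed):

```python
# Numbers taken from the quoted example above.
page_pagerank = 10.0
total_links = 10
followed_links = 5  # the other five carry rel="nofollow"

# Pre-2008 behaviour: nofollowed links were left out of the denominator,
# so the remaining links split all ten points between them.
old_flow_per_link = page_pagerank / followed_links   # 2.0 points each

# Post-2008 behaviour: every link counts in the denominator, but the
# nofollowed links pass nothing, so their share simply evaporates.
new_flow_per_link = page_pagerank / total_links      # 1.0 point each
evaporated = page_pagerank - followed_links * new_flow_per_link  # 5.0 points lost
```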

Using internal nofollow links literally vaporizes PageRank. Say thanks to all those smart SEOs who advised you to add sitewide nofollow links to your footer. 🙂 The only ways left to sculpt PageRank were simply not placing links to specific pages, or hiding the links from Google: put disclaimers, login forms etcetera in a dynamic AJAX popover, for example, or use JavaScript to hide links in big faceted search menus. Since Google is getting smarter and smarter, I expect they will be able to render and understand most JavaScript-based scripting within a year or two. Nonetheless, optimising your site architecture and internal linking is still a valuable route towards better SEO rankings!

If you want to know more about Google’s challenges in calculating PageRank for the whole web, have a look at Lectures on the Google Technology Stack 1: Introduction to PageRank. This session was also recorded.

How to get your current internal PageRank distribution?

PageRank is an algorithm that assigns scores based on the relationships between documents: in this case, internal links. The first step is to get all the URLs within a website, including all the internal links. My favorite tool for doing that quickly is ScreamingFrog, which is cheap and scalable.

Crawl the website (only URLs needed) and export with the All Inlinks export function. Crawling a website effectively without launching an unintended DDoS attack is worth an article in itself, but if you want to test a few crawlers, Aleyda Solis made a nice list on The Marketers Toolbox website: SEO Crawlers!

Once you have all your internal link data, you need a tool to calculate or visualize it. So far, I have found Gephi the most reliable and intuitive tool for the job. Check the video below for detailed instructions on how to transform your internal link data sheet into a visual PageRank distribution graph; the software works similarly on both Windows and Mac:

In the graph below I’ve mapped all internal linking; red nodes have the most PageRank and yellow ones the least. As you can see, the website is literally divided into seven clusters of category and product pages without any real cross-linking, resulting in a flat distribution of link value. Ideally, you want the pages targeting the most competitive keywords to receive the most PageRank:

[Image: evenly distributed PageRank graph]

If you have a heavy PC or server available, your dataset is not too big for Excel and your math skills are OK, you could try using Excel to iterate through the PageRank distribution. This is doable for the average website and gives you a nice insight into what happens when you add or remove a sitewide link in your website’s footer.

[Image: PageRank iterations in Excel]

More information about calculating PageRank and using Excel can be found in Maths delivers! Google PageRank (pdf).

If you’re more into scripting with a programming language like PHP, the script shared in this Stack Overflow topic can help you out. I’ve tried it for a few websites, but it does cost you some server capacity, since it needs to loop until the PageRank values are pretty stable, which normally takes in the region of 30-50 iterations per dataset.
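The loop-until-stable approach translates directly to other languages as well. Here is a pure-Python sketch that iterates until no score moves more than a tolerance and reports how many passes that took; the three-page site, the 0.85 damping factor and the 1e-8 tolerance are illustrative assumptions:

```python
def pagerank_until_stable(links, d=0.85, tol=1e-8):
    """Iterate until no page's score changes by more than tol.
    Returns (scores, number_of_iterations)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    iterations = 0
    while True:
        iterations += 1
        new = {
            p: (1 - d) / n
            + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
        # Stop once every score has stabilised within the tolerance.
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new, iterations
        pr = new

# Hypothetical three-page site.
site = {"/": ["/shop", "/blog"], "/shop": ["/"], "/blog": ["/", "/shop"]}
ranks, passes = pagerank_until_stable(site)
```

How many passes you need depends on the tolerance you pick and on the graph itself; a looser tolerance stabilises in fewer iterations.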

If you are using R, try this package: page.rank {igraph}. Want to go for the big sets? Programmer Nayuki shares the Java-based programs used to calculate internal PageRank for Wikipedia (Computing Wikipedia’s internal PageRanks), crunching the values across 11 million pages. Ideal for analysing big e-commerce platforms. The results are interesting to say the least, with the United States page ending up as the third most interlinked page; visit the aforementioned page to view the interactive illustration.

[Image: Wikipedia internal PageRank distribution]

Practical PageRank analysis

By calculating a website’s PageRank distribution and individual URL values, you can easily answer the following questions:

  • What is the effect of adding or removing a link in my main menu?
  • What will adding a breadcrumb element do to the link value distribution?
  • How much is a homepage link worth if my site has 500,000 pages?
  • What will happen to my domain’s PageRank distribution if I add 100,000 product pages?
  • Are the category pages within the webshop getting enough internal links in comparison to static content pages?
  • Can a poor internal distribution prevent Google from indexing all my pages?
  • Which pages should internal link value flow towards, so that Google crawls the pages carrying structured product data (stock, offers, price ranges etcetera) first and most often?
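Most of these what-if questions come down to recomputing PageRank on a modified copy of the link graph and comparing the two distributions. A minimal sketch, assuming a hypothetical three-page site and adding a sitewide footer link to a new /contact page:

```python
def pagerank(links, d=0.85, iterations=100):
    """Plain power iteration; links maps each URL to the URLs it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) / n
            + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

before = {"/": ["/category"], "/category": ["/", "/product"], "/product": ["/"]}
# What-if: add a footer link from every existing page to a new /contact page.
after = {p: dests + ["/contact"] for p, dests in before.items()}
after["/contact"] = ["/"]

contact_gain = pagerank(after)["/contact"]
product_loss = pagerank(before)["/product"] - pagerank(after)["/product"]
# The sitewide footer link siphons link value away from /product.
```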

Launching a new website or migrating to a new platform?
Before launching a new website, I’ll always try to build a text-only version of the new platform, including all the internal linking elements that will be there on the new platform. That way, I can see how the PageRank distribution will compare with the old website.

Using a text-only, database-driven solution, it is easy to scale and adjust before running another crawl and recalculating the ratios between the specific pages you want to rank well. Combine this process with historical SEO performance data like rankings, traffic, search volume and user metrics, add some machine learning algorithms to the system, and you’ll be the smartest and fastest SEO-ninja (not a fan of these characterizations, but be real, this is cool stuff!) around the office!

Alternatives to manual PageRank calculations

Since the launch of cloud based crawlers, there are multiple options you can use which generate similar metrics:

1. DeepCrawl: this UK based startup is using DeepRank: “DR is a measurement of internal link weight calculated in a similar way to Google’s basic PageRank algorithm. Each internal link is stored and, initially, given the same value. The links are then iterated a number of times, and the sum of all link values pointing to the page is used to calculate their respective DeepRank. This process produces a numerical value between 1 and 10, letting you identify the pages and problems that indicate the greatest impact on your website.”
2. Botify: this French tool actually calculates three metrics per URL for you: Internal Pagerank, Internal Pagerank Position and Raw Internal Pagerank
3. Onpage.org: this German tool, now finally available in English, uses the traditional concept of PageRank.

Have fun visualizing and optimizing your internal link structure!

About Jan-Willem Bobbink

Jan-Willem Bobbink got addicted to online marketing in 2004, when he built his first international webshop at the age of 16. He currently works as a freelance SEO for global clients and is an ambassador for Majestic. His blog can be found at Notprovided.eu, and he shares his cycling adventures at CATW!

  • Quite interesting topic and very useful information. Right now I am downloading Gephi!!!

  • Emma Labrador

    Hi Jan-Willem, nice article! I didn’t know the Gephi solution. Do you know OnCrawl? It also offers an internal PageRank metric called Inrank that helps you optimize your internal linking by page depth and your top pages, for ecommerce websites for instance.

  • Mohn Jueller

    Aren’t these kinds of methods inherently flawed, since there is no way of knowing the base PR of individual pages?

    Let’s say you have some popular product or blog pages with a huge amount of inbound links (and thus link equity). In those cases (which aren’t uncommon at all) this method provides no way of differentiating between the high- and low-equity pages, which will inherently skew the results. One could make the case for first getting baseline ‘authority’ data from providers like Moz or Majestic, but as you know those metrics are hugely flawed as well.

    This is not in any way intended as a jab at your method, and quite frankly it’s the best we’ve got, but I’d love to see a side-by-side comparison between this method and the (still internally updated) PR data. The differing outcome might be surprising, or not.