JavaScript and SEO: The Difference Between Crawling and Indexing

As the web continues to evolve, there is increasing demand for websites to become ever more advanced and interactive. As a result, many developers choose to build websites using frameworks like React and Angular. This gives rise to the issue of how to optimise such sites for SEO. Basically, the question developers and SEOs ask is: can search engines like Google crawl JavaScript?

And that is the wrong question to ask.

To clarify, if you use JavaScript frameworks with server-side rendering, you’ve already solved the problem before it even arises. This article is about JavaScript framework implementations that serve the JS code to users and Googlebot, relying on client-side rendering. And that causes all kinds of issues.

Before I can explain why, it’s important that we first have a basic understanding of how search engines actually work. I’ve written about my ‘Three Pillars of SEO’ approach before, which is based on a simplified model of web search engines like Google. I’ll summarise the main points here:

Three Search Engine Processes

In a nutshell, most information retrieval systems have three main processes:

  1. Crawler
  2. Indexer
  3. Query Engine

When it comes to JavaScript and SEO, the first two processes are what we want to focus on. Within Google, the crawler is known as Googlebot, and their indexing infrastructure is called Caffeine. They perform very different functions, and it’s important that we understand these to prevent any confusion.

The crawler is all about discovery. At heart, its purpose is straightforward: find all URLs and crawl them. It is actually a pretty complicated system, with subprocesses that handle (to name but a few) seed sets, crawl queuing and scheduling, URL importance, and monitoring of server response times.

The crawler also has a parsing module which looks at the HTML source code of what is being crawled and extracts any links that it finds. The parser does not render pages; it just analyses the source code and extracts any URLs found in <a href="…"> snippets.
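To illustrate, here is a minimal sketch of what such a source-only link parser does, using Python’s standard-library html.parser. The HTML snippet and URLs are invented for illustration, and this is of course not Googlebot’s actual parser – the point is simply that a link which only exists after JavaScript executes never appears in the raw source, so a parser that doesn’t render can’t see it:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in raw HTML, without rendering."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = """
<a href="/about">About us</a>
<script>
  // This link only exists once JavaScript runs, so a
  // source-only parser never sees it:
  document.write('<a href="/js-only">JS-only page</a>');
</script>
"""

extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['/about'] – the JS-injected link is invisible
```

The static link is extracted; the one inside the script block is just text to the parser. Only a full rendering engine would discover it.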

When the crawler sees URLs that are new or changed since its last visit, it sends them to the indexer. The indexer then tries to make sense of the URL, analysing its content and relevancy. Here we also have a lot of subprocesses looking at things like page layout, canonicalisation, and evaluating the link graph to determine a URL’s PageRank (because, yes, Google still uses that metric internally to determine a URL’s importance).

What the indexer also does is render webpages and execute JavaScript. In fact, Google recently published a set of documents on their Developers site that explain how their Web Rendering Service (WRS) works.

It is the WRS within the indexer that executes JavaScript. The Fetch & Render feature in Search Console allows you to see exactly how Google’s WRS sees your page.

The crawler and indexer work closely together; the crawler sends what it finds to the indexer, and the indexer feeds new URLs (discovered by, for example, executing JavaScript) back to the crawler. The indexer also helps prioritise URLs for the crawler, placing more emphasis on high-value URLs that the indexer wants the crawler to revisit regularly.

Googlebot or Caffeine?

The confusion all starts when people – be they SEOs, developers, or even Googlers themselves – say ‘Googlebot’ (the crawler) but actually mean ‘Caffeine’ (the indexer). This confusion is entirely understandable, because the nomenclature is used interchangeably even in Google’s own documentation:

When these WRS documents were published, I put the question to Gary Illyes, because the use of ‘Googlebot’ there confused me. The crawler doesn’t render anything – it has a basic parser to extract URLs from the source code, but it doesn’t execute JavaScript. The indexer does that, so the WRS is part of the Caffeine infrastructure. Right?

Gary Illyes on Twitter

Right. But the contradictory text in the WRS documentation remains, so it’s entirely forgivable for SEOs to confuse the two processes and just call it all ‘Googlebot’. That happens all the time, even among the most experienced and knowledgeable SEOs in the industry.

And that’s a problem.

Crawling, Indexing, and Ranking

When developers and SEOs ask whether Googlebot can crawl JavaScript, we tend to think the answer is ‘yes’. Because Google does actually render JavaScript, extracts links from it, and ranks those pages. So does it really matter that it’s not the crawler that handles JavaScript, but the indexer? Do we really need to know that different processes handle different things if the outcome is that Google ranks JavaScript pages?

Yes, actually. We do need to know that.

Despite the incredible sophistication of Googlebot and Caffeine, JavaScript content makes the entire process of crawling and indexing enormously inefficient. By embedding content and links in JavaScript, we are asking – nay, demanding – that Google put in the effort to render all our pages.

Which, to its credit, Google will actually do. But that takes time, and a lot of interplay between the crawler and indexer.

And, as we know, Google does not have infinite patience. The concept of ‘crawl budget’ – an amalgamation of different concepts around crawl prioritisation and URL importance (Dawn Anderson is an expert on this, and make sure you also read my article on URLs here) – tells us that Google will not try endlessly to crawl all your site’s pages. We have to help a bit and ensure that the pages we want to be crawled and indexed are easily found and properly canonicalised.

JavaScript = Inefficiency

What JavaScript frameworks do is inject a layer of complexity into this equation.

What should be a relatively simple process, where the crawler finds your site’s pages and the indexer then evaluates them, becomes a cumbersome endeavour. On JavaScript sites where most or all internal links are not part of the HTML source code, the crawler initially finds only a limited set of URLs. It then has to wait for the indexer to render those pages and extract new URLs, which the crawler in turn crawls and sends back to the indexer. And so on, and so forth.
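This back-and-forth can be sketched as a toy simulation. Everything here is an assumption for illustration – the cost numbers, the site graphs, and the one-render-per-page model are invented and in no way reflect Google’s actual internals. The point is only that every page whose links depend on JavaScript adds an expensive render step to discovery:

```python
CRAWL_COST = 1    # assumed: fetching + parsing raw HTML is cheap
RENDER_COST = 10  # assumed: fully rendering a page in the indexer is expensive

def discovery_cost(site, start="/"):
    """Total cost to discover every URL on a toy site, where `site` maps
    each URL to (links present in raw HTML, links that only appear after
    rendering). Pages with JS-only links force a render before those
    links surface for the crawler."""
    known, frontier, cost = {start}, {start}, 0
    while frontier:
        new = set()
        for url in frontier:
            html_links, js_links = site[url]
            cost += CRAWL_COST        # crawler fetches and parses the source
            if js_links:
                cost += RENDER_COST   # indexer must render to expose these links
            new |= (set(html_links) | set(js_links)) - known
        known |= new
        frontier = new
    return cost

# Same three-page site, once with plain HTML links, once with JS-only links:
plain = {"/": (["/a", "/b"], []), "/a": ([], []), "/b": ([], [])}
js    = {"/": ([], ["/a", "/b"]), "/a": ([], []), "/b": ([], [])}
print(discovery_cost(plain), discovery_cost(js))  # 3 vs 13
```

Same site structure, same pages – but shifting the links into JavaScript multiplies the work needed before everything is even discovered.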

With such JavaScript-based websites, crawling and indexing becomes slow and inefficient.

What this also means is that the evaluation of a site’s internal link graph has to happen again and again as new URLs are extracted from JavaScript. With every new set of pages the indexer manages to pry from the site’s JavaScript code, the internal site structure has to be re-evaluated and each page’s relative importance changes.

This can lead to all kinds of inefficiencies, where key pages are deemed unimportant due to a lack of internal link value, or relatively unimportant pages are seen as high value because there are plain HTML links pointing to them that don’t require JavaScript rendering to see.

And because pages are crawled and rendered according to their perceived importance, you could actually see Google spending a lot of time crawling and rendering the wrong pages and spending very little time on the pages you actually want to rank.


Good SEO is Efficiency

Over the years I’ve learnt that good SEO is, in large part, about making search engines’ lives easier. When we make our content easy to discover, easy to digest, and easy to evaluate, we are rewarded with better rankings in SERPs.

JavaScript makes search engines’ lives harder. We are asking Google to work harder to discover, digest, and evaluate our content. And often that results in lower rankings.

Yes, JavaScript content is indexed and ranked. But it is done so almost reluctantly. If you are serious about achieving success in organic search, it pays to make things as simple as possible. And that means serving content and links in plain HTML to search engines, so that they can be as efficient as possible when they crawl, index, and rank your webpages.
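As a contrived example (the URL is invented), the first link below is discoverable from the raw source by any parser; the second only ‘exists’ once JavaScript runs, so the crawler’s parser cannot see it and discovery has to wait for rendering:

```html
<!-- Crawlable: the URL is right there in the raw HTML source -->
<a href="/products/blue-widget">Blue widget</a>

<!-- Not crawlable from source alone: the URL only materialises when
     JavaScript executes, so only the renderer can discover it -->
<span onclick="location.href='/products/blue-widget'">Blue widget</span>
```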

The factually accurate answer to the question “does Google crawl JavaScript?” is no.

The answer to “does Google index JavaScript?” is yes.

The answer to “should I use JavaScript?” is it depends.

If you care about SEO, less JavaScript means more efficiency. And more efficiency means higher rankings. Where you place your emphasis will determine the road you should embark on. Good luck.

Make sure you read and bookmark our JavaScript & SEO Definitive Resource List.

About Barry Adams

Barry Adams is one of the chief editors of State of Digital and is an award-winning SEO consultant delivering specialised technical SEO services to clients worldwide.

15 thoughts on “JavaScript and SEO: The Difference Between Crawling and Indexing”

  1. Great explanation of the difference between Googlebot & Caffeine regarding JavaScript and I agree with Justinas “a must-read for every SEO pro” – thank you.

  2. I wonder how the XML Sitemap, and/or a link to this sitemap in robots.txt, or submitted directly in Search Console, mitigates the decreased efficiency as mentioned between Googlebot and Caffeine. Especially in regard to perceived importance of web pages.

    1. Not really. XML sitemaps help improve the crawl priority of the URLs submitted, but don’t guarantee the URLs are crawled in the first place. In my experience, if a URL cannot be found in a regular web crawl, including it in an XML sitemap doesn’t really help.

  3. Very clear explanation, Barry. I’ve read some of your other stuff re: Google News, so I’m wondering if you’d share your thoughts regarding Googlebot-News and how you see the crawling >> indexing >> ranking process in that context. I know JavaScript isn’t rendered for news content (not enough time), but in what other ways do you see the crawling and indexing process being different for news content?

    1. Yes it’s very different for news content. The indexing process for news articles is much faster and therefore less thorough. This means the raw HTML source code needs to be clean and facilitate easy extraction of the article. So: clean use of heading tags, article content in uninterrupted blocks of code, as little code pollution as possible (especially if there’s a lot of code above where the article starts in the HTML source), etc.

      Also, due to the relative simplicity of the Google News system (it’s all about speed after all), ranking is more based on fairly straightforward keyword matching (especially in the title and headline) and the site’s topical focus.

      1. Thanks, Barry. Yes, I’ve made similar observations about Google News. I’m always trying to gain more insight into the news crawler itself, as I primarily work with media/news publishers. One pet peeve of mine is that many SEOs make broad statements about Google being able to crawl JavaScript when, in fact, Google News doesn’t render client side code at all. Most folks don’t make that distinction, which means I have to routinely re-educate my clients on that point. Thanks for the good article and response!
