Common Crawl Errors for Google News

11th July 2016

For a few years now, Google Search Console (formerly Webmaster Tools) has shown News-specific errors in the Crawl Errors report. These errors, only available to verified websites that are included in the manually curated Google News index, list problems that Google News finds with content on your news website.

Ironically, while shown in the Crawl Errors section of Search Console, these errors are actually indexing problems that Google News encounters on your website.

These errors can often be quite frustrating to deal with. While Google’s web search indexer is an advanced and highly intelligent piece of software, Google’s news indexer is a very different animal. In News, Google trades intelligence for speed. Google News is all about getting the latest news to its users as fast as possible, and as a result its indexer doesn’t have the luxury of analysing an article page in detail before deciding to rank it for a given news-relevant query.

So as a news publisher, you need to make sure your website has as few stumbling blocks as possible for Google News to crawl, index, and rank your content. SEO for Google News is all about making the news indexer’s work as simple as possible.

The error messages in Search Console are a great source of information. A few years ago Adam Sherk wrote a great piece outlining the various different errors you might see in the News-specific tab of your Crawl Errors report.

In several recent projects with publishers, I’ve encountered these errors and had to troubleshoot their root cause. I want to share some of my findings with you, because the causes of many errors that Google News encounters are not always immediately clear.

Google Search Console – News Crawl Errors Report

In Search Console, when you go to the Crawl Errors report in the Crawl section, you will see a News tab if your website is part of Google News’s index. It pays to look at these error reports regularly, because pages that are listed here are unlikely to be included in Google News, and as such you may be missing out on a lot of potential traffic.

Here there are two possible types of errors: article content and article title. Adam Sherk’s article lists all possible errors, so I won’t repeat them here. What I will do is show a number of different error messages, and what I found to be the root cause.

‘Article Too Long’

This is an error that used to be quite rare, but that I see cropping up more and more often. Google News has both a minimum and a maximum article length that it requires submitted news articles to fit within. The minimum length is 80 words, but Google doesn’t tell you what the maximum length is.

When I come across the ‘article too long’ error in Search Console, 9 times out of 10 the page in question is not actually an article page, but a news category page listing dozens of articles. In those cases I feel I can safely ignore the error.

Sometimes, however, the page in question is a normal news article, and not even a particularly lengthy one. I’ve seen these errors appear in Search Console for articles that are less than 500 words long.

In these cases, there is a different cause. The actual article length is not the problem; it’s Google News’s inability to properly recognise which part of your source code contains the actual article, and which parts belong to other on-page elements like navigation, sidebar, recommended articles, comments, etc.

Unlike regular Google search’s indexer, the Google News indexer doesn’t render your articles before indexing them – it just analyses the source code and tries to extract the article text. If your code is not particularly clean, Google News might not extract your article text properly, and as a result it thinks other parts of the source code also belong to the article.

This can be the case if you embed image galleries into your articles or integrate on-site commenting features. These can be misinterpreted as being part of the article content, and thus trigger this error.

‘Article Too Short’

On the other end of the spectrum, you sometimes see Search Console errors where Google News believes your article doesn’t have enough content. Such ‘article too short’ errors are common for articles that are just image galleries, but sometimes also occur with articles that are much meatier and should not trigger such an error.

Generally speaking, I advise my publisher clients to get at least 200 words into an article to reduce the chances of triggering an ‘article too short’ error. Google says the actual minimum length is 80 words, but I regularly see ‘article too short’ errors on stories between 80 and 175 words long. So I always recommend building in a bit of buffer.
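As a rough illustration of that buffer, here is a minimal Python sketch that sorts an article's plain text into three bands using the two thresholds mentioned above. The function names are hypothetical, and the word-counting rule is an assumption: Google does not document how its news indexer counts words.

```python
import re

GOOGLE_STATED_MINIMUM = 80   # minimum length Google states, per the article
RECOMMENDED_MINIMUM = 200    # the safety buffer recommended in the article


def count_words(article_text: str) -> int:
    """Count whitespace-separated tokens containing at least one letter or digit.

    This counting rule is an assumption, not Google's documented method.
    """
    return sum(1 for token in article_text.split() if re.search(r"\w", token))


def length_risk(article_text: str) -> str:
    """Classify the extracted article text by its word count."""
    words = count_words(article_text)
    if words < GOOGLE_STATED_MINIMUM:
        return "below stated minimum"
    if words < RECOMMENDED_MINIMUM:
        return "in the risky range"
    return "ok"
```

Run it on the article text only, stripped of navigation, captions, and comments, since that is the text the length rule applies to.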

However, I occasionally see this error appear on 500+ word articles. In that case something else is causing the problem. Sometimes, the article’s source code is interrupted by an element, such as a ‘Related Articles’ box, as shown here:

[Image: Fragmented article]

When this ‘Related Articles’ box sits in the middle of the article code in the page’s source code, Google can sometimes conclude that only the opening paragraphs above the box are part of the news article, and ignore what follows. This can then trigger the ‘article too short’ error in Search Console.

It doesn’t have to be a ‘Related Articles’ feature that breaks up the code – it can be an image gallery, advertising slot, or even a highlighted quote. Anything that breaks up the article content’s source code into multiple sections can potentially cause Google News to misinterpret the article’s actual length.

My advice would always be to try and contain the entire article text in a single segment of source code, without any interruptions like related content boxes, ad slots, or galleries. If your articles do have such elements, ensure they’re included in the source code below the article content’s code. That way you can make sure Google News’s indexer can more easily recognise the code segment that contains the article versus other code snippets.
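One way to audit templates for this is to list every element that sits directly inside the article container but is not plain article text. The sketch below uses Python's standard `html.parser` module. The container class name `article-body` and the set of tags treated as article text are assumptions for illustration, and the depth tracking assumes well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: which direct children count as ordinary article text.
TEXT_TAGS = {"p", "h2", "h3", "ul", "ol"}


class BodyInterruptionFinder(HTMLParser):
    """Collect direct children of the article container that are not text blocks."""

    def __init__(self, body_class: str):
        super().__init__()
        self.body_class = body_class
        self.depth = 0            # 0 = outside the container, 1 = directly inside
        self.interruptions = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if self.body_class in classes:
                self.depth = 1    # entered the article container
            return
        if self.depth == 1 and tag not in TEXT_TAGS:
            self.interruptions.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def find_interruptions(html_source: str, body_class: str = "article-body") -> list:
    finder = BodyInterruptionFinder(body_class)
    finder.feed(html_source)
    finder.close()
    return finder.interruptions
```

An empty result means the article text sits in one uninterrupted run of text blocks. Anything returned is a candidate for moving below the article content in the source.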

When your code is cleanly structured like that, you also greatly reduce the chances of another common error: ‘article fragmented’. This is when Google News recognises that your article content is spread across several different code snippets on your page, and struggles to index the entire article correctly.

‘Title Not Found’

The last type of error I want to elaborate on is the ‘title not found’ issue. This occurs when Google News cannot identify your article’s headline in the source code.

This could have a very straightforward cause, such as multiple <h1> headlines on a single page, or the page in question being a category page rather than a news article.

But recently I’ve come across some unusual instances of this error, and have uncovered some less common root causes that I want to share with you.

Nested Tags: Twice in recent months I have seen Google News reject many of a news site’s titles because the <h1> headline contained a nested tag. Here’s an example of a nested H1-tag:

[Image: Nested H1 headline tag]

In this example we see a kicker text nested within the <h1> headline tag. This can then cause problems with Google News, as it gets confused by the nested tag and rejects the entire <h1> tag. With only one H1 on a page, Google News then has no other headline to index.

The solution is simple: don’t nest any tags within your H1 headline.
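To check a template for this, a small script can report any tags that open inside the <h1>. This is a minimal sketch using Python's standard `html.parser`. The class and function names are hypothetical, and the sample markup in the usage note is an invented stand-in for the kicker example, not the original site's code.

```python
from html.parser import HTMLParser


class H1NestingChecker(HTMLParser):
    """Record every tag that opens while the parser is inside an <h1>."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.nested_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        elif self.in_h1:
            self.nested_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False


def nested_tags_in_h1(html_source: str) -> list:
    """Return the names of tags nested in any <h1> (empty list if none)."""
    checker = H1NestingChecker()
    checker.feed(html_source)
    checker.close()
    return checker.nested_tags
```

For instance, `nested_tags_in_h1('<h1><span class="kicker">Politics</span> Main headline</h1>')` reports the `span`, while a plain `<h1>Main headline</h1>` yields an empty list.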

Title Too Long: Article headlines should never be too short, of course. But your headline can also be too long. While Google recommends keeping your title between 2 and 22 words, it’s not so much the number of words that limits your title: it’s the number of characters. 110 characters, to be precise.

You can try this out for yourself; use the Structured Data Testing Tool on a news article with valid NewsArticle markup, and play around with the ‘headline’ field. You’ll see that at 110 characters, the article markup will still validate, while at 111 or more characters you get an error.
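If you would rather catch overlong headlines before publishing, the same limits can go into a simple pre-publish check. The sketch below is hypothetical and uses the 110-character limit and the 2 to 22 word range mentioned above. It counts characters with Python's `len()`, which may not match exactly how Google's tool counts them.

```python
MAX_HEADLINE_CHARS = 110        # character limit observed in the article
MIN_WORDS, MAX_WORDS = 2, 22    # word range Google recommends, per the article


def headline_problems(headline: str) -> list:
    """Return a list of human-readable problems; empty if the headline passes."""
    text = headline.strip()
    problems = []
    if len(text) > MAX_HEADLINE_CHARS:
        problems.append(f"{len(text)} characters (limit {MAX_HEADLINE_CHARS})")
    word_count = len(text.split())
    if not MIN_WORDS <= word_count <= MAX_WORDS:
        problems.append(f"{word_count} words (recommended {MIN_WORDS}-{MAX_WORDS})")
    return problems
```

A check like this could run in the CMS when an editor saves a story, so the headline is fixed before Google News ever sees it.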

Escaped Characters: Another issue I’ve seen recently that causes titles to be rejected by Google News is when a headline contains special characters like apostrophes, dashes, and quotation marks, which are escaped in the source code and embedded as HTML character references. Here’s an example:

[Image: Headline with ASCII characters]

The content in the <h1> tag is ‘This Is The Article’s Main headline’ but the apostrophe is escaped in the source code as the numeric character reference &#039;. This then causes Google News to reject the headline as the article’s title, and you have another ‘title not found’ error in Search Console.

To prevent this error from occurring, make sure your CMS doesn’t replace special characters such as apostrophes, dashes, and quotation marks with character references in your website’s HTML source code.
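To spot affected headlines, you can scan the raw headline markup for character references. The following is a minimal sketch with hypothetical names. It uses Python's standard `html.unescape` only to show what the headline reads like once decoded. Note that `&amp;`, `&lt;`, and `&gt;` have to stay escaped in valid HTML, so the sketch skips them by assumption.

```python
import html
import re

# Matches decimal (&#039;), hexadecimal (&#x27;), and named (&rsquo;) references.
ENTITY_PATTERN = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);")
# Assumption: these references are required by HTML itself and are not reported.
REQUIRED_ESCAPES = {"&amp;", "&lt;", "&gt;"}


def avoidable_entities(headline_markup: str) -> list:
    """Return character references in the headline that could be literal characters."""
    found = ENTITY_PATTERN.findall(headline_markup)
    return [entity for entity in found if entity not in REQUIRED_ESCAPES]


raw_headline = "This Is The Article&#039;s Main headline"
flagged = avoidable_entities(raw_headline)    # the escaped apostrophe
decoded = html.unescape(raw_headline)         # the headline as a reader sees it
```

Any headline where `avoidable_entities` returns something is worth checking against the Search Console report.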

The Key Message: Clean Code Matters

As you can see from these examples, there’s much that can go wrong with your news website. Google News is not a very advanced system – it needs to be fast to be able to deliver the most recent and relevant news, and it sacrifices a lot to make that happen.

So you need to make life as easy as possible for Google News. And that means you need clean code more than anything else. Ensure your code is well-structured, straightforward, and as uncomplicated as you can make it.

When you do that, you’ll see the number of Search Console errors reduce dramatically. That, in turn, results in greater visibility in Google News, and more traffic to your website.

And that’s what we’re after in the end.

Addendum: if you have access to a news site’s Publisher Center account, this piece from Glenn Gabe explains how you can use the built-in troubleshooting tool to help identify issues.

Written By
Barry Adams is the chief editor of State of Digital and is an award-winning SEO consultant delivering specialised SEO services to clients worldwide.