Google and Content Farms – what could the new classifier uncover?
In March 2010 I first wrote about what I thought Google’s next step would be. The post was triggered at SES London 2010 while I was watching the agencies in favour of automated spam present. I felt that this would be the next step in Google’s ongoing fight against web spam, and it looks like this is now the case.
Matt Cutts confirmed that content farms are next on Google’s radar, although he defines them as sites with “shallow or low quality content”, which is a fairly broad brush and open to misinterpretation.
What constitutes shallow or low quality content?
The first thing that springs to mind is the huge amount of bot-regenerated content that is being touted. Upload your article into the software, replace a few words with tags and you’ll get 5,000 articles that are “unique”. Distribute these with your backlinks inserted and off you go. In my opinion that isn’t the most natural link graph to be building, and it would be fairly high on the list for Google to tackle. I’m not saying it would be easy to tackle though.
The second thing that springs to mind is this: what about sites that use huge numbers of freelancers to generate content at the lowest possible cost?
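To show why these “unique” articles are only superficially unique, here is a rough sketch in Python that compares two texts by their overlapping three-word sequences (often called shingles) and scores them with Jaccard similarity. This is purely my own illustration, not a description of anything Google has said it does, and the function names and the choice of three-word shingles are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences found in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        # Both texts are too short to form any shingle; treat as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

A spun article that swaps a handful of words still shares most of its shingles with the original, so it scores far higher against the original than an unrelated article would. The hard part, and the reason I say it isn’t easy, is doing this across billions of pages and against spinners that rewrite whole phrases.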
Demand Media listed on the NYSE last Wednesday with a business model that mainly consists of paying as little as possible for content while reaping advertising rewards as visitors flood in through organic search. They closed the day higher than New York Times Co, which indicates to me that the market has a lot more appetite to invest in this low-cost content model than in the long-standing, reputable publishing house model, due to its far lower operating costs. The problem from my perspective is that it’s driven by the bottom line rather than by journalistic principles, so maintaining quality and accuracy will be a challenge.
Demand Media was quick to point out that they don’t fit into this category. They do have a lot of great content, but then again they have a great deal of drivel too, so where and how does Google draw the line?
So what is and isn’t feasible?
Well, Matt’s original post talks about a document-level classifier, but that has me wondering what signals a bot could use to determine quality. Surely the topic has to play a part? 200 words on a local school fair is a lot of content, whereas it isn’t nearly enough when trying to explain search marketing tactics to a layman. And how could it possibly determine accuracy?
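To make the point about topic concrete, here is a minimal sketch of one signal a document-level classifier could, in principle, use: a page’s word count relative to what is typical for its topic. The function name and the idea of comparing against a per-topic median are my own assumptions for illustration; Matt’s post doesn’t say which signals the classifier uses.

```python
from statistics import median


def relative_length(word_count, topic_word_counts):
    """Ratio of a page's word count to the median word count for its topic.

    topic_word_counts is a hypothetical sample of word counts from other
    pages on the same topic; how you would collect it is left open.
    """
    typical = median(topic_word_counts)
    return word_count / typical if typical else 0.0
```

With made-up numbers, a 200-word page would score above 1 against school fair pages that typically run to 150 words, and well below 1 against search marketing guides that typically run to 1,500. A signal like this says nothing about accuracy, which is exactly the part I can’t see a bot solving.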
I can’t help but think social signals will play a large part in this in the future. A well-received article or piece of content will be shared far more than automated scraper content. On the flip side social signals are also fairly easy to imitate and automate which would introduce its own level of spam.
They will also need to work out intent, as people criticising an article’s accuracy should, in theory, count against the article. It’s a minefield to do programmatically and would result in large-scale collateral damage if they implemented it too broadly.
I personally feel Matt means that Google will crack down on scraper sites first, before looking at the broader article spam and low quality content industry. Tackling the latter would be far more complex to implement and would certainly influence more than just 2% of queries.
It would be great to get other people’s opinions on what could be done with this update, so drop in a comment below!