Enterprise Link Spam Analysis: Ian Lurie at #Linklove
Recovery from a manual link penalty requires manual review of links. Lots of links. In this session, Ian Lurie will walk through an automated solution he built and how you can build something similar. He’ll also share insights about Google’s detection of suspicious versus legitimate links from a review of over 400,000 links across 100+ sites.
Ian Lurie, the “most acid” SEO blogger, is on stage talking about Enterprise Link Spam Analysis, which usually means making sense of millions of backlinks: easy, isn’t it?
That’s why Ian said to himself: “Let’s try machine learning”, something that makes you feel like a moron. “How hard can it be?”: yes, Ian asked himself that question! Ian presented many tools and links during his presentation. Luckily we can find them all in this bitly bundle.
And what is the most common reply from clients? The infamous “these are good links!!!” exclamation.
Good links, seriously?
OK, we must start somewhere and, when working on a reinclusion request, the start is getting all the links. For that step, Ian suggests using Google Webmaster Tools, Open Site Explorer and Majestic SEO; then we should build one big spreadsheet, using the link API plugin by SEOGadget, so we have all the metrics related to those links.
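The collection step boils down to merging several exports and removing duplicates. Here is a minimal sketch of that idea, assuming each tool's export has already been loaded into a plain list of URL strings. The function names `normalize_url` and `merge_link_sources`, the source labels and the sample URLs are all hypothetical; this is not the SEOGadget plugin or any tool's API.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the host, drop the fragment and any trailing slash,
    so near-duplicate URLs collapse to one key."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_link_sources(sources):
    """Map each normalized URL to the sorted list of sources that reported it."""
    merged = {}
    for source_name, urls in sources.items():
        for url in urls:
            merged.setdefault(normalize_url(url), set()).add(source_name)
    return {url: sorted(names) for url, names in merged.items()}


# Hypothetical sample data standing in for the three exports.
exports = {
    "webmaster_tools": ["http://Example.com/page/", "http://spam.example.net/links"],
    "open_site_explorer": ["http://example.com/page", "http://blog.example.org/post#comments"],
    "majestic": ["http://spam.example.net/links/"],
}
all_links = merge_link_sources(exports)

for url, found_in in sorted(all_links.items()):
    print(url, found_in)
```

Keeping track of which sources reported each link is a design choice made here for illustration: a link seen by several tools is less likely to be a stale or mistaken entry.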
Finally we start evaluating the URLs.
The first things we must evaluate are the domain names themselves, then we should analyze the titles: the weirder they are, the spammier.
This approach is sound, but at huge scale we run into the enormous problem of filtering the bad links from the good. How can you do it manually for 500,000 links?
This is a knowledge problem. Machine learning can be a way to sort through all the noise in a backlink profile.
First you must start with a question: is this page spam? The answer lies in a good classification, which is the result of a correct training set plus an algorithm that learns from that training set.
Ian cites the infographic Google published a few weeks ago (How Search Works), and this page where live examples of spam pages are shown, because it can be a great source for building the training set.
Then comes the algorithm. What to choose? Text-based, supervised, unsupervised? The result is the classification, and the classification is the answer. Ian experimented with a first training set based just on words, but it failed because there were too many false positives.
So he tried a second training set, moving from words to numbers: counting verbs, words and so on. With that change, the training set and the logistic regression algorithm applied to it worked.
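To make the "words into numbers" idea concrete, here is a toy sketch of a logistic regression trained on numeric page features. Everything in it is an assumption made for illustration: the two features (unique-word ratio and share of tokens that are URLs), the four sample texts, and the hand-written gradient descent, which stands in for whatever library Ian actually used. It is not his feature set or his code.

```python
import math


def features(text):
    """Turn page text into numbers: unique-word ratio and share of URL tokens."""
    words = text.lower().split()
    total = len(words) or 1
    unique_ratio = len(set(words)) / total
    url_share = sum(w.startswith("http") for w in words) / total
    return [unique_ratio, url_share]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Estimated probability that the feature vector x belongs to a spam page."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(samples, labels, epochs=2000, rate=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            weights = [w - rate * error * xi for w, xi in zip(weights, x)]
            bias -= rate * error
    return weights, bias


# Hypothetical, hand-made training set: 1 = spam, 0 = not spam.
training_texts = [
    ("cheap pills cheap pills cheap pills buy cheap pills http://a.example http://b.example", 1),
    ("casino casino casino bonus casino bonus http://c.example", 1),
    ("our team reviewed the quarterly results and published a detailed report", 0),
    ("this recipe uses fresh tomatoes basil and a little olive oil", 0),
]
samples = [features(text) for text, _ in training_texts]
labels = [label for _, label in training_texts]
weights, bias = train(samples, labels)

for text in ["buy buy buy buy buy links http://x.example",
             "we walked along a quiet river today"]:
    print(round(predict(weights, bias, features(text)), 3), text)
```

A real training set would need thousands of labeled pages and many more features; with four examples the model only shows the mechanics of training set plus algorithm.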
Ian also used a readability scale, the MajesticSEO page Trust Flow, domain Trust Flow and other metrics in order to create a huge training set.
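A readability score is one more number that can be computed from the page text alone. The talk recap does not say which scale Ian used, so the sketch below picks the Automated Readability Index purely as an example, because it needs only character, word and sentence counts. The function name and the sample sentence are hypothetical; Trust Flow values would have to come from Majestic's own exports and are not computed here.

```python
import re


def automated_readability_index(text):
    """Automated Readability Index:
    4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.
    Characters are letters and digits; sentences are split on . ! ?"""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    characters = sum(ch.isalnum() for ch in text)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


sample_score = automated_readability_index("The cat sat. The dog ran.")
print(sample_score)
```

Keyword-stuffed pages often have no sentence punctuation at all, so the whole text counts as one very long sentence and the score rises sharply. That is the kind of signal such a feature could add, although how much it helps is an open question for any given training set.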
The result was isitspam.portent.com. It is not perfect, but we can try it and help it learn more, because it is still learning.
Lessons learned by Ian
- Spam is not about spam pages or even spam links. The same link can appear on different pages and be spam on one and not on another.
- There is no spam as such, nor a fixed definition of spam. It is all about the context in which the link appears. For instance, you may have links from .edu sites, but if they sit on a keyword-stuffed page or a link-list page on those sites, they are considered spammy.
- Ian also learned that Google’s tolerance for spam is decreasing. That means that what is not spam now may be considered spam by Google in the near future.
- Final lesson: clean your link profile now!
And if you want to use machine learning as Ian did, build it by vertical, because every niche has its own context. And we should all try to understand machine learning, because it is used more and more in marketing too, given the big data marketers have to deal with daily.