Panda in detail

12th October 2011

This is a guest post by Peter van der Graaf. Peter is a big fan of behavioral psychology and mathematics. He mainly helps his clients with their internal SEO evangelism, link building strategies and international SEO efforts. The scale at which he and his partners perform search algorithm tests has the potential to give great insights.

I was surprised to find out how much has been written about the Google Panda update and how little information has been shared so far about what is really happening.

Machine learning

Google’s algorithm for the characteristics of unnatural pages is periodically updated by a machine learning background job. This means it is not a live algorithm! The much-reported Panda versions 1.0 to 2.5 are algorithm changes that are first calculated on a training dataset. Combined with the existing learnings, they are then exported to the live Google environment as more static algorithm tests.

This means that while bounce rate (in this case: visitors returning to search results quickly) isn’t used as a direct ranking factor, it is used to teach the Panda new tricks. Signals like bounce rate are fed as bamboo to the Panda background system with the instruction to find out what patterns can be derived from characteristics that form thin content, unnatural text and excessive on-page advertising. The system combines various attributes to reach a high degree of certainty that a site is engaging in spammy activities.

For those familiar with “distributed tree learning”, look up the works of Google engineer Biswanath Panda, after whom the Panda update was named. He explains how continuously splitting sites into groups with similar attribute values helps you afterwards derive which attributes affected a certain outcome (like a high bounce rate) the most. It also gives some indication of the thresholds to be used, and it can signal when false positives or negatives are likely to occur.
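To make the splitting idea concrete, here is a minimal sketch of a single tree split in Python. It is not Google's implementation: the attribute names (`ad_ratio`, `duplicate_ratio`), the `high_bounce` label and the toy data are all invented for illustration, and Gini impurity is just one common way to score a split.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, label_key):
    """Return (attribute, threshold, impurity_reduction) for the best single split."""
    labels = [row[label_key] for row in rows]
    base = gini(labels)
    best = (None, None, 0.0)
    attributes = [key for key in rows[0] if key != label_key]
    for attr in attributes:
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds lie halfway between neighbouring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[label_key] for r in rows if r[attr] <= threshold]
            right = [r[label_key] for r in rows if r[attr] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            gain = base - weighted
            if gain > best[2]:
                best = (attr, threshold, gain)
    return best

# Hypothetical training data: one row per site, values invented for this sketch.
SITES = [
    {"ad_ratio": 0.10, "duplicate_ratio": 0.05, "high_bounce": False},
    {"ad_ratio": 0.50, "duplicate_ratio": 0.10, "high_bounce": False},
    {"ad_ratio": 0.20, "duplicate_ratio": 0.15, "high_bounce": False},
    {"ad_ratio": 0.15, "duplicate_ratio": 0.60, "high_bounce": True},
    {"ad_ratio": 0.55, "duplicate_ratio": 0.70, "high_bounce": True},
    {"ad_ratio": 0.30, "duplicate_ratio": 0.80, "high_bounce": True},
]

attribute, threshold, gain = best_split(SITES, "high_bounce")
print(attribute, threshold, gain)
```

On this toy data the split on `duplicate_ratio` separates the high-bounce sites perfectly, so it is reported as the most informative attribute together with a candidate threshold. A full tree would repeat the same step on each resulting group.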

Whether Panda will ever become a live (continuously updated) algorithm remains to be seen. It may even be that the derived tests become so effective that no further updates are required.

Steep or sloping threshold?

Because Panda consists of large combinations of factors, it seems to be more certain of its outcome. While existing algorithms for unnatural behavior used a sloping threshold, in which increasing evidence pushed you gradually towards lower rankings, Panda currently uses a more thorough approach.

Gradually increasing the degree of unnatural text maintained existing ranking for quite some time, but eventually resulted in a steep drop in ranking for all tested websites. Individual elements within the algorithm for thin content are hard to reverse engineer, but once you cross a certain point you are sure to be hit. Because signals are inspected in combinations that include link value attributes, not every site has the same threshold.
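The difference between the two behaviours can be sketched with two small functions. This is only an illustration: the `score` input (a 0 to 1 measure of how unnatural a site looks), the parameter values and the function names are assumptions, not anything known about Google's internals.

```python
def sloping_penalty(score, start=0.3, end=0.9):
    """Penalty in [0, 1] that grows linearly as evidence increases from start to end."""
    if score <= start:
        return 0.0
    if score >= end:
        return 1.0
    return (score - start) / (end - start)

def steep_penalty(score, threshold=0.6):
    """Penalty that jumps from 0 to 1 as soon as the threshold is crossed."""
    return 1.0 if score > threshold else 0.0

# Compare both behaviours for a gradually increasing unnaturalness score.
for score in (0.2, 0.4, 0.59, 0.61, 0.8):
    print(score, round(sloping_penalty(score), 2), steep_penalty(score))
```

With the steep variant, a site keeps its ranking right up to the threshold and then loses it all at once, which matches the pattern described in the tests above. Since thresholds appear to differ per site, `threshold` would in practice vary with other signals such as link value.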

You might even argue that Panda has replaced a previous algorithm that had a sloping threshold, because many sites with thin content below the Panda threshold have returned to top-10 positions.

Domain, section or page based effect?

Panda affects large numbers of pages within the same domain. It doesn’t target long-tail keywords, but pages with these keywords tend to be in sections with many pages that have low-quality content.

Sections of pages can be grouped by many factors like block element buildup. Once a threshold within these pages is reached, all pages in the section are affected, including ones with a slightly higher quality.
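A rough sketch of this section-level effect, assuming pages have already been grouped into sections and given a quality score between 0 and 1. The field names, the `quality_floor` and `max_low_share` values and the sample pages are all hypothetical.

```python
from collections import defaultdict

def affected_pages(pages, quality_floor=0.5, max_low_share=0.4):
    """Return URLs of all pages in sections where too many pages are low quality."""
    by_section = defaultdict(list)
    for page in pages:
        by_section[page["section"]].append(page)

    hit = []
    for section_pages in by_section.values():
        low = sum(1 for p in section_pages if p["quality"] < quality_floor)
        # Once the share of weak pages crosses the limit, the whole section is hit.
        if low / len(section_pages) > max_low_share:
            hit.extend(p["url"] for p in section_pages)
    return hit

# Invented sample data for illustration only.
PAGES = [
    {"url": "/tags/a", "section": "tags", "quality": 0.2},
    {"url": "/tags/b", "section": "tags", "quality": 0.3},
    {"url": "/tags/c", "section": "tags", "quality": 0.7},
    {"url": "/guides/x", "section": "guides", "quality": 0.9},
    {"url": "/guides/y", "section": "guides", "quality": 0.4},
    {"url": "/guides/z", "section": "guides", "quality": 0.8},
]

print(affected_pages(PAGES))
```

Here `/tags/c` is affected even though its own quality is fine, because two of the three pages in its section fall below the floor. The weak page `/guides/y` escapes because its section as a whole stays under the limit.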

Once you have been hit, recovering requires more effort than just improving quality until you are back under the threshold. Changing domain, however (including a 301 redirect), seems to restore your ranking if you stay just under it. Just changing URLs within the same domain doesn’t seem to have this effect.

Solution against Panda plagues

Panda targets sites with large numbers of pages below a quality threshold. Be prepared for Panda claws if you reuse sentences in which only a couple of keywords differ from other pages, if you have a lot of content from other websites, if you make a lot of spelling or grammatical errors, or if you have excessive ads on your pages. Assuring quality for all pages might be hard, but make sure you do this for all pages that are important for your visitors and for Google. All pages below a reasonable quality level should be removed or excluded from the Google index (canonical tag, noindex, etc.).
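One way to find pages that only swap a couple of keywords is to compare their word sequences. The sketch below uses word bigrams ("shingles") and Jaccard similarity. This is a common near-duplicate technique chosen for illustration, not a known part of Panda, and the sample sentences are invented.

```python
def shingles(text, size=2):
    """Set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(text_a, text_b, size=2):
    """Share of shingles that two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Invented examples: two templated pages and one unrelated page.
page_a = "cheap hotels in amsterdam book your room today and save money"
page_b = "cheap hotels in rotterdam book your room today and save money"
page_c = "our editors spent a week testing every bike route through the city"

print(round(jaccard(page_a, page_b), 2))
print(jaccard(page_a, page_c))
```

Pairs with a high similarity are candidates for rewriting, merging under a canonical URL, or exclusion from the index. Where to draw the line is a judgment call that depends on your own content.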

Pages with sentences like “no results found for [keyword]” are often crawlable by Google. Pages like these can be mistaken for malicious intent, so take them into account as well.
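A small sketch of how a site template could keep such empty result pages out of the index. The function name and the phrase being matched are assumptions for this example. The emitted tag is the standard robots meta tag.

```python
import re

# Phrase that marks an empty internal search result page (assumed wording).
EMPTY_RESULT_PATTERN = re.compile(r"no results found for", re.IGNORECASE)

def robots_meta(page_text):
    """Return a robots meta tag that keeps empty result pages out of the index."""
    if EMPTY_RESULT_PATTERN.search(page_text):
        return '<meta name="robots" content="noindex">'
    return '<meta name="robots" content="index, follow">'

print(robots_meta("No results found for [keyword]"))
print(robots_meta("12 results found for pandas"))
```

Returning a real 404 status or blocking internal search URLs altogether are alternatives; which one fits depends on how the site is built.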

If that doesn’t work, you can always build a Panda trap.

Hopefully this article has clarified some misconceptions. Note that this is the consensus of many search experts and represents the supposed current situation. If there is any proof to refute this article, please comment. We’re all more than willing to learn.


Written By
This post was written by an author who is not a regular contributor to State of Digital. See all the other regular State of Digital authors here. Opinions expressed in the article are those of the contributor and not necessarily those of State of Digital.