One step ahead in link-building: Make sure you don’t get cheated

From what I heard, online marketers and especially SEOs might sometimes be interested in getting links pointing to a self-owned (or a client’s) website 😉 OK, fun aside. One of the things that really annoys me when building links is people trying to cheat you. I’m not talking about getting a PR 4 link instead of a PR 6 one, but about stuff like removing a link after striking a deal (which is more or less the easiest to detect), cloaking content based on user-agents/IPs, or using the X-Robots-Tag header or the robots Meta tag to prevent pages from being indexed.

This article is a small checklist of things you’ll have to watch out for or, if possible, that your link-building backend should keep an eye on. Make sure you run these checks on a regular basis, say at least once a week.

To have some URL patterns to show, I’m going to use www.domain.com (as the root domain) and www.domain.com/path/content.html (the page the link is placed on) in the following examples. And just to be sure: I take it as a given that the link you agreed on for that specific page is implemented as a hyperlink using the good old <a> tag!

1. robots.txt
First of all, I’d recommend making sure that the URL is not disallowed in robots.txt (which would be located at www.domain.com/robots.txt). In our case we’d have to watch out for multiple patterns, including:

Disallow: /path
Disallow: /*.html$
Disallow: /*content
Disallow: /path/content.html

You get the idea. Also remember to check ALL relevant user-agent groups, because sometimes these pages are only disallowed for one specific search engine, let’s say Google.
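If you want to automate this check, here is a minimal sketch in Python. As far as I know the standard library's urllib.robotparser only does plain prefix matching, so this sketch translates the "*" and "$" patterns shown above into regular expressions itself. The function names (pattern_to_regex, parse_groups, blocking_rules) are my own, and the sketch deliberately ignores Allow lines and only falls back to the "*" group when there is no group for the requested user-agent, so treat it as a starting point rather than a complete robots.txt implementation.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def parse_groups(robots_txt):
    """Return {lowercased user-agent: [disallow pattern, ...]} from robots.txt text."""
    groups, agents, in_rules = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                          # previous group is finished
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:     # an empty Disallow blocks nothing
                for agent in agents:
                    groups[agent].append(value)
    return groups

def blocking_rules(robots_txt, path, user_agent):
    """List the Disallow patterns that match `path` for `user_agent` (else for '*')."""
    groups = parse_groups(robots_txt)
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    return [rule for rule in rules if pattern_to_regex(rule).match(path)]
```

Calling blocking_rules(text, "/path/content.html", "Googlebot") once per user-agent you care about gives you the list of rules to complain about; an empty list means the URL is not disallowed for that crawler.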

2. HTTP status code & X-Robots-Tag header
The second thing to do is send a simple HTTP HEAD request to the URL where the link is placed. In the response you should especially pay attention to:

The HTTP response code: If the HTTP response code is NOT a 200 (= OK) or a 304 (= Not Modified, i.e. your cached copy is still valid), there might be trouble.
The X-Robots-Tag header: You generally won’t notice it when just browsing the website, but it’s pretty easy to check in an automated way. Just grab the “X-Robots-Tag” header (if it’s not present you can skip this step, everything is fine for now!) and check its value. If it contains “noindex” or “nofollow” (or even both), you need to contact the link owner because for now the link is worthless: a/ the page is not being indexed, b/ the link is not being followed, or c/ both.
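Both checks fit into a few lines of Python. The following is only a sketch: head() and header_problems() are names I made up, the user-agent string is a placeholder, and I also flag the “none” value because it is shorthand for “noindex, nofollow”. Keep in mind that urllib follows redirects on its own, so the status you get back is the one of the final URL.

```python
import re
import urllib.error
import urllib.request

def head(url, user_agent="link-monitor-sketch/0.1"):
    """Send a HEAD request; return (status code, list of (header name, value))."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, list(response.headers.items())
    except urllib.error.HTTPError as error:   # 4xx/5xx arrive as exceptions
        return error.code, list(error.headers.items())

def header_problems(status, headers):
    """Return human-readable problems found in the status code and X-Robots-Tag."""
    problems = []
    if status not in (200, 304):
        problems.append(f"unexpected status {status}")
    for name, value in headers:
        if name.lower() == "x-robots-tag":
            # Tokenize so values like "googlebot: noindex, nofollow" work too.
            tokens = set(re.findall(r"[a-z]+", value.lower()))
            blocked = sorted(tokens & {"noindex", "nofollow", "none"})
            if blocked:
                problems.append("X-Robots-Tag: " + ", ".join(blocked))
    return problems
```

In a real run you would call header_problems(*head("http://www.domain.com/path/content.html")) and alert on any non-empty result. Splitting the network call from the evaluation keeps the evaluation easy to test with canned responses.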

2. Meta robots & canonical tag
The next things to watch out for are the Meta robots tag as well as the canonical tag. Both can be set in a way that your link is completely worthless:

Meta robots tag: It’s pretty much the same as for the X-Robots-Tag header; you need to watch out for a “noindex” or “nofollow” value. By the way, another pretty rare value is “none”, which is basically a shortcut for “noindex, nofollow”. Make sure you also check for this one!
Canonical tag: If its value is different from www.domain.com/path/content.html, the link doesn’t help you. As we all know, search engines treat a canonical much like a 301 redirect, so in this case they would pass all link juice from this URL to the one the canonical tag points to.
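Here is a rough sketch of both page-level checks using Python's built-in html.parser. The class and function names are mine, I assume the page is served via http://, and the URL comparison is deliberately naive (it only resolves relative canonicals and ignores a trailing slash), so adapt it to your own URL normalization.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HeadTagParser(HTMLParser):
    """Collect the robots meta directives and the canonical URL of a page."""

    def __init__(self):
        super().__init__()
        self.robots = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.robots |= {d.strip().lower() for d in content.split(",") if d.strip()}
        elif tag == "link" and "canonical" in attrs.get("rel", "").lower().split():
            self.canonical = attrs.get("href")

def page_level_problems(html, expected_url):
    """Return blocking robots directives and a mismatching canonical, if any."""
    parser = HeadTagParser()
    parser.feed(html)
    problems = sorted(parser.robots & {"noindex", "nofollow", "none"})
    if parser.canonical:
        resolved = urljoin(expected_url, parser.canonical)   # handles relative hrefs
        if resolved.rstrip("/") != expected_url.rstrip("/"):
            problems.append("canonical -> " + resolved)
    return problems
```

Feed it the HTML of www.domain.com/path/content.html together with the URL you expect; an empty list means neither the Meta robots tag nor the canonical tag works against your link.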

3. The rel-attribute
Let’s move on to the link level, shall we? As mentioned earlier, the obvious thing to do is to check for the link itself. This could be done with a regular expression, which would also allow you to validate the href attribute and of course the anchor text.
Additionally, you’d have access to the other attributes being used, for example rel=(*), to detect “nofollow”ed links (if you care about that).
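A regular expression works, but for this sketch I swapped in Python's html.parser because it copes better with attribute order and odd quoting. Everything named here (AnchorCollector, check_link, the www.example.org target in the usage note) is made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href, rel values and anchor text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href") or "",
                "rel": (attrs.get("rel") or "").lower().split(),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse whitespace so line breaks inside the anchor do not matter.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None

def check_link(html, target_url, expected_anchor):
    """Return problems with the first link pointing at target_url."""
    parser = AnchorCollector()
    parser.feed(html)
    matches = [link for link in parser.links if link["href"] == target_url]
    if not matches:
        return ["link missing"]
    problems = []
    if "nofollow" in matches[0]["rel"]:
        problems.append("rel=nofollow")
    if matches[0]["text"] != expected_anchor:
        problems.append("anchor text is " + repr(matches[0]["text"]))
    return problems
```

For example, check_link(html, "http://www.example.org/", "Great shoes") returns an empty list only if the link is there, is not “nofollow”ed and carries the agreed anchor text.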

4. Cloaking
Another thing that seems to be getting quite popular at the moment is cloaking links away from search engines. Basically, some webmasters deliver the content including the hyperlink to users but remove the link when a crawler accesses the website (to reduce the number of outbound links, I guess). The most common tactic is still doing it on a per-user-agent basis, which is pretty easy to detect. If it gets more advanced, like IP-based delivery, you’d probably also have to check the Google cached version of that specific page to really make sure the link is present.
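The user-agent variant can be detected by fetching the page twice and comparing. The sketch below does exactly that; the two user-agent strings are just examples, the function names are mine, and the simple substring test is a shortcut you may want to replace with a proper link check. It cannot catch IP-based cloaking, because your requests still come from your own IP.

```python
import urllib.request

# Example user-agent strings; swap in whatever browsers and crawlers you care about.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

def fetch(url, user_agent):
    """Download a page as text while pretending to be the given user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def cloaking_report(page_url, target_url, fetcher=fetch):
    """Map each user-agent label to whether target_url appears in the HTML it gets."""
    return {
        label: target_url in fetcher(page_url, user_agent)
        for label, user_agent in USER_AGENTS.items()
    }

def looks_cloaked(report):
    """True if the link is present for some user-agents but not for others."""
    return len(set(report.values())) > 1
```

The fetcher parameter exists so you can plug in a fake download function for testing, or a version that routes requests through a proxy.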

5. More stuff to watch out for
– Is that page being indexed at all? After a while (especially when it’s a new page) you should definitely search Google for that specific URL to make sure the page is in the index. If not, is the page linked internally so that it gets any link juice from the domain at all?
– It could also be very interesting to monitor the number of outbound links on www.domain.com/path/content.html. If that number rises big-time, it might be bad for you: you’ll probably get less link power, and maybe the page has been turned into an “obvious link-selling page” from that point onwards.
– And last but not least, it might be interesting to monitor whether some of the common “ppc” keywords / bad words show up on that page. Maybe the page got hacked and you’re now in a very bad neighborhood? I’d love to know 😉
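The last two points are easy to track over time. Here is a small sketch that counts outbound links and looks for words from a watchlist; the function names and the watchlist itself are hypothetical, and “outbound” simply means “http(s) link to a different hostname” here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def outbound_links(html, page_url):
    """Return all http(s) links that point to a different hostname than page_url."""
    parser = HrefCollector()
    parser.feed(html)
    own_host = urlparse(page_url).hostname
    resolved = [urljoin(page_url, href) for href in parser.hrefs]
    return [
        url for url in resolved
        if urlparse(url).scheme in ("http", "https")
        and urlparse(url).hostname != own_host
    ]

def flagged_words(html, watchlist):
    """Return the watchlist words that occur in the visible text (roughly)."""
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag stripping
    words = set(re.findall(r"[a-z0-9]+", text))
    return sorted(words & {word.lower() for word in watchlist})
```

Store len(outbound_links(...)) with every weekly run and alert when it jumps; the same goes for a non-empty flagged_words(...) result.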

I hope this helps a little bit when making link deals or building a tool to make sure those links stay in place. If you don’t have a tool to do all the tasks at once, here are some Firefox add-ons to do some of that work for you:

robots.txt: SearchStatus has a “show robots.txt” shortcut, viewing it is just two clicks away. Go get it here.
HTTP status code: Live HTTP Headers could do the trick.
X-Robots-Tag & Meta robots: SeeRobots does a great job of visualizing both values without having to look at headers or the source; grab it here.
The canonical tag can easily be seen directly in Firefox (the blue icon next to the address bar), no add-on needed.
Rel-attribute: SearchStatus does “nofollow” highlighting (and much more) – see the link above.
To check whether a link uses the appropriate tag, it’s probably easiest to use Firebug (here) and its “inspect HTML” function.
Cloaking: To check for user-agent based delivery you could use the web developer toolbar in combination with the user-agent switcher.