Please read the addition made at the bottom of this post as well as the comments on how this post evolved.
The post was published quickly as I was excited to share my findings from a recent “test” I was working on. The way I reported on this test was flawed and I rushed to conclusions about the degree of the finding. It is now clear to me that the contents of the test page have not been indexed.
I have left the post live and relatively unchanged to show that I clearly made a mistake here and that I’m willing to acknowledge it. There is plenty that both I and others can learn from this post (primarily, how not to run and report on an experiment). The mistakes were the result of overexcitement rather than any intention to misinform, and I sincerely apologise for them.
I was hoping to have the evidence to include this in my last post when I stated that there were some major concerns with Google+, but I can now say with certainty: Google will ignore Robots.txt and index a page* that you have specifically blocked once the page has received a +1.
My concern was two-fold after reading this in Google’s FAQ on Google+, because the wording suggests that either: a. any page that contains the +1 badge could be indexed (still possible), or b. any page that has been +1’d could be indexed (now confirmed*).
Why is this newsworthy or a particularly big deal? Because of the page that is now indexed ( ): there is not a Google +1 badge on that page and there never has been, and the page has never been linked to internally or externally (other than from Google+).
While speaking with my colleague Curtis before giving an internal presentation on Google+, we explored the possibility that Google could ignore Robots.txt if a page had been +1’d. It would seem that Google’s justification for doing so is that if a page can be +1’d, then it is deemed to be of public interest and therefore belongs in their index. The problem is: just because a page has received a +1 does not mean that the site owner ever wanted that page to appear in the index.
So, in order to prove our suspicion, I did the following:
- Created a text-only page on my own site without the Google+ button
- Installed the Google+ Chrome Extension (an official Google product) in my browser
- Used the toolbar to +1 the page
- Emailed Curtis and a couple of other colleagues with the strict instructions to only use the same methodology (i.e. use the toolbar), not to share an actual comment or push the +1’d page to any of their following (which they have confirmed they did)
- Waited to confirm the results
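For anyone wanting to repeat the first step, it helps to confirm that the test URL really is disallowed before +1’ing it. Below is a minimal sketch in Python using the standard library’s urllib.robotparser. The robots.txt contents, the example.com domain and the /plus-one-test/ path are all made up for illustration; they are not the actual rules or URLs from this test.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# The real file used in the test is not reproduced in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /plus-one-test/
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder domain.
    print(is_blocked(ROBOTS_TXT, "https://www.example.com/plus-one-test/page.html"))
    print(is_blocked(ROBOTS_TXT, "https://www.example.com/about.html"))
```

This only checks what the rules say a well-behaved crawler may fetch. It says nothing about whether a URL ends up listed in a search index, which is the distinction the edit at the bottom of this post turns on.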
As you can see, the page is now indexed. To confirm these results on another website, we repeated the test on a third-party site: a page there that had also been blocked by Robots.txt (and also lacks the +1 badge) now appears in the index as well. In that case the page in question only received two +1’s (one from me and one from Curtis).
Why is this a Problem?
This is a problem because it creates a massive loophole that could be exploited by less well-intentioned individuals within the industry. All someone would need to do is look at which pages you don’t want indexed, use the toolbar to +1 them, and wait a few days.*
This is a problem because someone without any affiliation to (or control of) a website can get a page indexed this quickly, overruling the site owner’s (hopefully) carefully crafted rules.
This is a problem because it means just about any disgruntled employee/developer could get a dev server (or any other private/important pages not protected by password) indexed.
This is a problem because it completely defeats the purpose of robots.txt and this standard effectively no longer means anything at all.
How and whether Google respond to this remains to be seen, but it is indeed a very big problem and raises some considerable concerns about both indexation and security.
*Edit: As noted in Dennis’ comment below and a conversation I had with @wiep on Twitter, the page (i.e. the content of the page) has not been indexed (at this point), but rather the URL. I still believe this warrants an update to Google’s advice on Robots.txt to clarify and specifically acknowledge the impact of +1’s. We will keep a close eye on the situation and let you know of further developments.
My reluctance to put metadata on the page led me to rush to conclusions, and I apologise for the wording of my findings and for misrepresenting what I perceived to be happening. I very much appreciate the points raised below in the comments.
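For readers wondering what page-level metadata might look like in practice, here is a minimal sketch, assuming a robots meta tag with a noindex directive is the kind of metadata in question (the post does not say which metadata was considered). It uses Python’s standard html.parser to check whether a page’s HTML carries that directive. The class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Hypothetical sample markup, not the actual test page.
    sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(has_noindex(sample))
```

One caveat worth keeping in mind: a crawler can only read a tag like this if it is allowed to fetch the page, so a noindex directive and a Robots.txt block on the same URL do not work together.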