Updated: Will A +1 Get A Page Indexed Even When Blocked by Robots.txt?

13 February 2012 BY

Please read the addition made at the bottom of this post as well as the comments on how this post evolved.

The post was published quickly as I was excited to share my findings from a recent “test” I was working on. The way I reported on this test was flawed and I rushed to conclusions about the degree of the finding. It is now clear to me that the contents of the test page have not been indexed.

I have left the post live and relatively unchanged to demonstrate that I clearly made a mistake here and that I am willing to acknowledge it. There is plenty that both I and others can learn from this post (primarily, how not to run and report on an experiment). The mistakes were the result of over-excitement rather than any intention to misinform, and I sincerely apologise for them.

I was hoping to have the evidence to include this in my last post when I stated that there were some major concerns with Google+, but I can now say with certainty: Google will ignore Robots.txt and index a page* that you have specifically blocked once the page has received a +1.

My concern after reading this in Google’s FAQ on Google+ was two-fold, in that the wording suggests that either: a. any page that contains the +1 badge could be indexed (still possible) or b. any page that has been +1’d could be indexed (now confirmed*)

Why is this newsworthy or a particularly big deal? Because look at the page that is now indexed ( ): there is not a Google +1 badge on that page and there never has been, and the page has never been linked to internally or externally (barring from Google+).

The Experiment:

After speaking with my colleague Curtis before giving an internal presentation on Google+ we explored the fact that Google could ignore Robots.txt if a page had been +1’d. It would seem that Google’s justification for so doing is the fact that if a page can be +1’d, then it is deemed to be publicly of interest and therefore belongs in their index. The problem is: just because a page has received a +1 does not mean that the site owner ever wanted that page to appear in the index.

So, in order to prove our suspicion, I did the following:

  • Created a text-only page on my own site without the Google+ button
  • Installed the Google+ Chrome Extension (an official Google product) in my browser
  • Used the toolbar to +1 the page
  • Emailed Curtis and a couple of other colleagues with the strict instructions to only use the same methodology (i.e. use the toolbar), not to share an actual comment or push the +1’d page to any of their following (which they have confirmed they did)
  • Waited to confirm the results

As you can see, the page is now indexed. To confirm these results on another website, we repeated the test on a third-party site, where a page that had also been blocked by Robots.txt (and also lacks the +1 badge) now appears in the index as well. In that case the page in question received only two +1’s (one from me and one from Curtis).
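
For anyone wanting to reproduce the setup, the precondition (that the test page really is disallowed for Googlebot) can be checked before any +1 is given. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt rules, the `example.com` URLs and the helper name `is_crawl_allowed` are hypothetical and are not taken from the actual test.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; the path is an assumption,
# not the rule used on the real test site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /plus-one-test/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The test page sits under the disallowed path; the other page does not.
blocked = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/plus-one-test/page.html"
)
allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/about.html")
print(blocked, allowed)
```

Note that this only tells you whether a compliant crawler may fetch the page. It says nothing about whether the URL itself can appear in search results.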

Why is this a Problem?

This is a problem because it creates a massive loophole that could be exploited by less well-intentioned individuals within the industry – all someone would need to do is look at what pages you don’t want indexed, use the toolbar to +1 the page, and wait a few days.*

This is a problem because someone without any affiliation to (or control of) a website can get a page indexed this quickly, overruling the site owner’s (hopefully) carefully crafted rules.

This is a problem because it means just about any disgruntled employee/developer could get a dev server (or any other private/important pages not protected by password) indexed.

This is a problem because it completely defeats the purpose of robots.txt and this standard effectively no longer means anything at all.

How and whether Google respond to this remains to be seen, but it is indeed a very big problem and raises some considerable concerns about both indexation and security.

*Edit: As noted in Dennis’ comment below and a conversation I had with @wiep on Twitter, the page (i.e. the content of the page) has not been indexed (at this point), but rather the URL. I still believe this warrants an update to Google’s advice on Robots.txt to clarify and specifically acknowledge the impact of +1’s. We will keep a close eye on the situation and let you know of further developments.

My reluctance to put meta data on the page led to me rushing to conclusions and I apologise for the wording of my findings and misrepresentation of what I perceived to be happening. I very much appreciate the points raised below in the comments.
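
Several comments below suggest re-running the test with a robots meta tag on the page. As a starting point, here is a minimal sketch that reads the robots directives from a page's HTML using Python's standard-library `html.parser`; the sample markup and the names `MetaRobotsParser` and `robots_directives` are illustrative assumptions, not part of the original test page.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Hypothetical test page markup (an assumption for illustration).
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "<title>Test</title></head><body>Some text</body></html>"
)
directives = robots_directives(PAGE)
print(sorted(directives))
```

A crawler can only see such a tag if it is allowed to fetch the page, which is worth keeping in mind when combining a meta tag with a robots.txt block.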

AUTHORED BY:

Sam Crocker is SEO Associate Director at OMD UK. Sam focuses on increasing traffic and conversions for websites whilst always keeping his eye on a company’s bottom line.
  • Dennis Sievers

    As far as I know, Google always indexes URL’s from blocked pages, just not its contents.

    • Sam Crocker

      Hi Dennis, a valid nuance, but I would still find this slightly alarming and deserving of an update to Google’s Webmaster Tools help on Robots.txt to include +1’s if nothing else:
      “While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.” (current)

      • Dennis Sievers

        @samuel

        I definitely think indexing these URL’s should be prevented by Google. They are not allowed to crawl the URL’s but keep on indexing them, while there is no actual value to it. So, I understand the issue you have with this.

        If they keep indexing the page with a noindex,nofollow on it, then we have “a situation”. I don’t think they will, but maybe it’s worth a test.

        • http://www.seoworkers.com John S. Britsios

          @Dennis Sievers, the “nofollow” meta robots directive was never a good choice, at least when it comes to Google and PageRank. Using that directive, you are creating “Dangling Nodes”. Is that what we want? Not me.

          • Dennis Sievers

            I was only referring to the test. And yes, you might want to keep certain pages from being indexed but still let linkvalue pass through, I’m perfectly aware of that.

  • http://Twitter.com/Surfpunkian Ian Daniels

    Good post Sam, worth adding meta data to a test page and re-running?

    • Sam Crocker

      Hi Ian, yeah, I certainly think so.

      Will get on that (will probably do it on a new page to prevent any potential link metrics muddying the waters)!

  • Pingback: +1′s Will Get a Page Indexed Even When Blocked by Robots.txt – Confirmed – State of Search | eWebmaster()

  • Scott

    I can also confirm that a website blocked by robots.txt can get indexed just by pinging Google….

    • http://www.seoworkers.com John S. Britsios

      @Scott , since you confirmed that a website blocked by robots.txt can get indexed just by pinging Google, can you tell us here if you see the title and the description of the indexed page in the search results? Or do you just see the URL?

  • http://www.screamingfrog.co.uk Screaming Frog

    Yeah, robots.txt prevents bots from accessing the content of the URL, not its inclusion in the SERPs.

    So if you link to a page which is blocked via robots.txt, you can still see the URL reference in the SERPs.

    So your test shows that +1 might be acting in a similar way to a link, letting Google know about the URL which is interesting. :-)

  • http://andybeard.eu Andy Beard

    Please change the headline to something truthful

    Google has always allowed pages to be referenced in this way but that doesn’t mean they have crawled the page.

    For something to be “indexed” the page has to be crawled.

    • http://www.seoworkers.com John S. Britsios

      @Andy, don’t you think it would be enough just to add the meta robots directive “nosnippet” to a page blocked by robots.txt to prevent the referencing URLs showing up in their search results?

  • James Taylor

    You seem to misunderstand what “indexed” means in the context of Google. As a point of fact, Google has not “indexed” this content. Search for some of the page text to be sure. What’s happening here is something that has practically always happened. When Google knows about a page through links (or +1s, apparently) it will include the URL in results but no meta-data (such as a title or description), because it can’t retrieve that data: it’s blocked by robots.txt. This is not new, and is expected behaviour if you’re familiar with how this works.

    In summary, if Google knows about a URL it will display that URL in results pages whether it has indexed it or not. Google knows about pages through links and +1s.

    Quite frankly, if you’re using robots.txt to prevent people accessing a “dev server or any other private/important pages not protected by password” you’ve got big problems.

  • http://scribble.scran.ac.uk/user63063/weblog/ John Meffen

    Google can do more than that. As SEOMofo showed, you can get Google to show non-existent pages in the SERPs by linking to them, since robots.txt specifically means that Googlebot will not request the content from the server.

    http://www.seomofo.com/experiments/spam-search-results.html

  • Sam Crocker

    Hi Guys,

    Quick further update as per all of the comments. I posted this first thing this morning without having properly confirmed whether the URL had been included in the search results or whether the page had in fact been indexed. As a result of the test page I created not including any meta data I thought this might have been the cause and was excited with the finding so thought I’d share before reviewing all of the possibilities.

    Whilst the finding is not as strong as I had initially hoped I still think there are some important things to highlight here:

    1. This is still an interesting find in my view as it confirms that Google is using +1 metrics to find content and potentially gauge the importance of a page (and would indeed have liked to have crawled this page had they been given permission). Not new, but good to confirm.

    2. When I run competitor intel I always check for (*.site.com) and run a site search to see if they’ve been foolish enough to index their dev server. The fact that this could get these URLs into the public domain remains true and remains a concern.

    3. This could still be used for evil.

    @Andy – Title changed as requested. Apologies if you found that misleading, was certainly not my intent and I definitely have now tried to highlight this with the edit, the change to the title and this comment.

    @James – You are correct on all your points. Google knows about this page through +1s and the links created by the act of +1ing a page. And whilst I couldn’t agree more that people using robots.txt to prevent people accessing a dev server or other private pages not protected by passwords creates a big problem… I’ve still seen it more times than you would think.

    Thank you for the feedback and apologies for any misunderstanding/over-excitement on my part. I have toned down the wording to reflect this, though I will leave the original post with the same meaning to allow for continued discussion and to make it clear that I misspoke here and published too early.

    • http://www.linkfishmedia.com Julie Joyce

      Hi Sam,

      You’ve just shown everyone how to be a class act sir. Your time in NC was obviously well spent.

      Julie

  • http://stokedseo.co.uk Gaz Copeland

    Just to add my voice to the above: robots.txt is not the correct way to stop a page being indexed, so there is no surprise in this result. G will also use as many ways as possible to find pages, including G+, so again, not a shock at all.

    If you got the same result using a meta robots tag now that would be interesting!

    • http://www.radicke.com David

      We have not used “disallow” in ages as it was not strong enough.

      Also, I do not trust “rel=canonical” to take care of duplicates (as they are first indexed, and then canonized away which is never as good as not having them indexed in the first place.)

      I really wonder why there is no “noindex” for robots.txt – that would make things a bit easier as the robots.txt can be edited so much easier than the CMS (like wordpress…)

      • http://www.seoworkers.com John S. Britsios

        @David, Googlebot has (unofficially) supported a Noindex directive in robots.txt for years. But I would not rely on that without caution, because it can change at any time.

  • http://www.gplus.to/robsearch Robert Nyberg

    Hi Sam,

    Great test and it feels the same way as you describe in your reply about robots.txt and that G still may index the url, due to backlinks. There are always ways to manipulate things and Google need to be aware and careful about stuff like this.

  • http://www.christianoliveira.com Christian Oliveira

    Hi Sam!

    I posted about this same “issue” back in October too (http://www.christianoliveira.com/blog/seo/google-plusone-robots-txt-meta-robots/ – it’s in Spanish). As other people stated here, robots.txt only prevents bots from crawling, not from indexing, but it’s good that this discussion has come up, as I think there is a lot of confusion about this behaviour. So someone should post about it specifically (and Google should be far clearer about this).

    Thanks for posting anyway!

  • http://keithbrown.com/ Keith

    Glad you changed and added the new info, although I understand your rush to publish. I think people here are being way too hard, you were just trying to share something you thought you stumbled on.

  • https://twitter.com/#!/sergeesteves serge esteves

    Hi Sam

    I will not repeat that robots.txt doesn’t prevent indexing ;). But the true conclusion is that a +1 (or a Facebook like, certainly) acts like a link. Interesting!

  • http://www.seoworkers.com John S. Britsios

    @serge esteves, can you explain how a bot can index a page if the bot cannot access the page, being blocked via robots.txt or a 403?

  • http://www.basvandenbeld.com Bas van den Beld

    So that was an interesting lesson we learned there :). It shows how excited we can be if we spot something and that sometimes you should take a deep breath before publishing. I think everybody has made mistakes like this in the past. I must say however Sam that you handled this very well!

    To give some more insight in to the discussion it might be nice to read through some of the things which have been said around the web about the matter, because there is much more to this topic than just the fact that we should have looked a bit further.

    Take a look at this discussion: https://plus.google.com/107576957488923607021/posts/PCqaNsmsabz, where Googler John Mueller commented and see what another Googler, Pierre Far, has to say about the matter here: https://plus.google.com/115984868678744352358/posts/ToENw39cgmG

    Lots to learn from this!

  • http://www.jorgegonzález.com jorge González

    Hi Sam,

    I ran tests trying to get a new domain indexed using only +1s and there was no way to do it, so what some people are saying is not true; it may be that a +1 can get an internal URL of an already indexed domain indexed.
    On the other hand, we are very critical, and it is only through mistakes that we can move forward, since they create debate.

    • http://samuelcrocker.com Sam Crocker

      Thanks for the comment, Jorge, and for sharing what happened with your domain.

      In the end, the truth is that it is important to be critical (and sometimes to make mistakes), because otherwise we will not move forward, as you have already said.

      The best thing to come out of my mistakes yesterday was the comments from some people at Google (such as those in Bas’s links above) and the fact that they really did create a debate.

      Next time I would prefer that they were not my mistakes :), but we have to learn, and I will keep running tests like this to learn more.

  • http://twitter.com/SeoKungFu Boris Krumov

    I don’t see a problem here – if you want a page not getting indexed, you should not allow any sharing or links if possible.
