Dutch researcher downloads 35 million Google Profiles
Aren’t they lovely, the new Google Profiles? And you can put so much information in them. Information which everybody can see. And download… We’ve discussed the privacy matters around the profiles before, and I will soon be talking about the presentation I gave at SMX about the profiles too. But there is a lot more to the Google Profiles. A Dutch researcher was able to download, export and import 35 million Google Profiles, data included.
The researcher, Matthijs Koot, who works for the University of Amsterdam, is writing a research paper about anonymity and privacy. For that research he decided to look at the Google Profiles. He noted that a lot of the information can be downloaded pretty easily.
Last February Koot created “a database containing ALL ~35.000.000 Google Profiles without Google throttling, blocking, CAPTCHAing or otherwise make more difficult mass-downloading attempts.” He was able to import all the data into one of his own databases. He used a sitemap from Google to download all the data.
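To get a feel for why a sitemap makes this kind of collection so straightforward, here is a minimal sketch of reading profile URLs out of a sitemap. This is not Koot's code: the sample XML, the constant names and the function name `list_profile_urls` are made up for illustration, and the sketch only assumes the sitemap follows the standard sitemaps.org format. It parses a string and downloads nothing.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org format (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample data, using the two URL shapes mentioned in this post.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://profiles.google.com/12345678901234567890</loc></url>
  <url><loc>https://profiles.google.com/USERNAME</loc></url>
</urlset>
"""


def list_profile_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    # Each <url> entry carries one <loc> element holding the page address.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


if __name__ == "__main__":
    for url in list_profile_urls(SAMPLE_SITEMAP):
        print(url)
```

The point is that a sitemap is simply a machine-readable list of addresses, so anyone who can read it has the complete list of public profile pages in hand.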
The scary part is that the database contains “Twitter conversations (also stored in the OZ_initData variable) , person names, aliases/nicknames, multiple past educations (institute, study, start/end date), multiple past work experiences (employer, function, start/end date), links to Picasa photoalbums, …. — and in ~15.000.000 cases, also the username and therefore @gmail.com address.”
Google doesn’t mind
The information which has been downloaded is freely accessible to everybody. Google actually allows this itself by letting the profiles be indexed. Koot published the code he used on his blog and is now hoping Google won’t kick him off Blogger, the platform he blogs on.
Google Netherlands has already responded, saying that there is nothing wrong here. The data which is listed in the sitemaps is, after all, already publicly visible. It is not a leak; the data is out there already. This of course is the ‘easy’ answer. Yes, it is data which is public already, but should it be downloadable that easily? Also, the data reachable through the sitemap can, with some help, easily be connected to personal data gathered elsewhere. If you have somebody’s e-mail address, for example, you can enrich the profile you have on them with the data in the Google Profiles.
With Google Profiles being pushed as the ‘landing page’ for your online identity, Google has also promoted the option to give your profile a nicer URL, namely one with your username in it. A Google Profile URL can look like this: https://profiles.google.com/12345678901234567890 or like this: https://profiles.google.com/USERNAME. The latter of course looks nicer, but it also exposes your username, which makes it possible to connect the profile data to your e-mail address.
Google specifically mentions in its privacy settings that this can make your e-mail address publicly discoverable:
“To make it easier for people to find your profile, you can customize your URL with your Google email username. (Note this can make your Google email address publicly discoverable.)”
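How little it takes to go from a customized URL to an e-mail address can be shown with a small sketch. The function name `gmail_from_profile_url` is hypothetical, and the rule it applies is an assumption based on the two URL shapes shown above: an all-digit path is treated as an opaque ID, anything else as a Google username.

```python
from typing import Optional
from urllib.parse import urlparse


def gmail_from_profile_url(url: str) -> Optional[str]:
    """Guess the @gmail.com address implied by a profile URL, if any.

    Assumption: an all-digit path is an opaque profile ID that reveals
    nothing, while any other path is the owner's Google username.
    """
    slug = urlparse(url).path.strip("/")
    if not slug or slug.isdigit():
        return None  # numeric ID (or no path): no username exposed
    return f"{slug}@gmail.com"


if __name__ == "__main__":
    print(gmail_from_profile_url("https://profiles.google.com/12345678901234567890"))
    print(gmail_from_profile_url("https://profiles.google.com/USERNAME"))
```

No lookup or guessing is involved: for profiles with a customized URL, the address is effectively part of the link itself.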
Because these connections can be made, it becomes much easier to actually misuse the data. This is something which spammers and phishers will gladly take advantage of. Even though Google ‘officially’ isn’t doing anything wrong, having the data out there and downloadable that easily doesn’t seem right.
Again, it is clear that Google has to watch its step and that you need to be careful about what you actually put on the web. With data elements like this scattered all over the web, it looks inevitable that somebody will start connecting the dots, and the data…