Anyone could download Cambridge researchers’ 4-million-user Facebook data set for years




A data set of more than 3 million Facebook users and a variety of their personal details collected by Cambridge researchers was available for anyone to download for some four years, New Scientist reports. It’s likely only one of many places where such huge sets of personal data collected during a period of permissive Facebook access terms have been obtainable.

The data was collected as part of a personality test, myPersonality, which, according to its own wiki (now taken down), was operational from 2007 to 2012, though new data was added as late as August 2016. It started as a side project by the Cambridge Psychometrics Centre’s David Stillwell (now deputy director there), but later graduated to a more organized research effort. The project “has close academic links,” the site explains, “however, it is a standalone business.” (Presumably for liability purposes; the group never charged for access to the data.)

Though “Cambridge” is in the name, there’s no real connection to Cambridge Analytica, just a very tenuous one through Aleksandr Kogan, which is explained below.

Like other quiz apps, it requested consent to access the user’s profile, which, combined with responses to questionnaires, produced a rich data set with entries for millions of users. Data collected included demographics, status updates, some profile pictures, likes and lots more, but not private messages or data from friends.

Exactly how many users are affected is a bit difficult to say: the wiki claims the database holds 6 million test results from 4 million profiles (hence the headline), though only 3.1 million sets of personality scores are in the set, and far fewer data points are available on certain metrics, such as employer or school. At any rate, the total number is on that order, though the same data is not available for every user.

Although the data is stripped of identifying information, such as the user’s actual name, its volume and breadth make the set susceptible to de-anonymization, for lack of a better term. (I should add there is no evidence that this has actually occurred; simple anonymizing processes on rich data sets are just fundamentally more vulnerable to this kind of reassembly effort.)

This data set was available via a wiki to credentialed academics who had to agree to the team’s own terms of service. It was used by hundreds of researchers from dozens of institutions and companies for numerous papers and projects, including some from Google, Microsoft, Yahoo and even Facebook itself. (I asked the latter about this curious occurrence, and a representative told me that two researchers listed had signed up for the data before working there; it’s unclear why, in that case, the listing I saw would give Facebook as their affiliation, but there you have it.)

This in itself violated Facebook’s terms of service, which ostensibly prohibited the distribution of such data to third parties. As we’ve seen over the last year or so, however, Facebook appears to have exerted almost no effort at all in enforcing this policy, as hundreds (potentially thousands) of apps were plainly, and seemingly proudly, violating the terms by sharing data sets gleaned from Facebook users.

In the case of myPersonality, the data was supposed to be distributed only to actual researchers; Stillwell and his collaborator at the time, Michal Kosinski, personally vetted…
