Between our smartphones, cars, computers, and credit cards, a substantial portion of our lives is tracked. Where we eat, work, live and play; who we converse with; what we read and what we buy. One day, our own genes might be on the list.
In the right hands (ahem, not the NSA), it’s hard to overestimate the value of these kinds of data sets, aggregated over thousands or millions of people, to improve how we live at large. Scientists are tracking the spread of diseases with search data. Transportation planners are designing better road networks and reducing congestion with the help of data from Uber. And researchers are using mobile phone data to better understand the patterns of poverty. (Clearly, companies also value this data to improve their products and profits.)
The problem arises when so-called “metadata” are shared with others or made public. Even when all personally identifying information (name, birthdate, address) is completely removed from a database and all that’s left is the raw data, unattached to an individual, that data is not as anonymous as we are usually led to believe.
Previously, researchers were able to show that a mobile phone data set showing the location of anonymous users could be tied back to specific people. All it took was four pieces of outside information about an individual, such as a tweet they sent that pinpointed them in a given location at a given time.
In a new study released today in the journal Science, researchers at MIT, Rutgers University, and Aarhus University in Denmark were able to show that one of the most sensitive forms of personal data isn’t really anonymous either. They looked at three months of anonymized credit card transactions for 1.1 million people in 10,000 shops in an unnamed developed nation. The database didn’t have any names, account numbers, or other obvious identifying features. Each transaction had the day (but not the exact time) and the store where it was made.
What they showed is scary: Even with this anonymous data, they could reverse mine the database and relatively easily re-identify 90% of the real people in it, just knowing a few bits of information about them.
Here’s an example, which the authors, Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, and Alex “Sandy” Pentland, describe in the paper. Say a nefarious person was looking for “Scott” in a credit card transaction dataset. They know two things about him: that he was at a particular bakery on September 23 and a particular restaurant on September 24 (maybe he left a Yelp review or made publicly available Facebook posts). Someone could search through the anonymous data set and realize only one person fits that description, so that person must be Scott. Knowing that, the person now also knows every single one of his other transactions contained within–including private information he doesn’t want others to find.
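The “Scott” search can be sketched in a few lines of Python. This is only an illustration of the matching idea, not the authors’ code: the records, the anonymous IDs (u1, u2, u3), and the helper names build_traces and matching_users are all invented for the example.

```python
from collections import defaultdict

# Invented toy records: (anonymous_id, shop, day). Not data from the study.
TRANSACTIONS = [
    ("u1", "bakery", "09-23"),
    ("u1", "restaurant", "09-24"),
    ("u1", "pharmacy", "09-25"),
    ("u2", "bakery", "09-23"),
    ("u2", "cinema", "09-24"),
    ("u3", "restaurant", "09-24"),
    ("u3", "bookstore", "09-26"),
]


def build_traces(transactions):
    """Group transactions into one set of (shop, day) points per anonymous ID."""
    traces = defaultdict(set)
    for user, shop, day in transactions:
        traces[user].add((shop, day))
    return dict(traces)


def matching_users(traces, known_points):
    """Return every anonymous ID whose trace contains all the known points."""
    known = set(known_points)
    return sorted(user for user, trace in traces.items() if known <= trace)


traces = build_traces(TRANSACTIONS)

# What the attacker knows about "Scott" from outside sources.
known_about_scott = [("bakery", "09-23"), ("restaurant", "09-24")]
candidates = matching_users(traces, known_about_scott)

if len(candidates) == 1:
    # A single match exposes that person's entire purchase history.
    print(candidates[0], sorted(traces[candidates[0]]))
```

In this toy data, the bakery visit alone matches two people, but adding the restaurant visit narrows it to one, and the pharmacy purchase comes along for free.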
In the study, the authors show that, on average, knowing just four data points like this was enough to re-identify 9 out of 10 people in the database, because their patterns were pretty unique. They also found that it is easier to pinpoint women than men, and easier to pinpoint high-income individuals than lower-income individuals. Importantly, even when they made the database “lower resolution” (like recording the week of the transactions instead of the specific day), it only became a little bit harder to re-identify people, not impossible.
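To make the idea of “how many people can be singled out with a few known points” concrete, here is a rough sketch that measures it on a toy data set, first at daily resolution and then with days grouped into weeks. Everything in it is an assumption for illustration: the traces are invented, the function names unicity and coarsen_to_weeks are hypothetical, and it checks every possible combination of known points exhaustively, which is practical only for tiny data and is not necessarily how the study computed its figures.

```python
from itertools import combinations

# Invented toy traces: anonymous ID -> set of (shop, day number) points.
TRACES = {
    "u1": {("bakery", 1), ("restaurant", 2), ("pharmacy", 9)},
    "u2": {("bakery", 3), ("restaurant", 4), ("cinema", 10)},
    "u3": {("bakery", 1), ("cinema", 2), ("bookstore", 8)},
}


def coarsen_to_weeks(traces):
    """Lower the resolution: replace each day number with a week index."""
    return {
        user: {(shop, (day - 1) // 7) for shop, day in trace}
        for user, trace in traces.items()
    }


def unicity(traces, p):
    """Fraction of all p-point combinations that match exactly one trace."""
    total = 0
    unique = 0
    for trace in traces.values():
        for known in combinations(sorted(trace), p):
            known_set = set(known)
            matches = sum(1 for other in traces.values() if known_set <= other)
            total += 1
            if matches == 1:
                unique += 1
    return unique / total if total else 0.0


daily = unicity(TRACES, 2)
weekly = unicity(coarsen_to_weeks(TRACES), 2)
print(f"daily: {daily:.2f}, weekly: {weekly:.2f}")
```

On these three made-up traces, every pair of daily points singles out one person, while grouping by week leaves 7 of the 9 pairs unique. The specific numbers mean nothing beyond the toy data; the point is only that coarsening reduces uniqueness without eliminating it.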
What does this all mean? It means that for metadata sets, which are often collected by private companies, to be shared widely with researchers, governments, and society at large, better means to protect people are required. The simple assurance that all “personally identifying information” has been removed should not be taken as a guarantee.
On the other hand, the researchers say, requiring an absolute guarantee that a data set cannot be linked back to individuals would probably render it useless or discourage any sharing. “Finding the right balance between privacy and utility is absolutely crucial to realizing the great potential of metadata,” they write.