No matter what problem you’re talking about these days, Big Data is inevitably the answer. We hear it constantly. Data will help us live healthier lives. Data will allow us to design smarter cities. Data will make sure nobody makes a bad TV show ever again! When you trust in data, the thinking goes, you can’t go wrong, because data doesn’t lie.
But as Kate Crawford, a Microsoft researcher, reminds us, data can lie, and often does.
Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
For example, consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a ‘signal problem’: Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.
If you work with data, it’s your responsibility to subject it to this type of scrutiny. Crawford urges data scientists not only to think about how data sets can be applied but also to question where they came from: essentially, to step back from the numbers and take a broader view of the circumstances that produced them.
But it’s important for all of us to be wary of data’s shortcomings. Big Data is, after all, big business, and its proponents will continue breathlessly trying to sell us on its merits. The reality, however, is that data isn’t a panacea; it’s simply a tool that can help us solve certain problems in certain situations.
Granted, it can be hard to think critically about these things in the midst of Big Data’s overwhelming buzz. So it’s nice that a new collection of essays on the topic gives us an incisive and equally catchy phrase to counterbalance all that hype. It’s a tidy reminder that data is never immune to bias in one form or another. You don’t need to read the book. Just keep its title in mind. It’s called “Raw Data” Is an Oxymoron.
[Illustration: Lighthouse via Shutterstock]