Date: 2014-12-02

To anyone studying humanity, the big data generated by social media can be hard to resist. It’s readily available, and it might have the look of randomness, promising, perhaps, some representative sample of the opinions, emotions, and behaviors of millions of people. And all that data ends up cited in thousands of social science studies every year.

But that kind of data is often tainted by bias, argues a new paper published in Science, and data scientists and the public should be on alert.

The new research, a collaboration between computer scientists Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon, grew out of similar challenges the professors had faced in their own research using social media data. Ruths had demonstrated “that the community was severely over-estimating the accuracy of their classifiers of political orientation of Twitter users,” while Pfeffer had shown that Twitter’s stream of public data isn’t representative of the public at large.

The research comes amidst an onslaught of social-media-based science. Every year thousands of social science studies on everything from media perceptions to disaster response cite Twitter, Facebook, and other networks. Meanwhile, the UN, DARPA and other government agencies rely on large-scale data mining of social networks to make predictions about poverty, health, and large events, and to gather intelligence.

And as academia clamors for more access to social networks’ proprietary data, the companies’ own data scientists are plumbing their streams for public and private insights. Earlier this year, Facebook stirred controversy after a study about “emotional contagion” on the network appeared in the Proceedings of the National Academy of Sciences. Facebook was accused of manipulating users without their consent, but the scientific merit of its study was not then a major issue.

If researchers are indeed pulling from Twitter’s API, which the company says is “suitable for data mining” and around which it dedicated $10 million to start a new lab at MIT this year, the biases inherent in the social network might go unnoticed without careful precautions.

The famous “Dewey Defeats Truman” headline. Image via the National Archives

By missing or misreading inherent biases, social scientists can make significant mistakes. The researchers compare the methodological problems with social media to those that plagued pollsters in 1948, who relied on telephone surveys to predict the outcome of the presidential election, leading to the famously incorrect headline “Dewey Defeats Truman.” The polls significantly underestimated the number of rural Truman supporters: at the time, many of them didn’t own phones.