To anyone studying humanity, the big data generated by social media can be hard to resist. It’s readily available, and it might have the look of randomness, promising, perhaps, some representative sample of the opinions, emotions, and behaviors of millions of people. And all that data ends up cited in thousands of social science studies every year.
But that kind of data is often tainted by bias, argues a new paper published in Science, and data scientists and the public should be on alert.
The new research, a collaboration between computer scientists Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon, grew out of similar challenges the professors had faced in their own research using social media data. Ruths had demonstrated “that the community was severely over-estimating the accuracy of their classifiers of political orientation of Twitter users,” while Pfeffer had shown that Twitter’s stream of public data isn’t representative of the public at large.
The research comes amidst an onslaught of social-media-based science. Every year thousands of social science studies on everything from media perceptions to disaster response cite Twitter, Facebook, and other networks. Meanwhile, the UN, DARPA and other government agencies rely on large-scale data mining of social networks to make predictions about poverty, health, and large events, and to gather intelligence.
And as academia clamors for more access to social networks’ proprietary data, the companies’ own data scientists are plumbing their streams for public and private insights. Earlier this year, Facebook stirred controversy after a study about “emotional contagion” on the network appeared in the Proceedings of the National Academy of Sciences. Facebook was accused of manipulating users without their consent, but the scientific merit of its study was not then a major issue.
If researchers are indeed pulling from Twitter’s API, which the company says is “suitable for data mining”—and around which it dedicated $10 million this year to start a new lab at MIT—the social network’s biases might go unaccounted for without careful precautions.
By missing or misreading inherent biases, social scientists can make significant mistakes. The researchers compare the methodological problems with social media to those that plagued pollsters in 1948, whose telephone surveys ahead of the presidential election led to the famously incorrect headline “Dewey Defeats Truman.” The polls significantly underestimated the number of rural Truman supporters, many of whom didn’t own phones at the time.
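The mechanism behind that failure, undercoverage, is easy to simulate. The sketch below uses entirely hypothetical numbers, not the actual 1948 figures: it draws a population in which rural voters favor Truman but are less likely to own phones, so a poll that can only reach phone owners systematically understates his support.

```python
import random

random.seed(0)

# Hypothetical population: half rural. Rural voters lean Truman (70%),
# urban voters lean Dewey (60%). Rural phone ownership is assumed low.
population = []
for _ in range(100_000):
    rural = random.random() < 0.5
    truman = random.random() < (0.70 if rural else 0.40)
    has_phone = random.random() < (0.30 if rural else 0.80)
    population.append((truman, has_phone))

true_share = sum(t for t, _ in population) / len(population)

# A "telephone poll" can only sample people who own phones.
phone_sample = [t for t, p in population if p]
polled_share = sum(phone_sample) / len(phone_sample)

print(f"true Truman share:   {true_share:.3f}")
print(f"phone-poll estimate: {polled_share:.3f}")  # systematically low
```

With these assumed numbers the poll lands several points below the true share, not because the sample is small, but because who can be reached is correlated with how they vote. The same logic applies when “who is on Twitter” is correlated with the behavior being studied.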
A more recent example is Google Flu Trends. When it debuted in 2008, Google’s model seemed to predict where and when the flu had spread in the U.S. with relative accuracy, based on where people were searching Google with flu-related terms. By the 2012-2013 flu season, however, it overshot the number of flu cases 95% of the time, according to David Lazer, a Northeastern University computer science professor. Even after Google updated its algorithm the next year, it still overshot by 75%, Lazer said.
“Although not widely reported until 2013, the new Google Flu Trends has been persistently overestimating flu prevalence for a much longer time,” Lazer wrote, noting that the company had not disclosed the 45 search terms it used in its tracking.
One of the most obvious pitfalls of social media for students of human behavior—and one that remains widely unaddressed—is that different social networks attract different groups of people. For instance:
- Pinterest is dominated by women aged 25 to 34, 27 percent of whom, according to Pew, have a household income over $75,000;
- Twitter, according to Pew, is popular among non-Hispanic Blacks aged 18 to 29; and
- Facebook is most popular among women aged 18 to 29.
One central challenge is that social platforms are interested in their own big data for specific, self-serving purposes—mostly, keeping users on their websites. That’s not optimal for science. Whether it’s the publicly available data—the stuff of APIs—or even the wealth of data viewable by the companies themselves, certain biases are baked in.
Ruths and Pfeffer offer examples: “Google stores and reports final searches submitted, after auto-completion is done, as opposed to the text actually typed by the user,” they write. “Twitter dismantles retweet chains by connecting every retweet back to the original source (rather than the post that triggered that retweet).”
The design of the network isn’t the same as the design of a good experiment, and it’s governed by a host of factors that have little to do with the public interest.
Consider Uber’s “data science” team. Like Google’s mappers, they possess a stream of data that might appear useful, say, for urban transportation planning. But the company keeps that data close, and has raised privacy concerns over how it has monitored users’ movements. The data is also specific to the Uber network itself, rather than a reflection of how most people move through cities.
There are other issues too: social media bots that appear to be human, the fact that human behavior isn’t easily measurable in data, and the use of overly broad criteria in measuring social media data. The point, say the researchers, is one that statisticians and computer scientists have been making for years: Don’t believe the hype surrounding “big data.”
The paper offers a few suggestions for reducing bias in social media research, the first of which is adjusting for platform-specific biases. But even if these biases can be corrected, would the platforms want them to be?
Ruths doesn’t think so. “Ultimately, social platform providers want to provide a compelling and natural experience to users,” he tells Co.Labs, “so platform engineering would benefit from a better understanding of how people engage with their platforms and what aspect of user behavior is a result of the platform, versus a result of simply how people behave (either in isolation or socially).”
After the publication of its study this summer, Facebook data scientist Cameron Marlow clarified that the company’s research wasn’t conducted with science in mind. “Our goal is not to change the pattern of communication in society,” he wrote. “Our goal is to understand it so we can adapt our platform to give people the experience that they want.” In a blog post, the founder of OKCupid defended his company’s own emotional experiments by pointing out that, if you’re using the Internet, with its incessant A/B testing, you’re bound to become part of a data set somewhere along the way.
Facebook and OKCupid may have stirred up controversy for playing with people’s emotions in the name of research, but their research portends an even more worrisome Internet trend: bad science.