An unprecedented investigation into disinformation on Facebook has hit turbulence over questions about how much data to release to outside researchers, curtailing efforts to stem one of social media’s most pernicious threats ahead of the 2020 elections.
The problem, according to Facebook and outside experts, is privacy: providing academics with even anonymized data about the links that users share on the platform—crucial to understanding how falsities spread—could in theory harm or identify individual users. The social networking giant is working to establish a template for sharing the data that ensures privacy, but academics are worried about the implications of the scrubbed data. Now the funders—who include the Laura and John Arnold Foundation, The Charles Koch Foundation, Omidyar Network’s Tech and Society Solutions Lab, and The Alfred P. Sloan Foundation—are reconsidering their commitment, putting some projects in jeopardy.
“Without additional data, empirical investigations of social media manipulation in 2020 will not be possible,” says Tom Glaisyer, managing director of the Public Square program at the Democracy Fund, another of the project’s funders. “A healthy democracy needs a well-functioning digital public square and researchers need access to additional data if we are to have an election cycle that the American people can trust.”
The effort was launched last spring with great fanfare by Facebook and with the establishment of an independent nonprofit, Social Science One, which is designed to ensure that outside researchers can trust the data and retain academic freedom. The social media giant’s CEO, Mark Zuckerberg, backed the project, which promised to give researchers access to a petabyte, or 1,000,000 gigabytes, of anonymized user data. The aim, Facebook said last April, was to “help people better understand the broader impact of social media on democracy—as well as improve our work to protect the integrity of elections.”
But the project quickly ran into problems. Facebook found that there were technical challenges involved in providing researchers remote access to its data, as well as legal and ethical concerns about handing over detailed data. “Legally and in every other way it was impossible for them to do what they promised to do, and they were wrong and they realize this and they have admitted this publicly,” says Gary King, a professor at Harvard and co-chair of Social Science One. “Facebook, of course, always has final authority about what they do with their resources, including their data.”
In April, Facebook began implementing a relatively new technique, differential privacy, that adds noise to the original data before it is published. In theory, the approach provides a certain mathematical assurance of privacy to individual users in a given data set, while retaining general patterns within the data that are useful to researchers.
Scrubbing the data, however, has slowed its release and will necessarily limit its utility. “From a privacy point of view, it protects privacy,” says King. “But from a researcher’s point of view, it’s biased.”
In August, as BuzzFeed first reported, the funders told the academic body overseeing the program that unless Facebook released key data by the end of September, they would cease their funding and recommend “winding down the project.”
“They are harming democracy by not following through”
A week before the deadline, Facebook released an initial batch of data that had been treated with a new privacy-protecting technique: 32 million URLs that were shared publicly more than 100 times on Facebook between January 1, 2017, and February 19, 2019. But that dataset is only about 7 gigabytes, a tiny fraction of the 1,000,000 gigabytes of data that had been initially promised.
The data Facebook plans to eventually release will not be sufficient to understand certain crucial aspects of disinformation campaigns, says Ariel Sheen, a doctoral student at Universidad Pontificia Bolivariana in Medellin, Colombia, whose research has been submitted to Social Science One’s review process, but who has not yet received data.
“You don’t make giant promises and then not follow through. If Facebook were a boyfriend or a girlfriend that was acting this way, you would break up with them,” Sheen says. “I say this as someone who has already looked at the data extensively: They are harming democracy by not following through.”
Sheen has been collecting his own data from Facebook to study networks of propaganda targeted at Latin American and U.S. users that he believes are orchestrated by the Venezuelan government. Without the data Facebook promised at the outset, the investigation will be significantly limited and could lose its funding.
The research groups already awarded $50,000 grants as part of Social Science One are studying misinformation from a range of angles, including how messages on Facebook may have influenced civic engagement, elections, and news consumption in countries including Taiwan, Brazil, and Germany.
Glaisyer said that since the deadline late last month, the funders have been discussing their support for further research, including a second round of 13 projects that have yet to be announced. Researchers who have already received funding won’t be asked to return it, and those who can make use of the more limited dataset, like investigators in Italy and Chile, will continue to be funded.
However, Facebook, the funders cautioned in August, can’t “offer a definitive timetable for when the full set of proposed data can be made available.”
The ongoing uncertainty over the future of the project has led to more frustrations among academics, raised questions about Facebook’s commitment to transparency, and frayed an already fragile trust between the platform and academics.
“It really remains to be seen if this is a marketing project for Facebook to increase its goodwill, as it’s being seen to do something, or whether it actually pans out as being something that’s actually going to be beneficial to researchers and therefore to society,” says Sam Woolley, a disinformation researcher and assistant professor at the University of Texas at Austin who is not directly involved in Social Science One. “Everyone supports it, but if it doesn’t result in anything, it’s almost worse than if Facebook had never done anything.”
Moving slow and not breaking things
Even if the funders withdraw, Facebook says it intends to continue working with academics to release more detailed data over the coming months, and to improve the system it has built for sharing its data. Over 20 full-time employees and a handful of consultants are working on the project within Facebook, said Chaya Nayak, head of Facebook’s Election Research Commission & Data Sharing Efforts.
“What we want [researchers] to know is that we’ve been making significant, significant investments at the company in order to enable a dataset that gets as close as we can to that dataset from last year, but also preserves the privacy of our entire user community,” she says. “We’re producing at a slower pace because we’re trying to move slowly and carefully and do this the right way.”
She pointed to other transparency and privacy efforts the company had undertaken in the wake of Cambridge Analytica, and Facebook’s more aggressive approach toward political disinformation campaigns. Last week, it said it had removed four disinformation campaigns from Russia and Iran that were targeting audiences in the U.S., Latin America, and North Africa, and exploiting divisive topics like religion, Black Lives Matter, and Trump. But observers say the platform isn’t doing enough. Last month, researchers at the Oxford Internet Institute reported that the number of known social media propaganda campaigns around the globe had more than doubled to 70 in the last two years and that Facebook remained the most popular platform for disinformation.
In recent weeks, the company has also come under fire for a set of policies that exempt political candidates, public officials, and satire and opinion articles from Facebook’s fact-checking reviews and community standards, raising more concerns about the spread of disinformation during elections. Twitter and Google have similar rules. The controversy exploded after President Trump’s campaign published an ad on Facebook that spread false claims about former vice president Joseph R. Biden Jr.
“I think lying is bad,” Zuckerberg told Representative Alexandria Ocasio-Cortez during a hearing on Capitol Hill last week. “And I think if you were to run an ad that had a lie, that would be bad. That’s different from it being in our position the right thing to do to prevent your constituents or people in an election from seeing that you had lied.”
The social media companies’ efforts to bring more transparency to political messaging have also drawn criticism. A report released earlier this month by the London-based watchdog Privacy International found that Facebook, Twitter, and Google have taken “a blatantly fragmented approach” to transparency around political ads, so that “most users around the world lack meaningful insight into how ads are being targeted through these platforms.” Facebook subjects political ads to more transparency in 35 countries, including the U.S., but for about 83% of the countries in the world, the platform does not require political advertisers to be verified, for political ads to carry disclosures, or for ads to be archived.
Woolley says that without other critical data, the companies’ ad archives may not only be insufficient but could obscure the problem.
“Advertisements are the tip of the iceberg in terms of where and how propaganda gets spread on Facebook,” he says. The ads archives “suggest to us that ‘this is where the problem lies and that we’re doing everything we can to provide you with the information that you need to analyze how manipulations work.’ But in reality, a lot of misinformation, a massive amount of it spreads on groups pages, for instance, or in the comment sections of articles, or direct messages. But we don’t know the percentage, because Facebook doesn’t share the data on those things.”
Senator Mark Warner, a Democrat from Virginia, and the vice chair of the Senate Intelligence Committee, said in an email that Facebook and other social media platforms are beginning to realize that the problem of disinformation “isn’t going away—and they’ve started making some efforts to address these challenges.”
“However,” he says, “there’s so much more we need to do to safeguard our democracy.”
Last year Warner, along with Senator Amy Klobuchar and Senator John McCain, introduced the Honest Ads Act, which would establish regulations for digital political advertising, including greater transparency from companies like Facebook.
“In Congress, we need to require greater accountability from social media platforms on everything from the transparency of political ad funding, to the legitimacy of content, to the authenticity of user accounts,” says Warner. “And if platforms refuse to comply, we need to be able to hold them responsible.”
“Not even our own employees get both”
Getting the holy grail of social data was never going to be simple. For years, King, director of the Institute for Quantitative Social Science at Harvard University, had been working closely with Facebook to gather its data for his research, and strategizing ways that Facebook could share more data with academics. A company he founded, Crimson Hexagon, has claimed to possess more social media data than any entity except for Facebook itself.
“They have the best data ever,” King told me years ago, referring not just to Facebook but Google and Twitter too. “The question is whether we can come up with incentive-compatible ways of making it so that they meet their business goals while also meeting the public or scientific goals for other uses of their data.”
But when King arrived at the company’s California headquarters last April for further discussions, Facebook’s executives were distracted by another cache of data, shared in an incentive-incompatible way: information on 87 million Facebook users that, according to a whistleblower, had been surreptitiously hoarded by Donald Trump’s dubious political consultancy Cambridge Analytica.
From a privacy perspective, it may have been precisely the wrong time for Facebook to open up its kimono to outside researchers. After all, the Cambridge scandal had been sparked by a pair of scientists at Cambridge University, who used their academic credentials to gather data on millions of Facebook users for a political and military consultancy that would later work on behalf of Trump and Brexit. (At the end of 2015, around the time that The Guardian first reported on alleged abuse by Cambridge Analytica, Facebook’s research division hired one of the researchers involved; he has since left the company.) The company would eventually agree to a record-breaking $5 billion settlement with the United States over privacy violations.
Even before the Cambridge Analytica scandal, Facebook had begun locking down its API and cracking down on tens of thousands of suspicious developers, but also limiting the data that outside academics had long enjoyed. Even King’s startup would come under scrutiny: in July 2018, shortly after the debut of Social Science One, Crimson Hexagon was suspended by Facebook for suspected violation of its policies prohibiting surveillance. Crimson’s clients have included Adidas and Anheuser-Busch InBev, as well as the Department of Homeland Security, the State Department, and entities in Turkey and Russia.
The following month Facebook reinstated Crimson Hexagon, explaining to Fast Company that an investigation found no evidence of misbehavior, but it didn’t reveal the results of its inquiry. King, who had been chairman of Crimson’s board until late last year, when it merged with the company Brandwatch, is now a board observer at the new company, but he declined to comment on the incident. Social Science One has no connection to Crimson Hexagon, says King, a message reiterated on the project’s website.
Facebook would have also been wary of what publishing its detailed data might reveal about the inner workings and impacts of its platform, and the backlash that could follow.
“Even in the circumstances where Facebook wanted to provide information, I think that there’s a fear from their side that it will result in calls for regulation that will eventually lead to attempts to clamp down on their approach to free speech,” says Woolley.
While Facebook locked down its data, King says that the exploding scandal also prompted the company to look for other ways to demonstrate its commitment to transparency, especially before regulators arrived. “The advantage this time was that we were able to use a crisis in incentive-compatible ways to the advantage of Facebook, academia, and the general public,” he says.
King presented his idea for more expansive data sharing to Facebook officials, but it wasn’t until he was in his hotel packing up to leave that he got an email from Facebook asking for his help investigating election interference.
On a subsequent phone call with Zuckerberg, King asked for two things: expansive access to Facebook’s data, and no restrictions on publication.
You can have one or the other but not both, Zuckerberg told him. Not even our own employees get both.
“We went back and forth on that,” says King. “I wanted to help, and wanted to convince them to make data available, but there was zero chance of me doing the study if they got to veto publication if they didn’t like the results.”
Then King hit upon an idea for how to make it work: an independent body of distinguished academics that would act as a trusted third party, tasked with reviewing proposals and resulting research and helping to vet the data. Facebook approved the project’s co-chairs, King and Nate Persily, a Stanford law professor and co-director of the Stanford Cyber Policy Center, and required the group to sign non-disclosure agreements. But the company agreed to exercise no power to reject publications, “except for express violations of law or endangerment to user privacy.”
As the effort began to pick up steam at Facebook, however, it bumped into a series of problems. Privacy experts working for and with Facebook, along with other privacy experts working with Social Science One, Facebook attorneys, and the firm’s policy and technical experts, advised the company that it did not have systems that could easily give researchers around the world access to specific data sets without exposing other data about users or risking their identification. Even without names or user IDs, intricate data can still reveal the identities of the people within it.
One of the consultants Facebook hired for the project was Cynthia Dwork, a computer scientist at Harvard who helped pioneer modern differential privacy. Differential privacy effectively obscures any particular user’s information in a given dataset by adding small statistical bits of noise throughout, while maintaining certain useful patterns within the data. Companies like Apple and Google are exploring the technique and the U.S. Census Bureau intends to use it to share data from the 2020 census.
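To make the idea of "adding noise" concrete, here is a minimal Python sketch of the Laplace mechanism, a textbook way to achieve differential privacy for a simple count. It is an illustration only and is not a description of Facebook's actual system. The function names, the example count of 1,000 shares, and the epsilon value of 0.5 are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a count with Laplace noise of scale sensitivity / epsilon.

    `sensitivity` is how much one user can change the count (assumed to be 1
    here, i.e. each user contributes at most one share to the tally).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2019)  # fixed seed so the illustration is repeatable
    true_shares = 1000         # hypothetical share count for one URL
    released = [noisy_count(true_shares, epsilon=0.5, rng=rng) for _ in range(5)]
    print("individual releases:", [round(x, 1) for x in released])
```

Any single released number is slightly off, which is what hides one user's contribution, while totals and averages across many URLs stay close to the truth.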
In the initial 32-million-URL dataset Facebook treated with differential privacy and released in September, researchers can find information about whether a link was fact-checked; the number of users who labeled it fake news, spam, or hate speech; how many times the URL was shared publicly; how many times it was shared publicly without being clicked on; and the country in which it was most shared.
It does not, however, include other critical data for misinformation research, including “the number of times users shared public links privately with their friends.”
One question still to be resolved by Facebook and Social Science One is how protected the data should be. Differential privacy relies upon a numerical privacy score, which determines how much noise is added to the data. The noise helps protect the identity of individual users, but too much noise could make research difficult, if not impossible. Facebook and Social Science One are still deciding what that number should be.
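The trade-off can be shown with a few lines of arithmetic. In the differential privacy literature the privacy parameter is usually called epsilon, and this sketch assumes that is what the "privacy score" refers to. It also assumes the textbook Laplace mechanism, in which the noise scale equals sensitivity divided by epsilon; the specific epsilon values and function names are hypothetical and do not reflect the numbers Facebook and Social Science One are weighing.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale for the Laplace mechanism: sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def noise_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace noise, which is sqrt(2) times its scale."""
    return math.sqrt(2.0) * laplace_scale(epsilon, sensitivity)


if __name__ == "__main__":
    # Smaller epsilon means stronger privacy but noisier data for researchers.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: typical noise per count ~ {noise_std(eps):.2f}")
```

A URL shared only a few hundred times can be swamped by noise at a small epsilon, which is why the choice of that number matters so much to researchers.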
That caveat is a problem for researchers like Sheen, who say they had expected access to far more data, including about other URLs that were shared privately as well as a wider range of demographic information about users.
For Sheen, the point was to examine that data in order to understand not just the content but the behavior of disinformation campaigns, and to map the ways those campaigns had impacted real-world discussions and events in Latin America and the U.S. Behind the propaganda campaigns, Sheen says he and his team have found more than 3,000 sockpuppet accounts linked with Telesur, a Latin American television network largely financed by the Venezuelan government.
“With that data, we can see so many events related to Venezuelan activity,” he says. “It would allow us to use Facebook data as a map to show the threads of specific messaging campaigns. Without that data, we can just say that this stuff happens, but no way of calculating its actual impact.” At protests like the ones that erupted in Ferguson after the police killing of Michael Brown, “we can know that this many people were there and participating in that, but we can’t say or show what kind of messaging on Facebook had an impact on that event.”
Camille Francois, chief innovation officer at Graphika, said that the lack of robust, detailed data about disinformation made it harder for researchers to do deeper analysis, including understanding the perpetrators and their behavior. “This in turn speaks to the problem of designing systems that guarantee security & privacy while ensuring that academic researchers, infosec researchers, human rights investigators can access the data they need to help tackle the issue,” she wrote on Twitter recently. “How far are we from that? Very far.”
Sheen has proposed another idea to Facebook officials that would allow his team’s research to proceed: he and his team would view unadulterated data sets at the company’s headquarters, with Facebook employees supervising their access. Nayak, of Facebook, said the company had considered that option, but that providing on-site, supervised access for outside researchers in a way that ensured the privacy of user data would be overly complicated and onerous.
Still, King thinks Facebook may be willing to allow some researchers to examine more detailed data sets that are not treated with differential privacy techniques. Only differential privacy systems can mathematically guarantee a level of privacy protection. “In practice, however,” King says, “turning mathematical ideas into real, practical systems involves something more than math, and so this approach can be helpful.”
During a symposium on election interference last month, Senator Warner was also asked about the possibility of establishing facilities where researchers could investigate more sensitive social media data.
“In the short term, I think the most important thing we can do is to continue to try to work with [researchers] to bring pressure on the platforms to be more forthcoming,” Warner said. “It’s hard to legislate that access, so I’m hoping persuasion can be used. Ultimately, if not, we have looked at legislative solutions as well.”
Despite his misgivings about Facebook’s commitment to transparency, Sheen doesn’t support more regulation. He also appreciates Facebook’s privacy concerns. But, he says, there are greater costs to keeping more detailed data about disinformation and other abuse hidden from researchers.
“Facebook not providing the data is a serious, genuine threat to American political norms,” he says. “To me, this is one of those pivotal moments in tech or social science. If they don’t follow through they would be on the wrong side of things.”