
POV: Big Tech has a glaring double standard when it comes to web scraping

Platforms treat independent researchers and enterprise marketing tools wildly differently, even when they’re doing the same thing


By Brandi Geurkink

A couple of years ago, I received an email from Google telling me that my browser extension, RegretsReporter, was at risk of being removed from the Chrome Web Store for violating Google’s policies. It made me deeply uneasy—not because our extension was doing anything malicious, but because it seemed that one of the most powerful companies in the world might be coming after me for my work.

Just three weeks earlier, I had published research, covered by press around the world, showing how YouTube’s recommendation algorithm had been routinely surfacing harmful content to people. That research was powered by the very browser extension Google was now threatening to shut down.

It was a familiar playbook. In 2019, a few months before the European Parliamentary elections, Facebook deliberately changed its code to block ad transparency tools run by several nonprofits, including Mozilla (where I am a senior fellow), ProPublica, and WhoTargetsMe. One year later—and a month shy of the 2020 U.S. Presidential election—Facebook sent a cease-and-desist letter to researchers at New York University who were also using a browser extension to monitor political advertising on the platform. In each of these cases, Facebook’s aggressive strategy centered on a technique called web scraping.

Web scraping uses automation to retrieve and archive content from websites: content like the ads that you see on Facebook, the Tweets that appear in your feed, or the recommended videos on your YouTube homepage. Independent researchers have used web scraping to reveal large-scale disinformation operations, horrifying malfunctions in platform algorithms, and more. But scraped data isn’t only beneficial for public interest research—it also has enormous commercial value.
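
Mechanically, the technique is straightforward. As a rough illustration only (not how any particular tool works), the Python sketch below fetches a public web page and pulls structured content out of its HTML; the URL and CSS selector are hypothetical placeholders. Tools like RegretsReporter work differently in practice, running inside the browser and recording what a real user is actually shown.

    # A minimal illustration of web scraping: fetch a public page and
    # extract structured content from its HTML. The URL and selector
    # below are hypothetical placeholders, not from any real project.
    import requests
    from bs4 import BeautifulSoup

    def scrape_headlines(url: str) -> list[str]:
        # Retrieve the page over HTTP, identifying the client honestly.
        response = requests.get(
            url, headers={"User-Agent": "research-bot/0.1"}, timeout=10
        )
        response.raise_for_status()

        # Parse the HTML and collect the text of each matching element.
        soup = BeautifulSoup(response.text, "html.parser")
        return [h.get_text(strip=True) for h in soup.select("h2.headline")]

    for title in scrape_headlines("https://example.com/news"):
        print(title)

The same few lines of logic, pointed at ads, tweets, or recommendations and run at scale, are what both researchers and social listening companies build on.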

Scraping is the bread and butter of the “social listening” industry, which collects and analyzes social media data on behalf of companies that want to find out what people think of their brand and keep tabs on trends that might impact their business. Multimillion-dollar companies like Brandwatch and Meltwater use a variety of methods to collect this data, including web scraping, and sell access to their data tools through subscriptions that cost thousands of dollars per month. Yet while researchers are routinely served cease-and-desist letters for the same practices, social listening companies are considered trusted partners of social media companies.

One reason for this double standard may be the vast sums of money that flow from social listening companies to platforms. In addition to web scraping, many social listening companies pay platforms hundreds of thousands of dollars per month for enterprise-level APIs that unlock even more social media data. Plus, the business of “social listening” is itself about getting brands to spend more and more money advertising on social media—a motive that makes it easy for platforms to look the other way.

Except, of course, when these companies start to threaten platforms’ bottom line. Facebook recently settled a two-year legal battle with two “marketing intelligence” companies, BrandTotal and Unimania, which it argued had violated its terms of service by scraping data. In a countersuit, BrandTotal claimed that Facebook’s legal actions were anticompetitive, intended to stop smaller companies from providing advertising analytics to their customers. After all, Facebook filed the lawsuit right after BrandTotal raised $12 million in venture funding and secured contracts with major brands like L’Oreal.

Facebook and other companies tend to invoke user privacy concerns when they go after people for scraping data, particularly in light of the Cambridge Analytica scandal. And while it is true that web scraping can pose significant privacy risks, especially for nonpublic data, the argument has become a convenient pretext for attacking projects (or companies) that platforms don’t like. The legal ambiguity around web scraping, particularly for public interest research, perpetuates a game of whack-a-mole in which platforms wield the mallet and researchers scramble to predict whom it will come down on next. Ironically, this dynamic is even worse for online privacy: people who use web scraping to access social media data must do so in the shadows, and without that data the privacy practices of the platforms themselves become impossible to scrutinize.

While access to social media data is dwindling (except for those willing to pay for it), its importance to individual and societal well-being is only increasing. Next year, more than two billion people will head to the polls as U.S. Presidential and European Parliamentary elections converge with federal elections in India, South Africa, Mexico, and more than 50 other countries around the world. Digital platforms will remain key battlegrounds where these races play out, and in the absence of oversight they may become front lines for conflict and mass manipulation. Europe’s Digital Services Act, fortunately, contains provisions granting researchers access to social media data, but that access is confined to research groups whose applications are approved by yet-to-be-appointed government agencies across the EU. For those who want to monitor the digital space from the outside, like journalists and civil society researchers, the legal risks remain.

In spite of these risks, my team at Mozilla has continued to maintain RegretsReporter, which more than 70,000 people have used to collect and send us data about the recommendations they are seeing on YouTube. Like most public interest researchers, we hope that the reputational damage that Google would suffer from shutting us down is a significant enough deterrent to keep them from doing so. Last year, our research was cited by lawyers petitioning the U.S. Supreme Court to take up a case that will decide whether companies like YouTube could be held legally liable for content that their algorithms recommend—a decision which could have huge consequences for Google’s core business. The jury is still out on how this may influence their calculus, though if they do shut us down we could always consider a strategic rebrand as a sleek social listening tool for big companies.


Brandi Geurkink is a senior policy fellow at Mozilla, a research fellow with the Siegel Family Endowment, and a board member of the Coalition for Independent Technology Research.


