Bots Are Scraping Your Data For Cash Amid Murky Laws And Ethics

Is the data you share publicly on social networking sites like an announcement in a public place, where speech and information gathering are protected under the First Amendment? Or is it more like something uttered on private real estate, where the owner can prohibit trespassers as they wish?

That quandary recently emerged in a California courtroom, where two of the country’s most eminent constitutional lawyers squared off in a high-stakes battle between a data giant and a tiny startup.

For years, hiQ, a data mining company in San Francisco that helps employers predict which of their employees are thinking about jumping ship, built its business on the back of a valuable cache of data: public user profiles on the professional networking site LinkedIn.

Then in June, a year after Microsoft completed its $26.2 billion acquisition of LinkedIn—backed by its own data analytics ambitions—the happy if unofficial relationship came to an abrupt end when the startup received a letter ordering it to immediately stop scraping LinkedIn profiles. Suddenly, hiQ’s entire business model was in jeopardy. The startup devoted to helping employers keep talent would itself begin to shed it: The three-year-old company said it lost 10 of its 24 employees since LinkedIn blocked access.

To make its case, LinkedIn claimed that by scraping public user profiles (something hiQ had done for years with LinkedIn’s knowledge) the startup was violating the Computer Fraud and Abuse Act (CFAA), a 1980s law that criminalized unauthorized intrusions into computer systems in an era before large networks were common. Effectively, LinkedIn claimed, hiQ was a hacker.

Instead of simply accepting the cease-and-desist letter, or waiting for LinkedIn to sue, hiQ went to court. With the help of one of the country’s pre-eminent constitutional lawyers, Harvard Law School professor Laurence Tribe, hiQ argued that its right to access public profile data was protected by the First Amendment. LinkedIn, for its part, hired Donald Verrilli, who recently served as the Obama administration’s top appellate lawyer.

Last week a California judge sided with hiQ, issuing a preliminary injunction that ordered LinkedIn to again let the startup scrape its data, at least until the case is adjudicated. In his judgment, U.S. District Judge Edward M. Chen equated LinkedIn to a store owner who hangs a sign in a window and then seeks to prevent certain people outside from seeing it.

hiQ’s Keeper software uses LinkedIn and other data to help employers retain talent. Image: hiQ

“The data here is data that LinkedIn has told its users is public,” explains Nate Cardozo, senior staff attorney for the Electronic Frontier Foundation. HiQ argued that it scrapes only public LinkedIn data (a total of 175,000 profiles, it says), that it does not need to log in to an account to see the data, and that LinkedIn does not claim a proprietary interest in its users’ public profiles—that LinkedIn doesn’t “own” your data any more than hiQ does. “It’s data that someone who is not logged in to LinkedIn or doesn’t have an account can still see,” says Cardozo.

Despite hiQ’s temporary victory, the preliminary decision did not resolve the question of whether the CFAA applies, nor did it address claims that hiQ had breached LinkedIn’s terms of use, as well as claims based in copyright or trespass.

But to Cardozo, LinkedIn’s interpretation of the CFAA amounts to a flagrant abuse of the anti-hacker law. Passed in 1984 and updated in 1986, the law was designed in a time of mainframes, before the World Wide Web even existed. It prohibits access without authorization, and access exceeding authorization, but Congress never intended the law as a multipurpose tool for restricting access online. Otherwise, any user could be considered a hacker for something like, say, using borrowed passwords, even with the permission of the password holder.

“LinkedIn is one of the companies that holds a view of the CFAA that—and they might dispute me on this, but this is the position they’ve taken—essentially a violation of terms of service can revoke authorization, therefore causing a violation of the CFAA,” he says. “And if you violate those Terms of Service, your access to the site is therefore unauthorized.”

The 9th Circuit has rejected this view repeatedly, says Cardozo. The Department of Justice once held that violations of Terms of Service violated the CFAA, but it backed off this interpretation after the United States District Court for the Central District of California acquitted Lori Drew of “hacking” and other alleged crimes in the “Myspace Suicide” case.

What’s needed now, he argues, is for the Supreme Court to review the CFAA—something it may well do in two data scraping cases slated for this fall that allege violations of the CFAA. Amicus briefs in the two cases, from the EFF in Nosal v. United States and from the Cato Institute in Power Ventures v. Facebook, urge the highest court to consider the CFAA’s continued relevance in the age of Big Data.

Whose Data Is It Anyway?

But addressing the CFAA leaves aside larger questions related to how our data is collected and shared—inadvertently or not—online. The updates, photos, and likes we share on our publicly visible social media contain valuable clues to who we are and what we’re after. Crawling and scraping that data is now a common activity on the web, a key component of web search software, data mining and advertising, finance, law enforcement, and academic research.

But to whom does that data really belong, and is it okay for others to copy all of it, automatically and at a large scale?

Cardozo dismisses that concern in hiQ’s case, noting that it’s simply taking data that’s out there—information that is already available to people online without LinkedIn accounts—and aggregating it. “The question here is: Is what hiQ is doing legal?” he adds. “If the answer is yes, then LinkedIn will be ordered to not stop them.”

But that analysis avoids long-standing questions surrounding the ethics of web scraping. Casey Fiesler, a faculty member in Information Science at the University of Colorado Boulder who has used data scraping in her research, emphasizes this distinction. “It is also important to keep in mind that even if violating TOS is legal it might not always be ethical,” she wrote recently in a Medium post.

Fiesler cites the case of Yik Yak as an example of scraping ethics. Her department asked Yik Yak, an anonymous chat app, if it could scrape the app, but Yik Yak declined.

“Consider why this particular platform might not have wanted researchers to scrape,” says Fiesler. “Yik Yak was ephemeral by design. Its users had an expectation that that data would not be archived or available beyond its appearance on the platform. Though researchers might not have intended to make the data available, they could have, particularly since in some disciplines it is customary to publish datasets along with analysis.”

Granted, Fiesler is talking about scraping social media user data for academic purposes, not for more questionable purposes like marketing or law enforcement.

Consider other recent data scraping cases related to craigslist. In April, the classifieds site obtained a $60.5 million judgment against a real estate listings website that had allegedly received scraped craigslist data from another entity. And earlier this month, craigslist reached a $31 million settlement and stipulated judgment with Instamotor, an online and app-based used car listing service, over claims that Instamotor scraped craigslist data to populate its own listings and sent unsolicited promotional emails to craigslist users. Craigslist did not allege a violation of the CFAA; rather, it argued that Instamotor had breached craigslist’s terms of use and violated federal laws around spam emails.

More sophisticated data scraping efforts have prompted concern, if not legal action. As the Daily Dot reported last year, U.K. startup Tenant Assured had planned on scraping Facebook users’ accounts to give landlords information on personalities and “financial stress.” In 2015, the Intercept reported, a company contracted with U.K.-based SCL Group harvested profile data from at least 30 million Facebook users for the purposes of personality analysis and political campaigning, before Facebook ordered the company to stop. SCL is the parent company of Cambridge Analytica, the Trump campaign data science firm that has touted a giant cache of data on millions of Americans.

Last year, concerns over the use of public social media data by law enforcement led Twitter to shut down access to an API used by Geofeedia, a social media analysis company that works with police to provide real-time monitoring of online discussion around events like emergencies and protests. In March, Facebook also updated its rules to warn developers that they could not “use data obtained from us to provide tools that are used for surveillance.”

Follow The Money

In its effort to cut off hiQ, LinkedIn has also argued that it wants to defend its users’ privacy from automated bots operated by unknown third parties, insisting that users control how their data is used. But Judge Chen didn’t quite buy that argument. “LinkedIn’s professional privacy concerns are somewhat undermined by the fact that LinkedIn allows other third parties to access user data without its members’ knowledge or consent,” he noted in his injunction.

Indeed, the battle over scraping is taking place against the background of a bustling multi-billion dollar battle to collect and mine our data. It’s an industry that’s left us with a slew of difficult questions about user consent and privacy—and one in which a company like LinkedIn also has an enormous stake and a sizable advantage.

In pointing to the comparative harms raised by LinkedIn’s move, the Court alluded to that backdrop. “hiQ unquestionably faces irreparable harm in the absence of an injunction, as it will likely be driven out of business,” the Court’s injunction reads. “The asserted harm LinkedIn faces, by contrast, is tied to its users’ expectations of privacy and any impact on user trust in LinkedIn. However, those expectations are uncertain at best, and in any case, LinkedIn’s own actions do not appear to have zealously safeguarded those privacy interests.”

LinkedIn’s privacy concerns are a matter of public relations, argues Cardozo. The company’s real motive, he says, has more to do with an aggressive business in collecting and analyzing data—and selling or licensing that data to other entities willing to pay for it.

“LinkedIn has essentially made a business of [the CFAA], not just in this context but several others,” he says. “I believe that LinkedIn—and I don’t have any inside knowledge—would want hiQ to pay for the data. I believe it’s simply money that is at issue here.”

Protect yourself: Because hiQ and other data scrapers only scrape data from public LinkedIn profiles, you can guard your data by changing your privacy settings. Once logged in to LinkedIn, edit your public profile on the right-hand side of the page, and select “Make my public profile visible to no one” in the Customize Your Public Profile section.
