The value of data has always been up for debate—but now we’re testing its true worth.

Reddit’s move to put its API behind a paywall could shift how AI is trained

[Photo: Burst/Pexels]

BY Chris Stokel-Walker
4 minute read

The models behind the generative AI tools that have wowed the world, such as ChatGPT, Midjourney, and Stable Diffusion, all share an open secret: They’re trained on vast volumes of data scraped from the internet.

Although some AI companies, including OpenAI and Israeli AI company Bria, have paid for access to the training data that makes their models work so well, others continue to rely on unfettered, free access to the world’s textual and image-led output.

But the parameters around that access are now changing. This week, Reddit cofounder and CEO Steve Huffman told The New York Times that the popular internet forum would begin charging companies that cull its data for AI training purposes. “The Reddit corpus of data is really valuable,” Huffman said. “But we don’t need to give all of that value to some of the largest companies in the world for free.”

The ramifications of that decision could have significant knock-on effects on the way that AI is trained. Reddit is the so-called front page of the internet; it’s where the world’s conversations take place. That’s rich pickings for companies developing large language models (LLMs).

Reddit has recognized its value. Per Huffman, Reddit will impose a paywall around its application programming interface (API), the method through which companies developing AI models are able to download data from the social platform. The level of pricing, and when it would happen, has yet to be determined, the executive said—though carve-outs would apparently remain for academic researchers to freely access the site’s content.

It marks a shift in approach that could change how AIs understand our world and, as AIs become more commonplace, how we humans do, too.

“The time of the free API may be over,” says Andres Guadamuz, an intellectual property law researcher at the University of Sussex. “The move makes sense for companies such as Reddit. In the absence of licensing agreements for training, API access is the next best thing to try to recover some money.” (Reddit didn’t respond to Fast Company’s request for comment.)

There is plenty of cash already flowing through some of AI’s biggest names. Alongside Microsoft’s $10 billion stake in ChatGPT creator OpenAI and the Googles and Amazons of the world, PitchBook’s analysis of deals announced and closed in the first quarter of 2023 suggests around $1.7 billion was invested in AI startups, with a further $10.7 billion in deals announced but not yet completed between January and March.

However, that cash isn’t evenly distributed. The world of generative AI is so new and buzzy that people in their own homes are generating ideas for tools and models that could prove transformative in the future. The concern is that people who occupy a middle ground—outside academia, but too small to be able to fork out for API access—could be left behind.

On the flip side, Reddit putting up a paywall to access its API could also encourage the developers of LLMs to think more closely about what they’re training their technology on, says Catherine Flick, a researcher in computing and social responsibility at De Montfort University in England.

“Scraping Reddit is lazy and gives companies a highly biased dataset with lots of very problematic data, as we’ve seen multiple times now,” Flick says. Indeed, Reddit’s dataset was shown to have negative gender, religious, and ethnic biases in a 2021 study, and gender biases against women politicians in a 2022 research paper.

With red tape now surrounding Reddit’s data, Flick hopes those who are developing and training LLMs might think about where they get their source data from and look for websites that more accurately reflect the whole of society. When the free, easy option isn’t available, people may decide to seek out better alternatives, even if those are harder to find. Of course, the opposite could be true: People could just as well continue to take the free route but do so through platforms that are even less representative of society.

“It also raises the question about where companies without capital to bootstrap their dataset will go—perhaps less reputable sites that are even worse than Reddit,” Flick says. Overall, she says that the problem “shows that there needs to be tighter regulation on what data goes into training LLMs and other machine learning applications.”

Flick would also like to see the folks at Reddit thinking more deeply about how they implement the paywall. Besides simply throwing up a monetary barrier, she says Reddit could require would-be scrapers to apply for access and be judged on the merits of their project.

“It would be good to see the ability for subreddits and users to opt in to large dataset scraping too as a lot of subreddits are used for quite intimate discussions,” she says. “But I suspect that’s a bit too much to hope for from a company that’s about to go public and needs to show it can return for investors.”

ABOUT THE AUTHOR

Chris Stokel-Walker is a freelance journalist and Fast Company contributor. He is the author of YouTubers: How YouTube Shook Up TV and Created a New Generation of Stars, and TikTok Boom: China's Dynamite App and the Superpower Race for Social Media.

