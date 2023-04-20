The models behind generative AI tools like ChatGPT, Midjourney, and Stability AI’s Stable Diffusion that have wowed the world all share an open secret: They’re trained on vast volumes of data scraped from the internet.

Although some AI companies, including OpenAI and Israeli AI company Bria, have paid for access to the training data that makes their models work so well, others continue to rely on unfettered, free access to the world’s textual and image-led output.

But the parameters around that access are now changing. This week, Reddit cofounder and CEO Steve Huffman told The New York Times that the popular internet forum would begin charging companies that cull its data for AI training purposes. “The Reddit corpus of data is really valuable,” Huffman said. “But we don’t need to give all of that value to some of the largest companies in the world for free.”

That decision could have significant knock-on effects on the way AI is trained. Reddit is the so-called front page of the internet; it’s where the world’s conversations take place. That makes it rich pickings for companies developing large language models (LLMs).