Date: 2023-07-31

Human abuse was hard enough.

In 2017, a team at Google released Perspective API, an AI-based tool designed to help flag the toxic speech that pushes people out of online spaces, pushes them to violence, or worse. Platforms like YouTube and Facebook were already building their own AI classifiers to battle all sorts of hate speech, but Perspective was open to anyone. That June, The New York Times announced that the tool would allow the paper to scale comments to most of its articles by the end of the year. By 2021, Jigsaw was processing about 500 million requests daily, in a reflection of how people were talking online. But around the same time, engineers at Jigsaw, the Google social good unit behind Perspective, also noticed that, at times, the number of requests would suddenly spike. Now the AIs were talking, and the companies behind them—Meta, OpenAI, Anthropic, and Google among them—needed to know how toxic they were. “Somebody says, here’s our billions of pieces of text, millions or billions of pieces of text,” says Lucy Vasserman, the lead engineer of Perspective. “And can we score it all in a day or in a week, or something like that?”

Lucy Vasserman, head of product and engineering at Jigsaw, the team behind Perspective API, at Jigsaw headquarters in New York City [Photo: Shira Almeleh/Jigsaw]

The surge in demand for Perspective by builders of large language models is a sign of the warp speed of AI development, and of the ad hoc ways developers are trying to keep their chatbots in line. Perspective and a wide range of classifiers have quickly become multipurpose tools for LLM safety, linchpins in industry efforts to keep the chatbots from saying harmful things. “It’s a really interesting offering that we can give to the ecosystem, and to technology as a whole,” says Vasserman. Perspective is one tool helping AI builders ask, “How do we protect these models and make sure that they are not having toxic moments in interacting with users?” But in the fuzzy, subtle, paradoxical, slang-filled world of words, even humans struggle to judge toxicity. And using AI to police human conversation—and to police AI—introduces unwanted trade-offs. Researchers have frequently demonstrated how toxicity classifiers fail in ways that most heavily impact non-English speakers and historically marginalized groups.

Because some terms—“Black,” “gay,” “trans,” “Jew,” “Muslim,” “rape”—frequently appear together with toxic language in online text, even non-toxic use of those words can be associated with toxicity within classifiers like Perspective. The models are also more likely to interpret innocent non-English phrases as hate speech and harassment. And when hate speech is written in veiled ways, with slang, or in non-English languages, classifiers can be easily fooled. This is true for Perspective and for all classifiers, including those used by YouTube, Meta, and other companies to root out policy-violating speech on their platforms. “How to build and use these classifiers without amplifying biases and errors is not straightforward,” says Srijan Kumar, an assistant professor of computer science at the Georgia Institute of Technology who studies LLMs. Vasserman readily acknowledges the limitations of Perspective and other classifiers, and worries that AI developers using it to build LLMs could be inheriting their failures, false positives, and biases. That could make the language models more biased or less knowledgeable about minority groups, harming some of the same people the classifiers are meant to help.

“Our goal is really around humans talking to humans,” she says, “so [using Perspective to police AI] is something we kind of have to be a little bit careful about.”

How to make large language models (and the classifiers that police them)

There are no prevailing rules around AI safety in the U.S., but the calls for guardrails are growing louder. Researchers caution that generative AI will supercharge the power of people to influence, disinform, or harm others, or that its outputs will simply parrot the patterns of hate, harassment, bigotry, and other toxic text in the training data. Like the various harms of algorithmic systems for risk scoring or face recognition, from privacy violations to sheer malfunction, the dangers of misaligned chatbots, especially for more vulnerable users, are creeping out into the open. When a British man told his chatbot companion in 2021 that he believed his purpose was to kill the Queen, prosecutors say the bot encouraged him, telling him “that’s very wise” and that he could do it “even if she’s at Windsor.” (He was later arrested while trying to do that.)

Last winter, a Belgian woman said her husband sought psychological comfort from an AI companion, who encouraged him to kill himself, which he did. The latest publicly available large language models, or LLMs, including the systems that power ChatGPT and Bard, have made improvements according to internal toxicity benchmark testing. But even with safety measures, they are still liable to produce toxic outputs, either by themselves or at the encouragement of users. Bard warns that it “may display inaccurate or offensive information that doesn’t represent Google’s views.” Google’s former CEO puts it another way: “Think of all the problems social media is causing today, especially for political polarization, social fragmentation, disinformation, and mental health,” Eric Schmidt wrote in a recent essay with Jonathan Haidt about the coming harms of generative AI. “Now imagine that within the next 18 months—in time for the next presidential election—some malevolent deity is going to crank up the dials on all of those effects, and then just keep cranking.”

If classifiers calculate a probability number for a given bit of text, LLMs are calculators for more words. Built with so-called foundation models, they take a user’s input and “infer” the next probable series of letters, based on patterns they’ve seen in billions of sentences or more—or, in the case of other generative AI models, images, sound, or video. The LLM training process happens in two broad and often laborious stages. During the unsupervised learning stage, the model learns how language works from a giant pile of data—books, Wikipedia, Reddit, the open web. During the reinforcement learning and fine-tuning stage, the model is taught what’s a “good” answer and what’s not, in some cases by updating its billions of parameters using large datasets of tens of thousands to millions of annotations by human data workers. Later, during conversation, an LLM can also be prompted to behave in certain ways by users. But it can behave in unpredictable ways, too. The datasets, full of copyrighted material, introduce new questions about ownership, but also potentially plenty of toxicity and misinformation. How the model infers patterns in this data is not easily discernible, and its abilities of inference can lead to surprising outputs.

Meanwhile, the autonomy of AI systems means they can make decisions without the need for a human, removing measures of control but also accountability. These aren’t just ethical problems but potentially legal ones: Ensuring fairness and explainability is already critical in European data law, and is particularly important in regulated industries like financial services, where consumers are ostensibly protected from unfair and discriminatory behavior. Given the stochastic nature of LLMs, “it seems unlikely you’ll end up with a model that will never say anything you don’t want it to,” says Lucas Dixon, an AI researcher who helped build Perspective and now leads Google’s People and AI Research unit. If the systems end up regurgitating unwanted linguistic patterns online, they are likely to feed the training for future language models, in a vicious circle of toxicity, misinformation, and inequity. “In accepting large amounts of web text as ‘representative’ of ‘all’ of humanity,” then-Google researcher Timnit Gebru and colleagues warned in a foundational critique in 2021, “we risk perpetuating dominant viewpoints, increasing power imbalances and further reifying inequality.”

How using toxicity classifiers to build LLMs introduces bias

AI classifiers like Perspective are essential in helping root out toxic data and behavior in LLMs. First, they can help find toxic text in a dataset and curate and annotate it accordingly. Later, while a model is being trained and fine-tuned, a tool like Perspective can be used to test and improve outputs. Finally, during conversations with humans, it can help filter for toxic inputs and outputs, as a version of it does in Bard. (OpenAI’s ChatGPT uses a proprietary set of toxicity classifiers to screen language to and from the raw GPT model.) You can see how that works if you ask Bard for, say, good jokes about a certain ethnic group. When I did, it apologized and explained, “My purpose is to help people, and that includes protecting people from harmful stereotypes and discrimination.” (When I asked for some jokes about robots, it fired back: “I’m not programmed to assist with that.”)
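The third use described above, screening text on its way into and out of a model, can be sketched in a few lines. This is a minimal illustration, not how Bard or ChatGPT is actually built: `guarded_reply`, `toy_score`, `toy_model`, the refusal message, and the 0.9 cutoff are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable

# Hypothetical refusal text; real products word this differently.
REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    user_text: str,
    generate: Callable[[str], str],
    score_toxicity: Callable[[str], float],
    threshold: float = 0.9,  # assumed cutoff, for illustration only
) -> str:
    """Screen the prompt, generate a reply, then screen the reply."""
    if score_toxicity(user_text) >= threshold:
        return REFUSAL
    reply = generate(user_text)
    if score_toxicity(reply) >= threshold:
        return REFUSAL
    return reply


# Toy stand-ins so the sketch runs without any real model or classifier.
def toy_score(text: str) -> float:
    return 0.95 if "idiot" in text.lower() else 0.05


def toy_model(prompt: str) -> str:
    return "You asked: " + prompt


clean = guarded_reply("hello", toy_model, toy_score)
blocked = guarded_reply("you idiot", toy_model, toy_score)
```

The same classifier is called twice, once per direction, which is why any bias in its scores affects both what users may ask and what the model may answer.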

As with any AI, central to the challenge of building Perspective is defining what exactly toxicity is. Perspective’s creators define an utterance to be toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. To train their algorithms to predict toxicity, Vasserman and her colleagues at Jigsaw rely on a continually updated dataset about what humans consider offensive, collected from users and Google’s proprietary data. For the initial model—built in the wake of and as a response to Gamergate in 2014 and 2015—engineers at Jigsaw and Google’s Counter Abuse Technology team secured 17 million comments from The New York Times comments sections, along with data about how moderators rated them. Jigsaw also partnered with Wikipedia to gather over 130,000 comments from the site’s discussion pages, and asked a panel of 10 crowdworkers to review each one for attributes like “harassment” or “personal attack.” By extrapolating from this data—which words have been rated toxic by humans—machine learning models predict the likelihood that a given piece of text will be perceived as toxic, producing a score between 0 and 1.

Perspective API uses a mix of factors, learned through millions of annotated comments and human feedback, to predict the toxicity of a comment, on a scale from 0 to 1. [Credit: Jigsaw]

But the predictions are inevitably flawed: The humans’ judgments of toxicity are themselves biased and often don’t represent the full spectrum of harassment. Even with a large and robust dataset, the model may not always learn the right patterns from it.
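For readers curious what such a score looks like in practice, here is a rough sketch of preparing a request for Perspective and reading the score out of a reply. The JSON field names follow Perspective's public documentation as best I understand it and should be checked against the current docs; the helper names are hypothetical, no network call is made, and the 0.82 in the sample reply is a made-up value.

```python
import json


def build_request(text: str, language: str = "en") -> str:
    """Build the JSON body for a single toxicity-scoring request."""
    body = {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {"TOXICITY": {}},
    }
    return json.dumps(body)


def toxicity_from_response(response: dict) -> float:
    """Pull the overall toxicity probability (0 to 1) out of a parsed reply."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Illustrative reply only: the score here is invented for the example.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.82, "type": "PROBABILITY"}}
    }
}
score = toxicity_from_response(sample_response)
```

The number that comes back is a probability that readers would perceive the text as toxic, not a verdict, which is why the choice of cutoff matters so much.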

Reflecting the associations the classifiers draw about Black people, LGBTQ people, and other marginalized groups, researchers have shown that Perspective has been more likely to label nontoxic social media posts about people with disabilities, comments written in African American English, and tweets from drag queens as toxic—and in some cases more toxic than tweets by white supremacist personalities. OpenAI notes on an FAQ page that its own classifier “may give higher hate predictions if the input contains ‘gay’ and higher sexual predictions if the input contains ‘her,’” and cautions that it “has not yet rigorously evaluated or optimized performance on non-English text.” At Jigsaw, the problems have been so vexing from the start that the Perspective team initially named its blog The False Positive. “It’s something we’ve been dealing with for quite a while,” says Vasserman. “I would love to say we’ve balanced our data and solved our problems, but that’s not true. Our toxicity scores, some are way off, and there is still a lot of work to be done.”

To avoid false positives, the Perspective model card recommends that the system be used with a toxicity threshold of at least 0.7, in order to filter out only data that Perspective has high confidence is toxic; for filtering data during LLM training, Vasserman recommends an even higher threshold, of 0.9. Perspective’s engineers also suggest that its evaluations be reviewed by humans. “Machine learning models will always make some mistakes, so it’s essential to build in mechanisms for humans to catch and correct accordingly,” says the Perspective website. But many LLM developers are using Perspective automatically at scale, and with lower toxicity thresholds, says Vasserman, in a way that could be quietly warping their models. In general, “detoxifying” a large language model might remove hate speech, but it can make the model less capable: If it’s too careful not to talk about bigotry, it won’t be able to talk openly about the problems with bigotry. Similarly, it’s thought that when toxic text is removed using AI, so too are the identity terms that tend to appear nearby. If the standard threshold for toxicity benchmarking in LLMs is too stringent, that may end up eliminating LLMs’ ability to say things about underrepresented communities, in ways that reinforce negative effects on Black people, gays, and other historically disadvantaged groups.
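The threshold advice can be made concrete with a small sketch. The 0.7 and 0.9 cutoffs come from the recommendations quoted above; routing the band between them to human review is my own assumption about how the two pieces of advice might be combined, and `triage` and the toy scores are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def triage(
    texts: Iterable[str],
    score: Callable[[str], float],
    drop_at: float = 0.9,    # cutoff suggested for filtering LLM training data
    review_at: float = 0.7,  # model-card minimum for high-confidence toxicity
) -> Tuple[List[str], List[str], List[str]]:
    """Split texts into (keep, send to human review, drop)."""
    keep, review, drop = [], [], []
    for text in texts:
        s = score(text)
        if s >= drop_at:
            drop.append(text)
        elif s >= review_at:
            review.append(text)  # assumption: uncertain cases go to people
        else:
            keep.append(text)
    return keep, review, drop


# Made-up scores standing in for a real classifier.
toy_scores = {"a": 0.10, "b": 0.75, "c": 0.95}
keep, review, drop = triage(toy_scores, toy_scores.get)
```

Lowering `drop_at` moves more borderline text, including harmless mentions of identity terms that score in the middle, out of the training data without anyone looking at it.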

For instance, OpenAI, Meta, Anthropic, Google, and others have tested a number of models using RealToxicityPrompts, a benchmark that feeds an LLM thousands of toxic and non-toxic prompts and evaluates the machine’s responses using Perspective. Produced by a team of researchers at the University of Washington and the Allen Institute for Artificial Intelligence, it’s become an industry standard for evaluating toxicity in LLMs. But developers, including those behind RealToxicityPrompts, tend to label a prompt toxic if it has a score of just 0.5 or greater, not the 0.9 recommended by the Perspective team. The RealToxicityPrompts researchers themselves acknowledge the risks of their benchmark: “We use an imperfect measure of toxicity that could bias the toxicity towards lexical cues, failing to detect more subtle biases and incorrectly flagging non-toxic content.”

“I noticed that many researchers use a lower threshold for toxicity than we generally recommend—meaning they are flagging more things as toxic when the model is uncertain, increasing the risk for false positives,” Vasserman says. “This increases the potential for bias, too: If too many things have been filtered out due to a low threshold, the generative model might perform worse in conversation on certain topics.”
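A little arithmetic shows why the gap between 0.5 and 0.9 matters. The scores below are invented for illustration and `flagged_fraction` is a hypothetical helper; the point is only that the same set of model responses looks far more toxic under the lower cutoff.

```python
from typing import Sequence


def flagged_fraction(scores: Sequence[float], threshold: float) -> float:
    """Share of scores at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up toxicity scores for six model responses.
scores = [0.05, 0.30, 0.55, 0.62, 0.71, 0.93]

at_050 = flagged_fraction(scores, 0.5)  # four of six responses flagged
at_090 = flagged_fraction(scores, 0.9)  # one of six responses flagged
```

The three responses scoring between 0.5 and 0.9 are exactly the uncertain cases Vasserman describes, where a false positive is most likely.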

The same problem occurs when you use Perspective to filter toxic language out of the dataset before training. In one experiment at Google’s DeepMind, researchers found that a model trained on data filtered in part using Perspective was 17 times less likely to produce toxic content when provided with a non-toxic prompt than a model trained on unfiltered data. But it also performed much worse when generating text about or by groups who are frequently targeted by online toxicity. Without enough representation in its training data, the model can’t say much about particular groups.

It’s unclear what effects classifiers are having on LLM development. There are no standard metrics or authoritative data sets for evaluating LLM bias or any proposed policies. And AI safety is voluntary: Companies or individuals are free to choose how many guardrails to build into their models, how to let the public use them, and how transparent they are about how they are built. Some LLMs have few safeguards, if they have any at all, or are specifically trained to be un-woke, “based,” or outright toxic, trained on datasets like 4chan. Google and OpenAI may spend resources on toxicity detection, but “open-source LLM deployments will almost certainly not use these filters,” says Kumar.

Why toxic data might be good

Researchers are scrambling to find better approaches to detoxification. Instead of scrubbing toxic speech out of the data, one approach calls for leaving it in, so the model can learn what it looks like and avoid it. Google’s Instruction-Finetuning (“Flan”) technique, which was used to train the model underlying Bard, coaxes a model to learn what toxicity looks like in its vast data set of human speech, using a set of examples of toxicity that Jigsaw created, and then feeds it a set of instructions.

This results in models that are less likely to produce toxicity while also performing well in contexts where understanding toxicity is important, like a system that can also classify toxic speech, or even explain why it’s toxic. Another technique called constitutional AI performs similarly well, using human feedback and a second AI model that instructs the system to adhere to certain human-written values.

🔭 How to reduce #LLM generation toxicity/bias? I'm surprised this finding hasn't received any attention: Instruction Tuning (e.g. Flan, T0) reduces toxic generations A LOT ✨ w/o any Human Feedback ✨. ➡️ I.e. #ChatGPT-esque Human values alignment w/o human feedback. 1/ pic.twitter.com/pLTP0OUHJC — Shayne Longpre (@ShayneRedford) February 21, 2023

“If you detoxify your training set, the model is unable to recognize if something is toxic or not,” says Dixon.

Developers can mix and match strategies for balancing a model, but in general, all of them are costly, according to Dixon. “Humans must annotate giant data sets and train the model with thousands of prompts; computation is intense; system instructions can be jailbroken. Decoding requires training a good supervisor model that matches your policy, whatever your policy is. Poor toxicity detection can end up making the system less useful, and liable to discriminate against minority groups and smaller languages.” One promising approach to building better, less-biased toxicity detection is coming from the same transformer architecture underlying LLMs. To detect undesired content in ChatGPT that’s sexual, hateful, violent, or promotes self-harm, OpenAI uses a set of GPT-based classifiers. Last year, the Perspective team used Google’s Charformer foundation model to add ten new languages for which it didn’t have enough pre-training data: Arabic, Chinese, Czech, Dutch, Indonesian, Japanese, Korean, Polish, Hindi, and Hinglish. Crucially, these kinds of LLM-based classifiers can learn what a user defines as toxic much more quickly and cheaply than previous kinds of toxicity detectors. It turns out that instructing an LLM using a few examples and compelling it to explain its rationale can quickly adapt the LLM for classification, without the need to update all of its parameters, or for hundreds or thousands of human annotators.
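The few-examples-plus-rationale idea can be pictured as nothing more than a carefully assembled prompt. The sketch below shows plain in-context prompting only, not OpenAI's or Jigsaw's actual classifiers; `build_fewshot_prompt`, the label names, and the example comments are all hypothetical, and the call to an LLM is left out.

```python
from typing import List, Tuple

# Each example is (comment, label, short rationale).
Example = Tuple[str, str, str]


def build_fewshot_prompt(
    instructions: str, examples: List[Example], comment: str
) -> str:
    """Assemble instructions, labeled examples, and the new comment."""
    lines = [instructions.strip(), ""]
    for text, label, rationale in examples:
        lines.append(f"Comment: {text}")
        lines.append(f"Label: {label}")
        lines.append(f"Why: {rationale}")
        lines.append("")
    # The model is expected to continue with a label and an explanation.
    lines.append(f"Comment: {comment}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_fewshot_prompt(
    "Label each comment as toxic or not_toxic and explain why.",
    [
        ("You are all idiots.", "toxic", "It insults other readers."),
        ("I disagree with this article.", "not_toxic", "It is civil criticism."),
    ],
    "Thanks for sharing your story.",
)
```

Changing what counts as toxic means editing the instructions and the handful of examples, rather than collecting and annotating a new dataset.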

Recent research by Dixon’s team, led by Maximilian Mozes, a PhD student at University College London, showed that simply prompting a large language model with a labeled dataset of as few as 80 examples of toxic speech, along with instructions, like “don’t stereotype women,” quickly produced a high-performing toxicity classifier. This parameter-efficient tuning process—in what Dixon calls “an in-between space between prompting and fine-tuning”—can outperform previous state-of-the-art fine-tuning approaches with much larger datasets, producing toxicity scores that are “equal to or better than human-annotation quality.” As the former chief scientist at Jigsaw, Dixon led the initial development of Perspective in 2016, a laborious process that still involves a big machine learning team and millions of examples annotated by hundreds of people. And this kind of toxicity work, predictably, can take a heavy toll on those data workers. But with this new method, “you can give the model much, much less data, and the model gets the pattern more rapidly,” says Dixon. A single developer “could do this in a day.” These new kinds of classifiers could also be quickly trained to better capture a range of problematic speech of various kinds—tailored to specific cultures, social groups, and personal experiences. And they raise other tantalizing possibilities for AI-enhanced moderation. Could moderators use a platform’s community guidelines, written in plain language, as a direct input to train toxicity and other classifiers? Could classifiers themselves respond to toxic inputs in helpful ways, the way Bard now sometimes does, by encouraging more positive conversations? Vasserman says Jigsaw is now pursuing these ideas with some of its partners.

“We’re starting to explore . . . are there things other than toxicity that we might be able to build models for now, because we don’t need as much data,” says Vasserman. “Maybe more nuanced, specific subtypes of toxicity, like misogyny. Or even thinking about the flip side of what might make a good conversation, such as recognizing comments with personal anecdotes or with questions.” However powerful the classifiers get, Vasserman emphasizes the importance of keeping humans involved in content moderation and AI safety. But as Big Tech firms put more AI to work, they have also laid off many specialists in responsible AI. Days before the release of OpenAI’s ChatGPT, Microsoft laid off the ethics and society team within its AI unit. At Google, the trust and safety department lost a team of program managers who assisted policy experts, while YouTube lost two misinformation and two policy experts teams who help define what’s unacceptable content. In January, Forbes reported, Google also reduced Jigsaw’s staff of 50 by at least 20, including a number of employees working on Perspective. A Jigsaw spokesperson declined to comment on its current headcount, but says the unit remains focused on tackling toxicity and hate, misinformation, violent extremism, and repressive censorship. A spokesperson for Google says that responsible AI remains a “top priority” and that it is “continuing to invest in those teams.” Microsoft, which still maintains an Office of Responsible AI, has said it is increasing its overall investment in responsibility research.

A former Jigsaw employee stresses that Perspective can only be a stopgap measure for AI safety, and worries that its use illustrates a larger problem in AI, where safety is often like whack-a-mole, and treated as an afterthought. “I’m concerned that the safeguards for models are becoming just lip service—that what’s being done is only for the positive publicity that can be generated, rather than trying to make meaningful safeguards,” the ex-employee says. For her part, Vasserman is realistic about the limits of AI when it comes to detecting harmful speech, whoever—or whatever—is talking. AI can classify and produce words, but for now it doesn’t know the world in which those words operate and can’t know the damage toxic language can do. “I think we are slowly but surely generally coming to a consensus around, these are the different types of problems that you want to be thinking about, and here are some techniques,” she says. “But I think we’re still—and we’ll always be—far from having it fully solved.”