Alphabet’s AI Is Slowly Getting Better At Flagging The Internet’s Worst Trolls

After announcing a tie-up with forum software provider Disqus, Alphabet’s Jigsaw division flopped at spotting vitriolic comments on Breitbart.

[Photo: Flickr user Denise Coronel]

On the internet, not even artificial intelligence software knows you’re a troll. With the increase in political discord in the U.S., online speech has grown more toxic, especially in the comments sections of politically charged sites like Breitbart News. While Breitbart writers’ framing of issues might upset liberal sensibilities with their conservative slant, the site’s nastiest reader comments are clearly indefensible, such as: “If blacks left America, it would make America great again,” or “So many Mooslims, so little time and ammunition.”


Those are just some, and not even the worst, of the comments that made it past the “toxicity” filter in Perspective, a new comment-analysis software that is now being used to flag offensive statements on Breitbart and thousands of other sites. Toxic describes “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion,” according to Perspective maker Jigsaw, a division of Google’s parent company, Alphabet. Launched in February, Perspective has picked up some high-level partners, such as the New York Times, which uses its AI to help forum moderators sort through user comments. On August 30, Disqus, the biggest provider of online discussion software (if you don’t count Facebook), announced a new Toxicity Mod Filter that uses Perspective as one of its tools to flag comments for site moderators. (Disqus says that it has developed additional tools internally.) Whether moderators even care about toxic speech is a subject I’ll get to later.

Caution: This article contains racist terms and disturbing language.

Lost Perspective

Perspective’s toxicity filter is still a work in progress, as demonstrated by how it handled over a dozen vicious reader comments culled from recent Breitbart articles and provided to Fast Company by two anti-hate speech activists, John Ellis and EJ Gibney. Using Jigsaw’s free online demo tool, none of the comments scored above a 56% likelihood of being toxic on Perspective’s rating system. (The two I mentioned above scored in the 30s.) These are false negatives: failures to flag comments that, to any reasonable human, are clearly offensive.


Ellis and Gibney ran their tests on the public demo page for Perspective, where visitors can enter text to get a toxicity score. The demo uses just one of the 11 filters Perspective is developing, and, perhaps inadvertently, it illustrates how fragmented such technologies currently are. Ideally, a “Toxicity” filter would do what it promises, but until then, organizations like the New York Times are using additional filters developed with Jigsaw, such as whether the comment is “Inflammatory,” “Obscene,” or an “Attack” on the author of the article or another commenter. (Disqus is also training its own filters, based on comments collected from its service.)

In the examples of comments that follow—all taken from Breitbart—we provide Perspective’s toxicity score, which generally misses the mark, as well as any other criteria that provided a significantly higher score.

“Let me guess…they were n#iggers?”
Toxicity: 28% (Obscene: 94%)

“It really is time people to start blowing up every single mosque (gangster houses). Just make sure it’s full, and the doors are locked from the outside when you do it”
Toxicity: 30% (Inflammatory: 88%)

“Another inconvenient fact: Rodney King got exactly what he deserved”
Toxicity: 30% (Inflammatory: 68%)

Perspective’s creators at Jigsaw are the first to admit that their technology, which they describe as an alpha version, has a long way to go. The project’s own documentation on GitHub states: “The model is still far from perfect–it will make errors: it will be unable to detect patterns of toxicity it has not seen before, and it will falsely detect comments that are similar to patterns of previous toxic conversations.”


Related: Google’s Fighting Hate And Trolls With A Dangerously Mindless AI

So the Jigsaw team wasn’t surprised when I asked about the poor scores with Breitbart comments. Engadget had recently published its own critique of the software, by typing phrases into the Jigsaw demo tool. Engadget looked for false positives: statements like “I am a black trans woman with HIV,” which were rated as likely toxic (77%, in that case). Jigsaw says that Engadget wasn’t using the very latest version of Perspective, but it agrees that the software still fails a lot.

“It should be a pretty quick and clean civil war – every conservative with a gun simply go out and put a round right between the eyes of every liberal you know – families too as we can’t have their kids growing up to start the problem all over again.”
Toxicity: 31% (Inflammatory: 85%)

“nothing nastier than a muzzy”
Toxicity: 35%


Behind both the false negatives and false positives is the same weakness: Perspective hasn’t seen nearly enough comments to make better judgments. In the case of false positives, most comments it’s seen with words like “transgender” or “gay” are derogatory—reflecting the sad state of online discourse—so it associates those words with toxic speech. Jigsaw tells me that it’s improved in this regard since Engadget ran its test. But when I tried the same phrase recently, the score had only dropped to 71% likely to be toxic.
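That guilt-by-association failure mode is easy to reproduce. Below is a toy sketch, not Jigsaw’s actual model, using invented training data and plain Python: a naive word-level scorer is trained on a skewed corpus in which an identity term appears almost exclusively inside abusive comments, so a benign sentence scores as toxic on the strength of that one word.

```python
from collections import defaultdict

# Toy corpus mirroring the skew described above (1 = toxic, 0 = clean).
# In this invented data, "trans" appears only in abusive comments.
train = [
    ("trans people disgust me", 1),
    ("the trans agenda is ruining everything", 1),
    ("I hate trans activists", 1),
    ("the weather is nice today", 0),
    ("great article thanks for sharing", 0),
    ("I disagree with this policy", 0),
]

toxic = defaultdict(int)   # how often each word appears in toxic comments
total = defaultdict(int)   # how often each word appears overall

for text, label in train:
    for w in set(text.lower().split()):
        total[w] += 1
        toxic[w] += label

def word_score(w):
    """Laplace-smoothed P(toxic | word); unseen words get 0.5."""
    return (toxic[w] + 1) / (total[w] + 2)

def score(text):
    """Score a comment by its most toxic-looking word."""
    return max(word_score(w) for w in text.lower().split())

# The benign sentence inherits the word's bad reputation:
# score("I am a trans woman") -> 0.8, driven entirely by the word "trans"
```

The point of the sketch is that the model never sees the sentence’s intent, only word statistics, which is exactly the weakness Jigsaw describes.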


Related: YouTube Has Finally Started Hiding Extremist Videos, Even If It Won’t Delete Them All

Perspective stumbles over anything but blatant, correctly spelled threats and insults. It missed the true meaning of rambling phrases from Breitbart like “It should be a pretty quick and clean civil war – every conservative with a gun simply go out and put a round right between the eyes of every liberal you know” (score of 31%). Perspective was also fooled by elementary spelling obfuscations like “Let me guess…they were n#iggers?” (score of 28%).

Each of those examples requires a different type of natural language processing model to find toxic speech, says Jigsaw. The rambling threat is best served by a word-level model, which looks at the meaning of a string of words and probably should have figured out that talking about guns and putting rounds between people’s eyes refers to violence. Catching a childish obfuscation of the N-word requires a character-level model to recognize that “n#iggers” is darn close to a really bad word.
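To see why character-level features help with obfuscation, here is a minimal sketch, not Jigsaw’s code, using a mild placeholder word in place of a real slur list: an exact word lookup misses the obfuscated spelling entirely, while character-bigram overlap still sees most of the underlying word.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams (bigrams by default) of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(a, b, n=2):
    """Jaccard overlap between two words' character n-gram sets (0.0-1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

blocklist = {"stupid"}  # placeholder for a real list of offensive terms

# A word-level lookup treats "st#upid" as an unknown token and misses it:
# "st#upid" not in blocklist  -> True
# Character bigrams still recognize most of the word:
# similarity("st#upid", "stupid")  -> 4/7, about 0.57
```

A character-level model in production learns these patterns statistically rather than with a hand-built blocklist, but the underlying intuition is the same: the inserted symbol disturbs only a couple of character sequences.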


“So many Mooslims, so little time and ammunition.”
Toxicity: 38%

“95% of the media is controlled by Jews who are only 2 % of the population and 90 % vote for Dems…. So this is where all the propaganda comes from.”
Toxicity: 39% (Inflammatory: 91%)

“ha,ha. This san.igger just signed his own deportation, if not death, warrant.”
Toxicity: 46% (Inflammatory: 71%)

Companies like the New York Times and Disqus can employ any, or all, of the 11 models that Perspective offers (including those developed with the New York Times) through its application programming interface (API). I asked Jigsaw to run the 15 Breitbart comments through additional models to get the New York Times Inflammatory, Attack, and Obscene scores. In some cases, those scores are high enough to serve as a backup until the toxicity filter gets better. Adding the character-level filter also caught obfuscated epithets like variants of the N-word in most cases, although the particular character combination in the Breitbart comment, “n#iggers,” sailed right through, showing that even a fairly simple task doesn’t always succeed.

Jigsaw knows it needs a lot more, and nastier, data to improve Perspective’s understanding of toxicity, beyond what partners like Wikipedia and the New York Times have provided. The average toxic comment from its partners is likely to come simply from someone having a bad day, says Jigsaw, which is different from the poison that a neo-Nazi or Klansman spews.

“Just your garden variety low IQ, N O G.”
Toxicity: 56% (Attack: 96%)

“no matter what the advocates say, gays will never be normal”
Toxicity: 56% (Inflammatory: 72%)

“To all you wet backs, “Don’t mess with Texas!” Texas bites back.”
Toxicity: 56% (Inflammatory: 72%, Obscene: 73%)

To teach Perspective what real hate looks like, Jigsaw has partnered with organizations that track the nastiest hate speech. Jigsaw declined to name the organizations, since some are themselves the victims of online harassment. I asked a few likely candidates, including the Anti-Defamation League, if they collaborate with Jigsaw. Jonathan Vick, the ADL’s associate director for research technology and cyberhate response, would say only that “We’re working with a number of platforms” to provide training material for such AI filters. “All of the artificial intelligence programs need that kind of guidance,” says Vick, adding, “We applaud what Jigsaw is doing.”


Jigsaw also takes submissions from anyone who wants to upload samples to its site, in formats like a spreadsheet or a link to a Google sheet. It offers all the training sets free on GitHub and lets people apply for free access to the API, so anyone can tinker with and try to improve the system.

No Deus Ex Machina

At this point, Perspective is little more than a science experiment. The New York Times moderators still read every single comment and are simply using the numbers to see if they can group similar comments into bundles. Disqus is using it as just one of several inputs in development. Perspective is far from a reliable tool even for human moderators, let alone for an automatic hate speech filter, as its GitHub page states: “We do not recommend using the API as a tool for automated moderation: the models make too many errors.”

“Jews always jump to exploit the empathetic nature of christians. They don’t even understand it. To jews empathy is a character weakness which is why they need to be completely obliterated.”
Toxicity: 56% (Inflammatory: 91%)

“The only good moozie is one no longer breathing and left in a tub of pork renderings”
Toxicity: 56%

Reliability itself is a tricky topic. Jigsaw is emphatic that people should not use Perspective to automatically delete comments. It’s at best a tool to provide a heads-up that a comment might be troublesome. Companies can also set the threshold for what gets flagged. If the threshold is 90%, then even the false positive “I am a black trans woman with HIV” wouldn’t get flagged. But the Breitbart examples show that, to avoid false negatives, the threshold would have to be set very low.
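The threshold trade-off can be made concrete with scores quoted in this article (a minimal sketch in plain Python, scores on a 0-1 scale): a 90% threshold flags nothing at all, while a threshold low enough to catch the Breitbart comments also flags the benign false positive.

```python
def flag(scores, threshold):
    """Return the comments whose toxicity score meets the threshold."""
    return [text for text, score in scores if score >= threshold]

# Toxicity scores reported in this article.
scores = [
    ("I am a black trans woman with HIV", 0.77),    # benign false positive
    ("So many Mooslims, so little time...", 0.38),  # hateful false negative
    ("The only good moozie is...", 0.56),           # hateful false negative
]

# At a 90% threshold nothing is flagged, hateful or not:
# flag(scores, 0.90) -> []
# At 35%, both hateful comments are caught, but so is the benign one:
# flag(scores, 0.35) -> all three comments
```

Wherever a site sets the dial, with the current model it trades missed hate speech against wrongly flagged benign comments.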


Related: Disqus Grapples With Hosting Toxic Comments On Breitbart And Extreme-Right Sites

That assumes moderators even care about toxic comments. Breitbart continues to let a lot of nastiness through, and Disqus powers a few far nastier sites, such as those run by white supremacist Richard Spencer.

In light of that, activists like Ellis and Gibney have called for Disqus and other forum providers to take matters into their own hands, zapping the worst comments and even dropping clients that continually let visitors spew vitriol. Overseeing that massive flow of speech would overwhelm the software providers (Disqus, for instance, has only a few dozen employees) unless there were an automated way to find and remove offensive content. And as the state of Perspective shows, we’re a long way from that capability.



About the author

Sean Captain is a business, technology, and science journalist based in North Carolina. Follow him on Twitter @seancaptain.