How To Fool A Neural Network

Machine learning is vulnerable to trickery–and scientists are racing to understand why. “If we can do this, so can the bad guys,” says one researcher.

[Source Image: courtesy Anish Athalye]

An autonomous train is barreling down the tracks, its cameras constantly scanning for signs that indicate things like how fast it should be going. It sees one that appears to require the train to increase its speed–and so it does. A few heartbeats later, the train narrowly avoids a derailment. Later, when a human investigator inspects the sign in question, they see something different–a warning to slow down, not speed up.


It’s an extreme hypothetical, but this example illustrates one of the biggest challenges facing machine learning today. Neural networks are only as good as the information they’re trained on, which has led to high-profile examples of how susceptible they are to bad data riddled with bias. But these technologies are also vulnerable to another kind of weakness known as “adversarial examples.” An adversarial example occurs when a neural net identifies an image as one thing–while any person looking at it sees something else.

The phenomenon was discovered in 2013, when a group of researchers at Google and several universities realized they could slightly shift the pixels in an image so that it would appear the same to the human eye, but a machine learning algorithm would classify it as something else entirely. For instance, an image might look like a cat to you, but when a computer vision program looks at it, it sees a dog.
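To make the pixel-shifting idea concrete, here is a toy sketch in Python. It is not the 2013 researchers' method or a real vision model: the eight-pixel "image", the weights, and the cat/dog labels are all invented for illustration, and the nudge is a single sign-of-gradient step in the spirit of the later "fast gradient sign" attack. Each pixel moves by at most 0.05 on a 0-to-1 scale, yet the label flips.

```python
# Hypothetical linear "cat vs. dog" scorer over an 8-pixel image (values in [0, 1]).
# The weights and pixels are made up; a real classifier has millions of parameters.
WEIGHTS = [0.9, -0.7, 0.8, -0.6, 0.5, -0.9, 0.7, -0.8]


def score(pixels):
    """Positive score means 'cat', otherwise 'dog'."""
    return sum(w * p for w, p in zip(WEIGHTS, pixels))


def classify(pixels):
    return "cat" if score(pixels) > 0 else "dog"


def sign(value):
    return (value > 0) - (value < 0)


def perturb(pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the cat score.

    For a linear scorer the gradient with respect to each pixel is just its weight,
    so the step direction is the sign of that weight. Results are clamped to [0, 1].
    """
    return [
        min(1.0, max(0.0, p - epsilon * sign(w)))
        for w, p in zip(WEIGHTS, pixels)
    ]


image = [0.52, 0.48, 0.55, 0.50, 0.60, 0.45, 0.50, 0.47]
adversarial = perturb(image, epsilon=0.05)

print(classify(image), "->", classify(adversarial))
print("largest pixel change:", max(abs(a - b) for a, b in zip(image, adversarial)))
```

The changes are small per pixel but all push the score the same way, which is why their combined effect is enough to cross the decision boundary.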

Why is this quirk so important–and potentially risky? Imagine if an autonomous vehicle is driving down the street, and instead of a stop sign, it sees a speed limit sign. What would it mean if someone could craft a financial document that appeared one way when a person looked at it, but showed entirely different numbers when it was scanned into a computer? Or if someone with malicious intent created a weapon that, when scanned by a TSA camera that was using deep learning to process images, appeared to be something innocuous–like, say, a turtle?

When they were first discovered, adversarial examples weren’t considered cause for worry. Many researchers thought they were an edge case, a random theoretical quirk. After all, creating an adversarial example at the time required full access to the innards of the algorithm it was going to deceive. Researchers were only able to build these examples with digital images–the precise, pixel-level control you have over a digital image was lost the moment you printed it out, because a printer’s resolution couldn’t capture such fine shifts in pixels. For instance, though you might fool an algorithm into thinking a dog is a cat using a digital image, it wouldn’t be fooled if you printed out the image and asked the algorithm to identify the printout. Altering an object in the real world seemed even more far-fetched. It appeared impossible to create an object with such subtle shifts in its shape that an AI would confuse it for something else. Plus, even if you accomplished that, the trick would stop working the second you changed angles.

Or so the research community thought. Earlier this month, a group of students from MIT successfully 3D-printed an object that looks like a cute little turtle–but is classified by machine learning algorithms as a rifle. “We’ve shown they’re not weird corner cases or oddities,” says Anish Athalye, a PhD student at MIT. “You can actually fabricate these physical objects that fool these [networks] in real life.”


The students created an algorithm of their own that can produce physical adversarial examples that keep working regardless of blurriness, rotation, zoom, or any change in angle–both in images that are printed out and in 3D models. In other words, their turtle-rifle isn’t just a one-off. For instance, they also 3D-printed a baseball that a computer is convinced is an espresso. These objects reliably fool Google’s InceptionV3 image classifier, which can identify images of 1,000 different objects.
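The core idea, crafting one perturbation that holds up across many viewing conditions rather than a single one, can be sketched with a toy Python example. This is not the students' algorithm (their paper calls its approach Expectation Over Transformation, and it targets a real deep network); here the four-pixel "object", the weights, the labels, and the use of a cyclic pixel shift as a stand-in for a change of viewpoint are all assumptions made for illustration.

```python
# Hypothetical linear scorer: positive score means "turtle", otherwise "rifle".
WEIGHTS = [2.0, 1.0, -1.0, -2.0]
VIEWS = [0, 1]  # each view is a cyclic shift, standing in for a change of angle


def render(pixels, shift):
    """Return the pixels as seen from a shifted viewpoint."""
    return pixels[shift:] + pixels[:shift]


def score(pixels, shift):
    return sum(w * p for w, p in zip(WEIGHTS, render(pixels, shift)))


def label(pixels, shift):
    return "turtle" if score(pixels, shift) > 0 else "rifle"


def gradient(shift):
    """Gradient of the score with respect to the unshifted pixels for one view."""
    n = len(WEIGHTS)
    return [WEIGHTS[(j - shift) % n] for j in range(n)]


def sign(value):
    return (value > 0) - (value < 0)


def attack(pixels, views, epsilon):
    """One sign step against the gradient averaged over the given views."""
    grads = [gradient(s) for s in views]
    mean_grad = [sum(g[j] for g in grads) / len(grads) for j in range(len(pixels))]
    return [
        min(1.0, max(0.0, p - epsilon * sign(g)))
        for p, g in zip(pixels, mean_grad)
    ]


image = [0.5, 0.55, 0.5, 0.45]
single_view = attack(image, views=[0], epsilon=0.1)   # tuned for one angle only
robust = attack(image, views=VIEWS, epsilon=0.1)      # averaged over both angles

for name, candidate in [("original", image), ("single-view", single_view), ("robust", robust)]:
    print(name, [label(candidate, s) for s in VIEWS])
```

In this toy setup the single-view perturbation fools the scorer head-on but fails once the view shifts, while the perturbation built from the averaged gradient fools it from both views with the same per-pixel budget.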

These are the kinds of algorithms that already exist on our phones and computers, making photo libraries searchable and making it easy to tag friends in images online. When asked what it is doing to combat adversarial examples, Google pointed out that its researchers are already working on the problem and that the company is running a competition, held at the machine learning conference NIPS this week, to create an image classification algorithm that isn’t fooled by adversarial examples.

This 3D-printed baseball looks like an espresso to a computer. [Image: Anish Athalye]

The students’ work brings what was once considered a theoretical concern into the real world. “The stakes are high,” says Athalye, who has a background in computer security. “We’re not working on breaking real systems yet, but we’ve gotten a lot closer than people thought was possible.”

And Athalye and his colleagues are not the only ones. A team of academics from the University of Washington, the University of Michigan, Stony Brook University, and the University of California, Berkeley, was able to print out stickers and attach them to stop signs in such a way that image classification neural networks identified them as something else. These tiny changes to the stop sign might look like graffiti to a driver (or passenger), but an autonomous car would see a yield sign or a speed limit sign. Besides just disrupting traffic, this could be dangerous: If a car doesn’t see a stop sign and runs right through an intersection, it could run into another vehicle and put people’s lives at risk.

“[Adversarial examples] should be a real concern in practical systems,” Athalye says. “If we can do this, so can the bad guys.”


Part of the problem is that researchers don’t fully understand why adversarial examples occur–even though many are able to create examples of their own. And without a deep comprehension of the phenomenon, it has been difficult to craft defenses so that image classifier neural nets aren’t susceptible to machine learning’s most boggling quirk.

That doesn’t mean researchers aren’t trying. According to Bo Li, a postdoctoral researcher at UC Berkeley who worked on making the stickers that change how algorithms see street signs, there have been upwards of 60 papers dedicated to finding defenses against adversarial examples in all different contexts.

Some are optimistic that eventually researchers will be able to find a solution and a way to guard against this kind of vulnerability. Li, for one, remains positive that security researchers will be able to defend against specific threats with specific software solutions–likely not a catch-all that will protect against all attacks, period, but defenses that will prevent particular types of threats.

And Nicolas Papernot, a computer science grad student at Pennsylvania State University, points out that researchers are starting to find solutions, however limited. “I’m extremely optimistic that we can make progress and eventually achieve a form of robust machine learning,” he tells me in an email. “The security and machine learning communities have engaged in a very productive exchange of ideas. For instance, this year three different groups of researchers reported significant progress on three key tasks for benchmarking vision models: handwritten digit recognition, street view house numbers recognition, and the classification of color images of objects and animals.”

Others aren’t so sure. Nicholas Carlini, a PhD student in computer security at UC Berkeley, has made it his mission to go around breaking other academics’ adversarial example defenses. “I’m a bit of a bad guy,” he says. “I say sorry, no, it doesn’t work. As an attacker, I can modify the wavelengths to do my attack based on your defense.”


[Image: Anish Athalye]

Carlini has published several papers on his attacks to date–including one from May 2017 in which he dismantles 10 different proposed defenses–and is working on more. He views his work as a way of killing off lines of research that he believes aren’t going to bring the field forward in any way and refocusing research firepower on the more promising methods. One such method that works in limited situations forces the attacker to distort the adversarial example photo so much that it’s plainly noticeable to the human eye. Still, the few effective ways of defending against adversarial examples are built to fend off specific, limited attacks–and won’t translate across data sets.

There is a bright side, at least according to Carlini. “In the general security community, we had this problem where systems were being used for 20 years before people started really trying to attack them, and we realized they were broken way too late, and we’re stuck with the problems,” he says. “It’s exciting that before these things are actually being deployed, people are working on it now because it means we have a hope of solving things, at least a little bit, or at least understanding the problem before we put them into practice.”

Even the 3D-printed turtle only works because Athalye and his colleagues had access to the internal mechanisms of the neural net they were able to fool–a “white box” situation. It remains to be seen whether researchers will be able to create physical adversarial examples without that knowledge, a so-called “black box” scenario. That’s the situation that real-world attackers would be in, because TSA certainly wouldn’t be publishing the code to whatever neural networks it might use to scan luggage in the future. Athalye says that figuring out how to make adversarial examples without looking at the image classifier’s code is his next goal.
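The difference between the two settings can be illustrated with a toy Python sketch of a purely digital black-box attack. It assumes, hypothetically, that the attacker can submit inputs and read back a numeric score but never sees the model's weights, so the gradient has to be estimated from queries alone. The hidden scorer, the four-pixel input, and the step sizes are all invented, and nothing here addresses the harder physical-object case the article describes.

```python
def make_black_box():
    """Return a query function; its weights are hidden from the attacker."""
    hidden_weights = [0.9, -0.7, 0.8, -0.6]  # made-up model internals

    def query(pixels):
        # Positive score means the (hypothetical) scanner flags the item.
        return sum(w * p for w, p in zip(hidden_weights, pixels))

    return query


def estimate_gradient(query, pixels, step=1e-3):
    """Approximate the gradient by nudging one pixel at a time and re-querying."""
    base = query(pixels)
    gradient = []
    for i in range(len(pixels)):
        probe = list(pixels)
        probe[i] += step
        gradient.append((query(probe) - base) / step)
    return gradient


def black_box_attack(query, pixels, epsilon):
    """One sign step against the estimated gradient, clamped to [0, 1]."""
    gradient = estimate_gradient(query, pixels)
    return [
        min(1.0, max(0.0, p - epsilon * ((g > 0) - (g < 0))))
        for p, g in zip(pixels, gradient)
    ]


query = make_black_box()
image = [0.5, 0.5, 0.5, 0.5]
adversarial = black_box_attack(query, image, epsilon=0.1)

print("score before:", query(image))
print("score after: ", query(adversarial))
```

The cost is in the number of queries: this sketch needs one extra query per pixel, which is trivial for four pixels but becomes a practical obstacle for full-size images, and real systems may expose only a label rather than a score.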

Despite how scary this might seem, Athalye doesn’t believe we need to worry–not yet, at least. “There’s no reason to panic. I think our work was timely and it has the potential to harm systems,” he says. “But we haven’t crashed somebody’s Tesla.”

About the author

Katharine Schwab is the deputy editor of Fast Company's technology section. Follow her on Twitter @kschwabable.