Microsoft Software Recognizes Images Better Than Humans Do

And it’s a foundational step in the next wave of interaction.


It’s not the most gorgeous photography you’ve ever seen. ImageNet features 1.2 million pictures of mundane items–a photocopier shoved in the corner of an office, a bowl of oatmeal on a table, a pile of logs, a giant sign shaped like an ear of corn, an elbow. But ImageNet is important: It’s the central collection of images scientists around the world use to teach their software image recognition, and then test it, too.


Every year, algorithms get better at identifying what’s in these images. But Microsoft Research has just announced a major milestone: Its software was able to identify the contents of 100,000 test images in ImageNet with a 4.94% error rate, while humans have scored a 5.1% error rate in the same test in the past. In other words, Microsoft hasn’t just beaten every competitor in the industry; it has beaten humans at their own game.

“That is the current best [result] I have heard about,” confirms Alex Berg, an assistant professor at UNC Chapel Hill who helps manage the ImageNet set–though he pointed to Baidu, with its 5.33% error rate, as getting very close to Microsoft Research’s milestone, and potentially reaching theoretical peaks in the test itself. “There is some noise and ambiguity in the dataset, and so further small improvements in accuracy may not be meaningful.”

The advantage Microsoft’s system has over humans comes down largely to what the researchers call “fine grain” material, like distinguishing 120 different breeds of dogs. But error rates and theoretical peaks aside, the real takeaway here is that software is getting extremely good at recognizing what everyday things actually are with an incredible amount of specificity. And this is a key development when it comes to the future of interface.

As digital glasses like the Microsoft HoloLens and Magic Leap make their way to market, they’ll lean largely on the promise of augmenting our reality–adding interface and information to all of the mundane objects around us. And there are really two ways that the systems can do this without adding RFID broadcast chips to every box of cereal on the grocery store shelf.

The first is geolocation. The HoloLens patent application describes building a cloud-connected map of the entire world. So if you, say, walk through a park, every tree will be indexed and tagged in the map’s database, and the glasses can then deliver relevant information on the fly as you pass by any point.

The second is image recognition–the same sort of technology Facebook uses to tag the faces of your friends. In this scenario, if you looked at a stop sign with your augmented reality glasses on, your glasses would simply know it’s a stop sign, just like a human would, through their own visual logic.


No doubt, future augmented reality systems will use a combination of these two technologies–cross-referencing each other for accuracy–but image recognition is so important because there are moments in our lives that the Googles of the world will never be able to pre-map and index. Say you’re making guacamole in your messy home kitchen. Image recognition could spot your cutting board, your knife, and the jalapeños, avocados, and cilantro. Then, if you had no idea what you were doing, augmented reality software could guide you through the process of making guacamole–maybe even adding cut lines to the produce so you diced and deseeded appropriately.

Designers could imagine dozens of ways to walk someone through this, but the process only works if a pair of smart glasses can understand that your avocado just rolled on the floor, and that you’re actually holding the knife the wrong way in the first place.

Home automation, too, can benefit wildly from object recognition. Security webcams are already being used to track movements and recognize faces in a home, but imagine if, say, Microsoft’s Kinect camera were outfitted with an algorithm that could identify every object in your living room. You could ask, “Xbox, where did I put my keys?” and the Xbox could scan the room and tell you. (More creepily, the Xbox might watch and catalog where you move every object in your home at all times, so it knew the answer before you asked the question.)

On that note, Berg cautions that systems like Microsoft’s human-beating image analysis platform still have some kinks to work out. While these pieces of software can correctly identify a toilet as well as a human can, they’re still not always great at providing spatial context and actually calling out where that toilet is in the image. “Although folks are making really fast progress there, too!” he says.

But again, these are moments of user experience magic entirely dependent on better image recognition software. We can dream up all of the sci-fi scenarios that we like, but none matter until the foundational technologies are in place to actualize them. And this is what makes Microsoft’s achievement so significant.

About the author

Mark Wilson is a senior writer at Fast Company who has written about design, technology, and culture for almost 15 years. His work has appeared at Gizmodo, Kotaku, PopMech, PopSci, Esquire, American Photo, and Lucky Peach.