Smart as they are, computers are still as blind as a bat. It’s why search engines index the web using text and why you still have to fill out those annoying captchas. But with advances in machine learning and image recognition, computer vision is slowly getting to the point where it will be useful to us.
Flickr flexed its computer vision muscles recently with the launch of Park or Bird? The one-page web app was built in response to an xkcd comic poking fun at the limitations of computers when it comes to understanding the content of images. It allows people to upload a photo and automatically determines whether the image was taken in a national park (using location metadata) or contains a bird (using Flickr’s computer vision).
The hack itself wasn’t anything more than a fun response to an Internet comic, but it offered a taste of some impressive technology that Flickr is working on internally. And it’s not just R&D: Computer vision has found its way into Flickr’s product roadmap, and all of us will soon encounter it more often, whether we use Flickr or not.
Flickr’s image recognition technology uses a type of neural network called a deep convolutional neural network. Google is also investing in this type of deep learning technique, and has acquired at least two companies that specialize in it (Jetpac and DNNResearch) in order to improve the image recognition capabilities of its photo app.
“These methods have evolved rapidly over the past few years, thanks to some key algorithmic improvements and the availability of more powerful computing infrastructures,” says Simon Osindero, an AI architect in the Flickr Vision and Machine Learning group at the Yahoo-owned company. “They currently work particularly well for object, scene, and attribute recognition in photos.”
Having parsed millions of images, Flickr’s deep learning algorithm has learned to recognize 1,000 different objects in images. It does this by passing them through a series of layers, each of which transforms the original image and performs progressively more complex computations on it.
As the team explained in a blog post:
As the image passes through these layers, they are ‘activated’ in different ways depending on the features they’ve seen in the input image, and at the top of this network–after the image is transformed by the bottom layer, and that transformation of the image is transformed by the next layer, and that transformation of the transformation of the image is transformed by the next layer, and so on–a short floating-point vector summarizing all of the various activations at each layer is output. We pass this floating-point vector into more than 1,000 binary classifiers, each of which is trained to give us a yes/no answer to identify a specific object/scene class.
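The last stage described in that quote can be sketched in a few lines of Python. This is a hypothetical illustration, not Flickr's actual code: the feature vector stands in for the network's output, and the classifier weights are random placeholders where trained parameters would go.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the short floating-point vector the network outputs
# after its stack of layer-by-layer transformations.
feature_vector = rng.standard_normal(128)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One binary classifier per object/scene class. In practice each
# (weights, bias) pair would be trained; here they are random stand-ins.
classes = ["bird", "beach", "sunset"]
classifiers = {c: (rng.standard_normal(128), rng.standard_normal())
               for c in classes}

def classify(features, threshold=0.5):
    """Return a yes/no answer for each class, as the blog post describes."""
    results = {}
    for name, (w, b) in classifiers.items():
        score = sigmoid(w @ features + b)  # logistic score in [0, 1]
        results[name] = bool(score >= threshold)
    return results

print(classify(feature_vector))
```

Each classifier is independent, which is why the system can answer "is there a bird?" and "is this a beach?" for the same photo without the answers competing with one another.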
This wizardry relies largely on proprietary technology developed internally at Flickr, but it also uses some open source tools, such as a deep learning framework from UC Berkeley called Caffe. It’s a project that Osindero says has come very much in handy, and his team is returning the favor by contributing to the Caffe code base.
“CUDA and GPU computing in general are also invaluable, allowing us to reduce model-training times by one or two orders of magnitude,” he says.
So why does this matter? For Flickr, the ability to recognize objects in images is hugely valuable. The company already incorporates this technology in its own photo search, and the system can only get smarter from here. For a company that offers 1 terabyte of free storage and aims to be users’ “camera roll in the cloud,” giving people the option to easily search through their rapidly proliferating images is a pretty big perk.
Beyond photos, there are scores of use cases for things like facial recognition and machine vision, some of them more benign than others. But the photo-sorting problem alone is significant enough to warrant the attention of Flickr’s engineers. As our personal collections of photos explode (a trend that isn’t likely to decelerate anytime soon), keeping them organized and accessible will only get harder. Indeed, this is the entire premise behind Apple alum Tim Bucher’s new startup Lyve.
“Flickr is pioneering this effort with a team focused on vision and machine learning,” says Osindero. “We are still learning how to balance computer knowledge with human annotations in the best way. It’s a challenging problem, but we are sure the ability to manage photos will greatly improve over time.”