Every day we make complex inferences based on our surroundings. Is that a safe street to walk down? Is the nearest McDonald’s to the left? We use a contextual understanding of, and judgments about, our environment to look beyond merely the "visual scene" and decide what stores and services we expect to find nearby, and even the likely economic climate of the neighborhood.
Now a computer can do the same thing by simply looking at a picture from Google Street View.
A deep learning project by researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) fed 8 million images from Google Street View into an algorithm. The result is a computer that, by looking at a single image, can predict both the distance to the nearest McDonald’s and the crime rate of the area.
This represents a change in the way we should think about image recognition. "A lot of the existing computer vision research to date has focused on what’s inside an image—for example, does a particular image contain a cat or part of a face?" says Aditya Khosla, a fourth-year computer science PhD student who worked on the project. "We wanted to look at what we can learn from the image through inferences."
The study started with Khosla and his three colleagues picking eight cities from around the world: Boston, Chicago, Hong Kong, London, Los Angeles, New York, Paris, and San Francisco. Each of these cities was divided into a series of locations roughly 16 meters apart. For each point, four images were taken, facing north, south, east, and west.
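As a rough illustration of that sampling step, the sketch below lays a grid of points about 16 meters apart over a bounding box and pairs each point with the four compass headings. It is not the researchers' code: the function names, the dictionary fields, the spherical-Earth approximation, and the example bounding box are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)
HEADINGS = {"north": 0, "east": 90, "south": 180, "west": 270}


def sample_points(lat_min, lat_max, lon_min, lon_max, spacing_m=16.0):
    """Yield (lat, lon) grid points with about spacing_m between neighbours."""
    # On a sphere, one metre north-south is the same angle everywhere.
    dlat = math.degrees(spacing_m / EARTH_RADIUS_M)
    # One metre east-west is a larger angle at higher latitudes,
    # so scale by the cosine of the box's middle latitude.
    mid_lat = math.radians((lat_min + lat_max) / 2)
    dlon = math.degrees(spacing_m / (EARTH_RADIUS_M * math.cos(mid_lat)))
    rows = int((lat_max - lat_min) / dlat) + 1
    cols = int((lon_max - lon_min) / dlon) + 1
    for i in range(rows):
        for j in range(cols):
            yield (lat_min + i * dlat, lon_min + j * dlon)


def image_requests(points):
    """Yield one hypothetical image request per point and compass direction."""
    for lat, lon in points:
        for name, heading in HEADINGS.items():
            yield {"lat": lat, "lon": lon,
                   "direction": name, "heading_deg": heading}


# Example: a small, made-up box in central Boston.
points = list(sample_points(42.3600, 42.3610, -71.0600, -71.0590))
requests = list(image_requests(points))
```

Every sampled point produces four requests, which is how a dense grid over eight cities can add up to millions of images.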
Next the team obtained the location of establishments of interest using Google Places. Despite having a range of possibilities to choose from, they settled on McDonald’s restaurants as their reference point of choice—largely because McDonald’s were found in all eight of the cities they had chosen.
"We wanted something that would be found everywhere but would also be slightly tough to guess the location of," says Joseph Lim, a fellow PhD student who also worked on the project. "At one point we considered Nike stores, but these often tend to be located in the shopping mall, which is typically in the center of a city. We wanted an added level of complexity."
Aggregated crime data, meanwhile, was gathered from organizations like San Francisco CrimeSpotting. This allowed the construction of crime density maps, which could be used for training. Of the 8 million image samples from Google Street View, half were used for training the algorithm, and the other half for testing it.
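One simple way to picture a crime density map and the half-and-half split is sketched below. The article does not say how the team binned incidents or divided the images, so the grid-cell counting, the cell size, the random shuffle, and the function names here are assumptions, and the coordinates are made up.

```python
import math
import random
from collections import Counter


def crime_density(incidents, cell_deg=0.001):
    """Count incidents per grid cell; keys are (row, col) cell indices."""
    counts = Counter()
    for lat, lon in incidents:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


def split_half(samples, seed=0):
    """Shuffle the samples and return (training half, testing half)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Made-up incident coordinates; the first two fall in the same cell.
incidents = [(37.7749, -122.4194), (37.7745, -122.4196), (37.7801, -122.4105)]
density = crime_density(incidents)

train, test = split_half(range(10))
```

Each image can then be labeled with the count of the cell it was taken in, giving the algorithm a target to learn from.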
Results have proven impressive. Using deep learning tools, the team created an algorithm that recognized what it was looking at and could use this to draw conclusions. While humans proved better at navigating to their nearest McDonald's in the fewest possible steps, the algorithm consistently outperformed people when shown two photos and asked which scene would take them closer to a Big Mac.
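The two-photo comparison can be scored in a straightforward way, sketched below under stated assumptions. Here `predict_distance` is a hypothetical stand-in for whatever the trained model outputs, and the image names and distances are invented for the example.

```python
def pairwise_accuracy(pairs, predict_distance):
    """Fraction of photo pairs where the predicted closer scene is truly closer.

    Each pair is ((image_a, true_distance_a), (image_b, true_distance_b)).
    """
    correct = 0
    for (img_a, true_a), (img_b, true_b) in pairs:
        guess_a_closer = predict_distance(img_a) < predict_distance(img_b)
        if guess_a_closer == (true_a < true_b):
            correct += 1
    return correct / len(pairs)


# Toy predictor: a lookup table standing in for a trained model.
toy_predictions = {"a": 120, "b": 400, "c": 700, "d": 250, "e": 500, "f": 340}

pairs = [
    (("a", 100), ("b", 500)),  # predictor agrees with the truth
    (("c", 900), ("d", 200)),  # predictor agrees with the truth
    (("e", 300), ("f", 350)),  # predictor picks the wrong scene
]
accuracy = pairwise_accuracy(pairs, toy_predictions.get)
```

The same function could score human answers by passing in a person's guesses instead of a model's, which makes the two directly comparable.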
A demo of the human vs. algorithm experiment can be seen here.
"The opacity of the algorithm means that it’s hard for me to know exactly what the high-level descriptors are which suggest a McDonald’s is nearby," Khosla says. "An example might be the ratio or number of taxicabs, though, which suggests that you are in a highly populated commercial part of a city, or if the algorithm detects an ocean in the image, which means we are likely on the outskirts of the city."
However, Khosla admits that the project was more about kick-starting research than creating an optimized algorithm. "It’s a complex task for machine learning because of the abstraction involved," he says. "What we’re trying to do is show that studying images should be about more than just analyzing what is visible. If the goal of artificial intelligence is to build machines that can mimic human intelligence, this level of abstraction is the obvious next step."
The team also has some ideas about where research could go next—and how this could be scaled into a real-world project.
"I can see one useful application being to town planners," says Lim. "It may be, for example, that instead of approximating how close you are to a McDonald’s restaurant it would be possible to pinpoint the public services people would expect to find in a location—but which may not exist. You could work out where a school or hospital would be most beneficial, and then build it there."
Another possible application relates to context-aware maps, which could be of interest to companies like Google and Apple. Rather than simply confirming the physical artifacts around you, this map could fill users in on the high-level subtleties about the place they’re traveling through. "If you’re driving around, it may be useful to be made aware of the likelihood of crime in a particular area," Khosla says. "Since all of the Google Street View data is available, it would be possible to make those maps crime-aware. If you’re plotting a route you might want to avoid areas above a certain crime threshold."
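To make the route-planning idea concrete, here is a minimal sketch that finds a shortest path across a grid while skipping any cell whose crime score is above a threshold. The grid, the scores, the function name, and the choice of breadth-first search are all assumptions for illustration; no such mapping feature is described in the research.

```python
from collections import deque


def route_avoiding_crime(crime, start, goal, threshold):
    """Breadth-first search over a grid of crime scores.

    Cells scoring above threshold are never entered. Returns the list of
    (row, col) cells from start to goal, or None if no such route exists.
    """
    rows, cols = len(crime), len(crime[0])
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and nxt not in previous and crime[nr][nc] <= threshold:
                previous[nxt] = cell
                queue.append(nxt)
    return None


# Made-up scores: the middle column is a high-crime strip.
crime = [
    [1, 9, 1],
    [1, 9, 1],
    [1, 1, 1],
]
detour = route_avoiding_crime(crime, (0, 0), (0, 2), threshold=5)
direct = route_avoiding_crime(crime, (0, 0), (0, 2), threshold=10)
```

With the lower threshold the route bends around the high-scoring strip, and with the higher one it cuts straight through, which is the trade-off a driver would be choosing.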
Because the algorithm is able to take findings from one place and apply them to another, it would even be possible to draw conclusions about parts of the world that do not routinely publish crime or other statistics.
Ultimately, moving this research forward depends on data. "I think we could apply this to everything from property prices to the political inclination of an area," Khosla says. "What is needed is more data. Because of the current gaps in our data, some of these things are much harder to verify than others. The more data you could build into our model, the more accurate it would be."