Imagine a tourist and a guide exchanging texts as the visitor tries to find the Manhattan coffee shop where she’s meeting a friend. Text by text, she mentions businesses or landmarks she spots, and after each, the guide instructs her to turn right, left, or go straight. Finally, she arrives.
In the age of Google Maps on our smartphones, this seems unnecessary, but if you’re trying to build an artificial intelligence system that uses natural language communication to help us solve everyday problems, this is precisely what you would build to help advance the state of the art.
That’s why Facebook’s artificial intelligence scientists built Talk the Walk, a research project that aims to teach AI systems to communicate using natural language in much the same way a baby does—by naming what it sees.
Talk the Walk tasks two AI agents, a tourist and a guide, with navigating to a location in Manhattan by conversing: the tourist bot explains what it “sees,” and the guide bot responds with navigational instructions. The system can also parse what the tourist is saying even when it mixes in the colloquial language a human might use. That matters because, as Facebook writes in a blog post on the project, “A series of carefully scripted responses isn’t likely to capture the nuanced inaccuracies and muddled messaging inherent to genuine, person-to-person conversations.”
Facebook believes this could be a more efficient way to teach AI systems to communicate effectively than training them on pure-text data sets. And in its experiments, the company’s AI research team found that its guide bot was more accurate than humans performing the same navigation task.
Facebook, like Google, Microsoft, Apple, and other big tech companies, has a strong interest in developing AI systems that communicate well so that users can use their voices or type in casual language to the bots. But even as it’s working on systems that can understand natural language, Facebook has also built bots that have created their own language.
On the town, with a “novel attention mechanism”
The Talk the Walk project was one of the first to work with 360-degree visual information, says Douwe Kiela, a scientist on the Facebook AI Research team in New York. The tourist bot utilizes 360-degree imagery taken by the researchers in five Manhattan neighborhoods, which represents the real world of the street an actual tourist would see, while the guide bot uses a standard 2D map with generic waypoints (“bank,” “coffee shop,” “deli”) to deliver navigation instructions.
Facebook built what it calls a “novel attention mechanism” known as Masked Attention for Spatial Convolution, or MASC, that is designed to help the guide bot zero in on the place on the map that the tourist bot is describing. In the experiments, MASC was twice as accurate as a non-natural language communication approach.
“This task is very important for AI research because it’s very hard,” Kiela says, “and because it combines all these interesting problems: 360-degree visual perception, map-based navigation, visual reasoning, and natural language communications via dialogue.”
Kiela also says Facebook’s project is the first to combine all these challenges in a single task. As to why no one has tried it before, he says that machine learning has only recently progressed to the point where it can effectively solve this type of natural language problem. As is the case with much of its AI research, Facebook plans on open-sourcing the data set underlying the project in the hope that the artificial intelligence community can further advance the work.
Part of the work, Kiela explains, involved building new algorithms that relate what the tourist bot says to what the guide bot sees on its map. The idea was to teach the guide to focus its attention on the map based on whatever the tourist says, and ultimately, zero in on the area of the map in which it thinks the tourist is located.
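As a rough illustration of that idea, the Python sketch below scores each cell of a tiny map by how many of the landmarks mentioned by the tourist it contains, then turns those scores into attention weights with a softmax. It is a simplified stand-in, not Facebook’s MASC: the 3x3 grid, the landmark placement, the `sharpness` parameter, and the function names are all assumptions made for this example.

```python
import math

# Hypothetical 3x3 map: each cell holds the landmark types on that corner.
GRID = [
    [{"bank"}, {"deli"}, {"coffee shop"}],
    [{"deli"}, {"bank", "coffee shop"}, set()],
    [{"coffee shop"}, {"deli"}, {"bank"}],
]


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(grid, mentioned, sharpness=4.0):
    """Return one attention weight per map cell.

    A cell scores higher the more of the mentioned landmarks it contains.
    `sharpness` (an assumed knob) controls how peaked the weights are.
    """
    cells = [cell for row in grid for cell in row]
    scores = [sharpness * len(cell & mentioned) for cell in cells]
    weights = softmax(scores)
    width = len(grid[0])
    return [weights[i:i + width] for i in range(0, len(weights), width)]


def best_guess(attention):
    """Return the (row, col) of the cell with the highest weight."""
    flat = [
        (weight, r, c)
        for r, row in enumerate(attention)
        for c, weight in enumerate(row)
    ]
    _, r, c = max(flat)
    return r, c


# The tourist says it sees a bank and a coffee shop.
attention = attend(GRID, {"bank", "coffee shop"})
print(best_guess(attention))  # (1, 1)
```

With both a bank and a coffee shop mentioned, most of the weight lands on the one corner that has both, which is the sense in which the guide “zeroes in” on a location.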
For now, the work is only a research project, but over time, it seems certain that Facebook wants to integrate this kind of functionality into its virtual assistants, and that it expects other large tech companies to do the same.
Of course, the online and mobile map wars are intense and heating up, with Apple and Google going head-to-head for users’ navigational eyeballs, buying up new tech, and dominating users’ phones.
Now that it’s ready to show the world its work, Facebook plans to keep refining the task and its data set in an effort to further combine all the different, complicated components that go into building a human-comprehending robot. “This line of research,” Kiela argues, “is the future of AI research. It’s essential to understand [these components] if we ever want to get to the meaning of language.”
Adds Kiela, “If you want AI to be effective, it needs to better understand what humans need, and it better have some understanding of the world that humans perceive.”