Whether you use OK Google, Siri, Alexa, or Cortana, you reach a point, sooner or later, when your voice assistant doesn’t know what to do.
Sometimes, that’s because you’ve phrased your command in an unexpected way: “Alexa, I could really go for some Van Halen right around now,” instead of “Alexa, play Van Halen,” for example. Other times, it’s because you’ve asked the computer to do something it just doesn’t have the capability to do. Either way, the user ends up going away thwarted, and the computer doesn’t learn anything. For both user and computer, it’s just a fail state, not a learning experience.
As design lead for all of Google’s search products, Hector Ouilhet looks to his three-year-old daughter, Anna Julia, for inspiration on how Google can help solve this kind of problem.
Hector and Anna Julia don’t always understand each other—Anna Julia is at an age when she likes to make up new words, and even a gesture as simple as pointing at the floor can have four or five different contextual meanings. Likewise, Anna Julia doesn’t always understand what Hector expects of her. But Hector and Anna Julia figure out how to adapt to each other’s needs.
The question that fills Hector’s days is simple: How can Google be more like Anna Julia? Not just in the way Google responds to a user’s request but in training users to view Google as an intelligent, learning, evolving entity. Just like a three-year-old.
What makes Google different from Anna Julia? There are a million answers to that question, but they all boil down to the fact that Anna Julia is a little girl and Google is . . . what exactly?
“When you’re talking to a 3-year-old, or even a 90-year-old, you have a set of expectations about what he or she can do,” Ouilhet says. In other words, we have a mental model of people’s capabilities that is always evolving, according to what capabilities they (or people like them) have already demonstrated.
But what is Google? It’s everything and nothing. It’s a company, an operating system, and a bundle of algorithms in a bleeping black box, all at once. It does a million things yet comes with no instructions. How can users build a mental model of something as big and as powerful and as random as Google so that they know how to interact with it?
Ouilhet is convinced that the answer is for Google to act more like a person. That’s why, during his tenure in Mountain View, Ouilhet has been working hard on pushing Google’s zero UI initiatives, such as the OK Google voice assistant. It’s not an issue of trying to anthropomorphize Google, or—god forbid—turn it into an AI. “It’s about helping people wrap their head around, what is this thing I’m talking to?” says Ouilhet.
Google isn’t alone in chasing voice as a solution to the problem of how we communicate with computers. Apple has Siri, Microsoft has Cortana, and Amazon has Alexa. But while voice assistants are theoretically a more natural way of interacting with computers than taps or mouse clicks, Ouilhet says they also introduce a lot of friction and cognitive dissonance in users’ minds. Voice assistants sound like people, but they don’t act like people.
Let’s say you have an Amazon Echo. You can tell your Echo to do a lot of things: play jazz, set a timer, dim your smart lights, add an item to your Amazon wish list, or even find a word that rhymes with potato. But ask Alexa to, say, turn off your oven or call your mother, and it’ll just fall over. It’ll do the same thing if it doesn’t quite hear you, or you don’t voice your command exactly the way it expects, because computers, unlike people, suck at coping with ambiguity.
This isn’t just what happens with the Amazon Echo. It’s a problem with all voice assistants, including Google’s. So after some initial experimentation, people using voice assistants tend to stick to a few things they know work and never try anything else. But all of these products, at the end of the day, are capable of evolving. They’re all driven by data and algorithms. The more data a company such as Google has on how someone is trying to use a voice assistant, the more quickly Google can adapt to meet a user’s expectations.
Ouilhet looks at his relationship with Anna Julia for inspiration on how to get over the roadblock of perception for voice UIs.
Like Google’s own voice assistants, Anna Julia doesn’t always understand her father’s requests, or know how to go about completing them. Sometimes, she’s daydreaming when he wants her to do something, so she misses what he says. But Anna Julia responds very differently when she can’t do something her dad wants compared with, say, an Amazon Echo.
Let’s say Ouilhet is cooking dinner, and he asks Anna Julia to set the table. She’s only three, so she doesn’t know how to do it yet, but Anna Julia would never just shake her head and say so. She’s too eager to please. Plus, it’s not as if she understands nothing of what Ouilhet asked her to do. She knows what a table is; she just doesn’t know what “set” means in this context. So she might reply: “Daddy, do you mean you want me to sit down for dinner?”
There are three things going on here. One, Anna Julia is capable of atomizing her father’s request to set the table into discrete parts, including the parts she understands (“table”) and doesn’t understand (“set”). Two, she understands context: Dinner is being prepared, and Ouilhet usually asks her to sit down at the table around this time. Three, Anna Julia isn’t sure whether she misheard the word “set” or just doesn’t understand it, so she says so. But at the same time, she’s eager to please, so she makes her best guess as to what Ouilhet wants anyway.
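For readers who think in code, those three moves can be sketched in a few lines of Python. This is a toy illustration only, not anything Google or Ouilhet has described: the vocabulary, the context table, and the names `respond` and `phrase` are all hypothetical, and a real assistant would be vastly more complicated.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical vocabulary: the words this toy listener already understands.
KNOWN_WORDS = {"table", "sit", "dinner", "play", "music"}

# Hypothetical context table: what is usually asked in a given situation.
USUAL_REQUESTS = {"dinner is being prepared": "sit down at the table"}

# Small words that carry no meaning of their own in this sketch.
FILLER = {"the", "a", "an", "please", "to"}


@dataclass
class Reply:
    understood: List[str]   # parts of the request that were recognized
    unknown: List[str]      # parts that were not
    guess: Optional[str]    # best guess drawn from context, if any


def respond(request: str, context: str) -> Reply:
    # One: break the request into discrete parts.
    words = [w.strip(".,!?").lower() for w in request.split()]
    content = [w for w in words if w and w not in FILLER]
    understood = [w for w in content if w in KNOWN_WORDS]
    unknown = [w for w in content if w not in KNOWN_WORDS]
    # Two: use context to guess at the meaning, but only if something is unclear.
    guess = USUAL_REQUESTS.get(context) if unknown else None
    return Reply(understood, unknown, guess)


def phrase(reply: Reply) -> str:
    # Three: admit what was not understood, and offer the best guess anyway.
    if not reply.unknown:
        return "OK."
    admit = f"I'm not sure what '{reply.unknown[0]}' means."
    if reply.guess:
        return f"{admit} Do you mean you want me to {reply.guess}?"
    return f"{admit} Can you show me?"


reply = respond("Set the table", "dinner is being prepared")
print(phrase(reply))
```

The point of the sketch is the shape of the answer rather than the lookup tables: instead of a flat failure, the listener reports what it understood, what it did not, and a guess the other person can confirm or correct.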
The end result is that even if Anna Julia turns out to not be able to do what her dad wants, she still adds information to her mental model of her father. This makes it more likely she’ll be able to do what he wants her to do in the future if he asks her to set a table. And Ouilhet himself also gets a better understanding of Anna Julia in a way that encourages him to keep trying—she might not understand setting the table today, but tomorrow, she might. Because this isn’t just about a single request. It’s about building an ongoing relationship.
Helped by improved voice transcription technology, Google is getting better every day at understanding its users’ requests. Knowledge Graph is helping Google not just match search queries with search results but actually understand semantics: what things are and what they can do. Google Now and Now on Tap, meanwhile, are all about understanding context, building up Google’s mental model of individual users so that it can custom-tailor its results. And although this is the sort of thing that gives privacy advocates headaches, Google is getting better every day at remembering what its users have historically done, and using that history as a model for what they’ll do in the future.
But the challenges in voice search and zero UI aren’t really technological. Ouilhet argues that they’re mostly about designing voice UXs that feel less like talking to a computer and more like talking to a person. Or rather, voice UXs that can act like Anna Julia: eager to please, eager to learn, not afraid to make guesses, and striving to learn about us while encouraging us to keep on learning about them.
Ouilhet admits Google isn’t there yet, but he thinks it will be within the next few years. “Humans are great at salvaging meaning from context, by peeling it like an onion,” he says. “Imagine how powerful computers will be when they’re just as good at peeling that onion.”