Voice user interfaces are everywhere today. Nearly every major tech company has poured resources into developing them. Most smartphones are equipped with one. And increasingly, VUIs are fixtures in people’s homes—one industry estimate puts the total number of “voice-first” devices at 33 million.
But it’s still early days for this nascent technology, and despite tech companies’ investments, the role that Siri, Alexa, Google Assistant, Cortana, and now Samsung’s Bixby will ultimately play in most people’s lives remains to be seen. So what do we know about the way people are using VUIs right now? While it will take years to accumulate enough data for researchers to understand how people interact with voice-first devices, several recent studies and surveys provide a glimpse into how people use them–and work around their limitations–today.
What Can They Do? No One Knows
Today, one of the biggest challenges with VUIs is also the simplest: It’s hard to understand what they can and can’t do.
In a study from Microsoft Research U.K., presented at the ACM CHI conference on human-computer interaction last year, two researchers conducted in-depth interviews with 14 people who regularly use voice assistants in their daily lives, in order to assess the gap between user expectation and experience. They found that the majority of complaints stemmed from the fact that people didn’t know what the assistant could actually do–unlike a graphical user interface, where a person can see and explore their options.
The researchers concluded that given how new the technology is, most people don’t have a mental model of how VUIs are supposed to work. There’s also a total lack of transparency about how a VUI does what it does and whether it learns from the user over time, leaving people without any way to calibrate their expectations.
“I felt let down that I didn’t get any feedback from it . . . It has a captive audience it could have just told me,” one participant told the researchers. “Just a few examples of what could be done. The things it can do are so broad that I just feel lost.”
Future Of Shopping? Not So Fast
A significant part of the hype around voice-first devices is that they herald the future of shopping–simply ask Alexa to order batteries or paper towels for you, and it does. But in a recent survey published by the design agency Huge, only 14.3% of the 500 respondents used their smart home device to shop.
Why so few? The top two reasons were the lack of a screen and not knowing how to use the device to shop–8.33% of respondents didn’t even realize that they could.
Of course, Amazon recently released its Echo Show, which does feature a screen. But while the device is still Alexa-enabled, including a screen takes it out of the realm of pure voice UIs and lends further support to the idea that voice-only digital assistants won’t be a significant part of the future of retail. That makes sense, too, since even on Amazon, comparison shopping is still alive and well. How could you trust a computer to know what you care about most when it comes to household goods, clothes, or anything else?
People Use Voice UIs For The Same Things As Their Phones
So what do people do with their voice-first devices? According to a survey of 1,300 early Amazon Echo adopters from the market research firms Creative Strategies and Experian, the most popular use cases are asking the device to play a song, control smart lights, and set a timer. A survey of 1,500 IFTTT users, meanwhile, found that besides those three, checking the weather is among the most common things people do with their voice assistants. In another survey of 518 consumers, the same firms found that people use Siri and Google Voice Search very differently from Alexa: the top three use cases for both of those voice UIs are searching for something on the internet, getting directions somewhere, and calling someone.
Crucially, that means that people use their voice-first devices to do relatively simple tasks they could also do using a graphical interface on their smartphones. That would indicate that voice is an auxiliary technology, not a primary one–perhaps meaning that it doesn’t quite deserve the hype it’s gotten.
Voice UIs Require A Lot Of Work From The User
While some people have wholeheartedly embraced voice as a means of communicating with their technology, others are less engaged. One reason? It takes a lot of work to figure out what to say and how to say it so that the device actually does what you want it to.
“All users engaged in some level of ‘work’ to ensure successful interaction,” write Ewa Luger and Abigail Sellen in the Microsoft study. More intriguingly, the researchers found that the amount of work and time put into understanding the VUI appeared to directly influence the level of user satisfaction. This “work” consisted mostly of altering natural speech: simplifying language or sentence structure, speaking more slowly, and even changing an accent to communicate more clearly with the device. It’s almost like pushing a button, but you have to figure out what the button is first.
The amount of mental work it takes to achieve this isn’t insignificant. A pair of studies from the University of Utah in 2014 that focused on voice-controlled systems in cars found that voice UIs can leave drivers distracted for up to 27 seconds after they finish using them. And since multitasking is one of the most-cited use cases for VUIs, the mental work they require can have outsized effects.
The Human Factor
Since it’s such a new field, designers are trying to establish best practices for how to design voice interfaces. But their ideas aren’t always based on empirical data, so it’s unclear if anthropomorphic VUIs–like Alexa and Siri–will always be the rule. Some research suggests that people sometimes have an aversion to humanness.
“Another important design point is that socialness or humanness is not always preferred,” writes the Stanford professor Clifford Nass. In a study on how people refer to voice-enabled devices–the pronouns they use and whether their phrasing is active or passive–Nass found that consistency was the most important factor in how users perceived the voice UI system, how relaxed they were, and how they behaved. “Designers should pair human-like voices with human-like, personal scripts (after all, most humans use ‘I’ and ‘me’ quite often), while they should combine machine-like voices (i.e., synthesized speech) with less human-like, passive scripts,” he writes.
Humanization of machines isn’t always necessary–and another study, from Cornell, shows that a majority of users don’t even personify devices like the Echo. The study analyzed more than 500 reviews of the Echo on Amazon’s website and found that 73.4% of reviewers referred to the Echo only as an object, while 15.1% referred to it more like a person (and 11.5% used both personal and objective pronouns). Overall, 51% of reviewers didn’t personify the Echo at all, while only 19.5% personified it entirely.
These studies reveal that some of the common wisdom about what people want from voice UIs isn’t necessarily true. Users aren’t really sure how to use them yet, so they default to simple tasks like playing music and setting timers. Shopping with them is still a long way off. The full potential of voice UIs is still to be realized. For now, they’ll remain little more than simplistic, voice-controlled smartphones.