Conversations with Amazon Alexa and Google Assistant are supposed to be personal. In a promo video for the Google Home connected speaker, for instance, a husband and wife ask the omnipresent AI about their respective agendas for the day, and get individualized answers in return.
But that’s not how Home and Amazon’s Echo speaker work in the real world today. When someone speaks, neither company’s virtual assistant can detect who’s talking. Echo requires an extra step of switching profiles to get personalized information, and offers no verification aside from an optional PIN for making Amazon purchases. Google Home doesn’t support multiple profiles at all.
It’s safe to assume Amazon and Google are interested in identifying users by the sound of their voices. A report last week by Time’s Lisa Eadicicco even suggested that Amazon has been developing voice identification for Alexa, though the story gave no timeline and didn’t say that the feature would actually launch. According to the companies that make voice recognition hardware and software, getting these connected speakers to understand who’s talking is trickier than it might seem.
When you talk to the Amazon Echo, it doesn’t simply transmit everything it hears up to the cloud. (Thank goodness.) Instead, the device uses local processing power to pick out the “Alexa” wake phrase and any subsequent commands, which in turn head to Amazon’s servers for interpretation.
Local processing can also perform cleanup duties, using algorithms to reduce background noise, echo, and reverb while making the speaker’s voice more prominent. That way, Amazon has an easier time understanding the wake word and whatever else has been said, even from across the room, with other people talking or a TV playing in the background.
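The two-stage flow described above can be sketched in a few lines of Python. This is purely illustrative: the cleanup step and keyword spotter are trivial stand-ins (a crude energy threshold and a label check), not the real on-device DSP or machine-learning models either company uses, and `handle_utterance` is an invented name.

```python
# Toy sketch of the local-first pipeline: clean the audio on-device,
# gate on the wake word, and only then let anything leave for the cloud.

WAKE_WORD = "alexa"

def clean_locally(frames):
    """Stand-in for on-device noise/echo/reverb reduction:
    drop frames below a crude energy threshold."""
    return [f for f in frames if f["energy"] > 0.1]

def starts_with_wake_word(frames):
    """Stand-in for a local keyword spotter."""
    return bool(frames) and frames[0]["label"] == WAKE_WORD

def handle_utterance(frames, send_to_cloud):
    cleaned = clean_locally(frames)
    if starts_with_wake_word(cleaned):
        send_to_cloud(cleaned)  # interpretation happens server-side
        return True
    return False                # audio never leaves the device
```

The key property the article describes is the last branch: without the wake word, nothing is transmitted at all.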
Here’s the problem: Eliminating noise, echo, and reverb also tends to distort the voice that a device is trying to hear, so identifying individual speakers becomes harder, says Vineet Ganju, Conexant’s vice president of voice and audio. (Conexant provides speech recognition chips and software to hardware makers, and has partnered with Amazon on a development kit for devices with Alexa built in.)
“On one hand, you have the benefit of being able to isolate voice from a noisy environment, so you’re actually able to do something useful with the voice,” Ganju says. “But on the other hand, you are losing some of the characteristics of the voice signal itself, so that makes it a little bit more difficult to do follow-on processing.”
Todd Mozer, CEO of Sensory, also acknowledges that determining who’s talking can be tricky for far-field devices like Echo. Sensory provides voice recognition solutions to device makers, including the ability to identify different users, but he notes that performance degrades as the signal-to-noise ratio drops.
“For speaker ID, the effects of noise and degraded signals from noise processing are pronounced, and the combination of speaker verification, far-field use, and noise processing remains largely unproven in the market,” Mozer says.
The problem of identifying a specific voice isn’t unsolvable, but there are different schools of thought on what the solution should be, along with different sets of associated challenges.
Leonardo Azevedo, NXP’s director of consumer and industrial applications processors, believes device makers could analyze the raw audio separately from the processed version. The raw audio feed would be used to identify the speaker and send that information to the cloud alongside the processed audio. (NXP offers hardware and software to device makers that want to include Amazon’s or Google’s voice assistants.)
“They have the [unaltered] audio input…coming in,” Azevedo says. “If they add the things in their algorithm to be able to [identify the speaker], when they send the command to the cloud after it’s been processed, they could say, ‘Oh, this is Leo,’ or ‘This is Justin,’ and the cloud knows who it is.”
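Azevedo’s dual-path idea can be sketched as follows: speaker identification runs against the raw, undistorted feed, while the cloud receives the cleaned-up audio tagged with the identified user. The matching here is a toy nearest-profile lookup; real systems compare learned voice embeddings, and all of the names below are invented for illustration.

```python
# Sketch of the dual-path architecture: raw audio for speaker ID,
# processed audio for the command itself.

def identify_speaker(raw_features, profiles):
    """Match raw (undistorted) voice features against enrolled profiles."""
    best, best_dist = None, float("inf")
    for name, ref in profiles.items():
        dist = sum((a - b) ** 2 for a, b in zip(raw_features, ref))
        if dist < best_dist:
            best, best_dist = name, dist
    return best

def build_cloud_request(raw_features, processed_audio, profiles):
    # Speaker ID uses the raw path; the command uses the cleaned path,
    # so the cloud is told "this is Leo" alongside the audio.
    return {
        "speaker": identify_speaker(raw_features, profiles),
        "audio": processed_audio,
    }
```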
Still, Azevedo acknowledges that this solution isn’t necessarily easy. Running a separate algorithm to identify the speaker risks slowing down the virtual assistant’s responsiveness. To that end, NXP is working with Amazon and Google to speed up the types of calculations that happen locally, potentially allowing for multiple passes that each isolate different attributes.
Analyzing the raw audio in the cloud is also an option, but that, too, would make responses take longer. Azevedo believes at least some of the speaker identification should happen on the device itself. “The more you can do locally, the better you can do [it] locally, the less time to send it to the cloud,” he says.
Conexant, meanwhile, believes it can solve the problem by improving its own local processing algorithms, and by working with companies like Sensory to account for pre-processing in their speaker identification solutions. Through experimentation, the companies could figure out noise reduction patterns that don’t take away the speaker’s unique characteristics.
“The speaker identification technology is robust to some kinds of changes, and is very sensitive to other kinds of changes,” Ganju says. “So what we do on our side is determine which kind of changes the speaker ID technology’s more robust to, and we’ll focus on those things more aggressively. And we figure out which are the ones they’re more sensitive to, and do those types of things less aggressively, or maybe not at all.”
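Ganju’s tuning strategy amounts to a per-region suppression policy: suppress aggressively where speaker ID is known to be robust, gently or not at all where it is sensitive. The sketch below models that as a lookup table of robustness scores scaling a maximum reduction; the band names, scores, and 20 dB ceiling are all invented for illustration.

```python
# Sketch of robustness-aware noise suppression planning.

ROBUSTNESS = {           # 1.0 = speaker ID unaffected, 0.0 = highly sensitive
    "low_band": 0.9,
    "mid_band": 0.3,     # e.g. a region carrying more vocal identity
    "high_band": 0.8,
}

def suppression_gain(band, max_reduction_db=20.0):
    """Scale suppression by how much speaker ID tolerates in this band;
    unknown bands get no suppression at all."""
    return max_reduction_db * ROBUSTNESS.get(band, 0.0)

def plan_suppression(bands):
    return {band: suppression_gain(band) for band in bands}
```

The point of the table is the asymmetry: the sensitive mid band is cleaned far less aggressively than the robust bands, preserving the characteristics the speaker-ID stage needs.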
Even if the basic recognition challenges get resolved, Amazon and Google would still have work to do. Google, for instance, would have to add support for multiple profiles on its back end, and both companies would need the ability to switch between profiles on the fly.
Sensory’s Todd Mozer points out another obstacle: Users would ultimately have to teach their virtual assistants to understand who’s who. This could make setup more complicated in a product that’s supposed to be relatively frictionless.
“Doing speaker ID in a shared product is a bit more complex because you don’t want to train the verification or adapt to the wrong users, and the training process adds some complexity for multiuser products,” Mozer says.
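Mozer’s two concerns, explicit per-user training and the risk of adapting to the wrong user, can be sketched as an enrollment step plus a confidence gate on adaptation. The similarity measure and 0.85 threshold below are toy values; real systems compare learned voice embeddings, and every name here is hypothetical.

```python
# Sketch of multi-user enrollment with adapt-only-when-confident matching.

ADAPT_THRESHOLD = 0.85   # only adapt a profile when the match is confident

def enroll(profiles, name, features):
    """Explicit one-time training step per user (the setup friction Mozer notes)."""
    profiles[name] = list(features)

def similarity(a, b):
    """Toy similarity in (0, 1]; real systems compare voice embeddings."""
    dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + dist)

def identify_and_maybe_adapt(profiles, features):
    name, score = max(
        ((n, similarity(features, ref)) for n, ref in profiles.items()),
        key=lambda t: t[1],
    )
    if score >= ADAPT_THRESHOLD:
        # Confident match: nudge the stored profile toward this sample.
        profiles[name] = [0.9 * r + 0.1 * f for r, f in zip(profiles[name], features)]
        return name
    return None  # too uncertain: refuse to adapt the wrong user's profile
```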
It does seem like only a matter of time before Google Home and Amazon Echo figure out how to identify different speakers, but the companies may have higher priorities such as language support. Google Home, for instance, only supports U.S. English, while Amazon Alexa only supports U.S. English, U.K. English, and German. NXP’s Leonardo Azevedo believes both companies are pushing to get their virtual assistant hardware into more countries, which would help them ramp up sales but could in turn delay their work on speaker ID.
“When we talk to both Google and Amazon, they both want to do that kind of thing,” Azevedo says of speaker identification. “The question is when they roll it out.”