Amazon announced a large batch of new products on Wednesday, making it clear once again that it wants to spread its Alexa digital assistant into as many consumer tech categories as possible—not just smart speakers, but everything from earbuds to eyeglasses to rings. But there was another storyline woven into the announcements in Seattle. More artificial intelligence, specifically natural language AI, is finding its way into Alexa, and in more ways than one.
For starters, Amazon says it’s been using neural networks to make Alexa’s voice sound more human when it translates text (like your text messages) into speech. Rohit Prasad, who heads up Alexa machine learning and artificial intelligence, told me that this technology has allowed Amazon to take a totally different approach to generating speech.
In the past, Alexa’s algorithms broke down language into word parts or vocal sounds, then tried to string them together as smoothly as possible. But it always sounded somewhat choppy and robotic. Now, Amazon is using neural networks that can generate whole sentences of speech in real time, says Prasad. This creates a vocal sound that’s more fluid and more human-sounding. (Apple’s Siri and Google’s Assistant have also achieved more natural voices recently through similar means.)
It’s this same natural language modeling that will very soon give Alexa completely different voices. Amazon says it will start with celebrities, with Samuel L. Jackson being the first. Amazon will sell Jackson-as-Alexa as an add-on service starting later this year.
Amazon’s Jackson voice is at least partially driven by a natural language model. The model learns from Jackson’s voice—he recorded a bunch of samples in a studio—to generate a voice that mimics his distinctive tone while providing the answers and information the assistant would normally provide. But Amazon also “curated” a set of complete Jackson utterances for the assistant to use when the time is right.
Jackson will likely be just the first of many celebrity voices that Amazon will offer as alternatives to the standard Alexa voice. (Google, meanwhile, let the Google Assistant talk like John Legend early this year, also due to advances in using AI to synthesize voices.)
The talking doorbell
Amazon also added some machine learning tricks to its Ring doorbell cams. In a new service Amazon is calling “Doorbell Concierge,” the devices will soon be able to detect various kinds of people who show up at the front door unannounced. The demo I saw featured three kinds of visitors: a guy delivering a package, a Girl Scout selling cookies, and an unidentified man. The Ring engaged them all in a short dialogue to find out what they wanted, and a neural network in the background used what they said to determine what kind of caller they were. It did this based only on what they said, not on camera imagery. The categorization then informed the Ring device what to say to each one. For instance, it told the delivery guy where to put the package, after asking if he needed a signature. And it asked the unidentified man if he would like to leave his contact information.
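Amazon hasn’t published how Concierge’s classifier works beyond saying a neural network categorizes callers from their words alone. A minimal sketch of that idea, using simple keyword scoring in place of the neural network (all category names, keywords, and canned responses here are hypothetical):

```python
# Sketch: categorize a doorbell caller from what they say, text only.
# Amazon's Concierge uses a neural network; this keyword-overlap stand-in
# only illustrates classifying from the utterance, not the camera image.

CATEGORY_KEYWORDS = {
    "delivery":  {"package", "delivery", "sign", "signature", "drop"},
    "solicitor": {"selling", "cookies", "buy", "donation", "support"},
    "unknown":   set(),  # fallback when nothing matches
}

RESPONSES = {
    "delivery":  "If no signature is needed, please leave it by the door.",
    "solicitor": "Thanks, but we're not interested today.",
    "unknown":   "No one can come to the door right now. "
                 "Would you like to leave your contact information?",
}

def classify_caller(utterance: str) -> str:
    """Pick the category whose keyword set best overlaps the utterance."""
    words = set(utterance.lower().replace("?", "").replace(".", "").split())
    best, best_score = "unknown", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = category, score
    return best

def respond(utterance: str) -> str:
    """Map the caller category to what the doorbell says next."""
    return RESPONSES[classify_caller(utterance)]
```

So `classify_caller("I have a package that needs a signature")` lands in the delivery bucket, while an unrecognized greeting falls through to the contact-information prompt, mirroring the demo’s behavior.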
The new Concierge feature isn’t quite ready for market yet. When it’s released, it will likely recognize only a small set of caller types. But that set will probably grow.
Alexa is listening
Last year, Amazon expanded Alexa’s hearing to detect more than just human commands. As part of its Guard home security mode, the sensitive microphone array used in Echo speakers began listening for the sounds of glass breaking and smoke alarms going off when nobody was in a home. Now Amazon has added the ability to listen for human-related sounds in the home while Guard is set to its “away” mode. These include the sounds of footsteps, coughing, and doors closing when there’s supposed to be no one home. Alexa can send an alert to a user if it detects one of these sounds.
In all these cases, a deep learning model is taking the audio input from the microphones and flagging potentially dangerous sounds. Amazon could train the assistant to listen for many other types of sounds. For example, Alexa devices could begin listening for the sounds of falls or labored breathing in places where elderly people live. Whether Amazon moves in this direction is anybody’s guess, but the fact that the company is steadily adding things that Alexa can listen for is telling.
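The article describes a pipeline where a deep learning model scores incoming audio and Alexa alerts the user on certain sounds while Guard is set to “away.” Here is a sketch of just the alerting step around such a model; the class names and thresholds are hypothetical, and the per-class scores would come from Amazon’s acoustic model, not the code below:

```python
# Sketch: thresholding logic around an acoustic event detector.
# A (hypothetical) model produces per-sound-class probabilities;
# this step decides which sounds warrant an alert in "away" mode.

from typing import Dict, List

# Hypothetical thresholds; rarer, higher-stakes sounds get lower bars.
ALERT_THRESHOLDS = {
    "glass_breaking": 0.80,
    "smoke_alarm":    0.70,
    "footsteps":      0.90,
    "coughing":       0.90,
    "door_closing":   0.85,
}

def sounds_to_alert(scores: Dict[str, float], guard_away: bool) -> List[str]:
    """Return the detected sounds worth alerting on, given model scores.

    Simplification: everything is gated on Guard's "away" mode here,
    though in practice only the human-related sounds are away-only.
    """
    if not guard_away:
        return []
    return [sound for sound, prob in scores.items()
            if prob >= ALERT_THRESHOLDS.get(sound, 1.0)]
```

Adding a new sound to listen for, such as the falls or labored breathing speculated about above, would just mean training the model on a new class and adding a threshold entry, which is why the set of detectable sounds can keep growing.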
A relatively new area in natural language research is using neural networks to detect emotion through words and intonations. Amazon has been focusing on the sound of frustration in the voices of people talking to Alexa. When it detects frustration, Alexa may conclude that it’s given an answer the user didn’t like and then search for another way to answer. Prasad said Amazon has its own set of labeled recordings of people sounding frustrated, which it uses to train the neural networks.
But it’s a hard problem. The assistant has to know how to react after detecting a frustrated person. And if it takes another stab at providing an answer, it had better be fairly certain that the second answer is useful. And there are times when the assistant has to say “Sorry, I don’t have the answer.”
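The decision logic described above can be sketched in a few lines. The frustration score would come from Amazon’s neural network trained on labeled recordings; the thresholds and function shape here are hypothetical, just to show the three outcomes (do nothing, re-answer, or apologize):

```python
# Sketch: the repair policy after frustration detection.
# frustration: model's confidence the user sounded frustrated (0..1).
# alternatives: candidate (answer, confidence) pairs, best first.
# Both thresholds are made up for illustration.

FRUSTRATION_THRESHOLD = 0.7   # how sure we are the user was frustrated
ALTERNATIVE_THRESHOLD = 0.8   # how sure we must be before re-answering

def follow_up(frustration: float, alternatives: list) -> str:
    if frustration < FRUSTRATION_THRESHOLD:
        return ""                          # no repair needed
    for answer, confidence in alternatives:
        if confidence >= ALTERNATIVE_THRESHOLD:
            return answer                  # second stab, only if fairly certain
    return "Sorry, I don't have the answer."
```

The key design point is the second threshold: re-answering with another weak guess risks compounding the frustration, so a low-confidence alternative loses to an outright apology.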
“We are starting to experiment with these different ways of responding, and once this is launched, you will see many different flavors,” Prasad said.
This kind of emotional awareness will likely start showing up in many kinds of assistants. Any assistant should be capable of knowing when it’s done something wrong and be able to open up a feedback loop in order to get better.
The frustration detection feature will likely show up in Alexa next year.