It’s hard to put your finger on why, but the voices our computers use to speak just sound wrong. Even with the best voice programming, like Amazon’s Alexa or Apple’s Siri, computers sound, well, robotic when they talk. But that could change soon. Neural networks are now tackling the problem of making computer speech sound more natural, filling sentences with nonverbal sounds like lip smacks, breath intakes, and irregular pauses.
DeepMind, an Alphabet-owned world leader in artificial intelligence research, recently published a blog post about WaveNet, a convolutional neural network (like DeepDream) that can reduce the performance gap between computer and human speech by about 50%, researchers say. In other words, when listeners in blind tests rated how natural the speech sounded, WaveNet scored significantly better than other text-to-speech methods and landed closer to human speech. But how?
As the DeepMind team explains in their post, most computer voices, like Siri’s, are built from huge databases of speech fragments, recorded from a single speaker and then recombined by a computer to form sentences. This gives decent results, but has drawbacks. The initial databases are expensive and time-consuming to construct, and can’t be modified without recording a new database from scratch. That’s why, incidentally, you only hear so many computer voices out there. Apple can’t just program Siri to, say, speak with a sexy James Bond accent. The company would have to record hundreds of hours of someone who spoke with that accent first. This approach also contributes to computer speech’s uncanny valley problem, creating mostly accurate computer voices that still feel somehow *wrong*, and therefore repulsive, to our human ears. Unlike a human, a computer will say a word with the exact same pronunciation and cadence no matter how many times it repeats it.
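To make the fragment-and-recombine idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how Siri or any shipping system is built: the fragment names, the sine-wave stand-ins for recordings, and the `synthesize` helper are all hypothetical.

```python
import math

SAMPLE_RATE = 8000   # hypothetical toy sample rate, in samples per second
FRAGMENT_LEN = 400   # hypothetical fragment length, in samples

def tone(freq_hz):
    """Stand-in for one recorded fragment: a short sine burst."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(FRAGMENT_LEN)]

# Hypothetical fragment database. In a real system these would be
# recordings of one speaker, not synthetic tones.
FRAGMENTS = {
    "HH": tone(200),
    "EH": tone(300),
    "L": tone(250),
    "OW": tone(350),
}

def synthesize(units):
    """Glue stored fragments end to end, in the order requested."""
    audio = []
    for unit in units:
        # A unit that was never recorded raises KeyError: the only way
        # to add a new sound or voice is to record more fragments.
        audio.extend(FRAGMENTS[unit])
    return audio

hello_1 = synthesize(["HH", "EH", "L", "OW"])
hello_2 = synthesize(["HH", "EH", "L", "OW"])

# Every repetition of the same word is sample-for-sample identical.
print(hello_1 == hello_2)
```

The two limits described above show up directly: the output can only contain what is in the database, and saying the same word twice yields exactly the same audio.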
Here’s where WaveNet comes in. By feeding Google’s own voice database (the one used in OK Google) into a neural network, DeepMind was able to train WaveNet to actually recreate the sounds it needs to make up a sentence, with millions of incredibly slight variations. If that confuses you, think of it this way: Whereas most computers speak by piecing together blocks of prerecorded sounds, WaveNet essentially remembers those sounds and says them out loud when it needs to use them. It’s a key difference that makes WaveNet’s text-to-speech samples sound more natural than even those of industry-leading systems, like Google’s. Compared to the usual method, WaveNet talks in a more flowing, natural cadence. Take this sample sentence: “The Blue Lagoon is a 1980 American romance film directed by Randall Kleiser.” When Google’s default system says this sentence, it sounds as if there’s an air gap between each syllable. WaveNet, on the other hand, glides from phoneme to phoneme, like a human. Seriously, check it out for yourself.
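For contrast, here is a rough Python sketch of the generate-rather-than-replay idea: the audio is produced one sample at a time, and each new sample is drawn from a probability distribution that depends on the samples before it. This is only an illustration, not DeepMind’s code. The function `toy_next_sample_distribution` is a hypothetical stand-in for a trained network, and the 16 amplitude levels are an arbitrary choice.

```python
import math
import random

LEVELS = 16  # hypothetical number of quantized amplitude levels

def toy_next_sample_distribution(history):
    """Stand-in for a trained network (assumption, not WaveNet itself).

    Returns one probability per amplitude level, given the samples
    generated so far. This toy simply favours levels close to the
    previous sample so the waveform stays smooth.
    """
    prev = history[-1] if history else LEVELS // 2
    weights = [math.exp(-((level - prev) ** 2) / 2.0)
               for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, seed):
    """Build a waveform sample by sample, feeding each choice back in."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)
        next_sample = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        samples.append(next_sample)
    return samples

take_1 = generate(200, seed=1)
take_2 = generate(200, seed=2)

# Two "takes" of the same utterance follow the same model but are not
# sample-for-sample copies of each other.
print(take_1 == take_2)
```

Because each sample is drawn rather than looked up, repeated takes vary slightly, which is the kind of small variation a fixed fragment database cannot produce.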
Movies like Spike Jonze’s Her or even 2001: A Space Odyssey present a future in which talking to a computer is as natural as talking to a human. The truth is, despite our natural instinct to anthropomorphize our computers, most of us find it frustrating to interact with our UIs through speech. It’s early days yet, but it’s easy to see how approaches like WaveNet could make virtual assistants sound more natural, starting with Google’s own. Given that DeepMind and Google are both owned by Alphabet, don’t be surprised if these improvements start rolling out to Google’s text-to-speech functionality sooner rather than later.