Siri, Alexa, and the Google Assistant all use spliced-up human voices to tell you the weather or wake you up in the morning. It’s standard text-to-speech technology: voice actors record thousands of sentences, and a computer chops the recordings into pieces that can then be algorithmically rearranged.
But these voices sound like the robots that they are. For IBM’s latest AI project, a system that can hold lengthy debates on a wide variety of topics with a real person, the company needed a voice that would be persuasive and dynamic–one that would sound more human.
But how do you create an AI voice that’s layered with real emotion?
That was the challenge for Andy Aaron, a researcher at IBM who led the search for the perfect voice for Project Debater, as it’s called. Aaron isn’t a computer scientist or an engineer; he’s a sound designer who previously worked on dozens of Hollywood films and television shows before a friend convinced him to try out creating text-to-speech voices for IBM. Aaron was hooked immediately and has worked at IBM for two decades on all kinds of projects, including creating the voice for Watson.
But the voice Aaron envisioned for Debater was in a different category altogether from Watson’s. His typical text-to-speech projects require a voice actor to read a few thousand sentences before the work is handed off to an algorithm; Debater’s voice needed to be far more complex. To understand which components were necessary, Aaron and his team watched dozens of real, human debates and analyzed the variety of tones people use to make their arguments: an anecdotal voice, a rebuttal voice, a voice you’d use when addressing the audience directly, and more. Then he set out to find a person who had enough control over her voice to speak with great consistency while also shifting among these different cadences.
“This is the hardest narration job anybody will ever have,” Aaron says. “It’s really difficult material to read and it’s endless.”
To find his voice, Aaron met with about 20 people, split evenly between men and women, and had them read an incredibly difficult script, full of tongue twisters and foreign names, cold, without seeing it beforehand. Then the five most promising actors each recorded 1,000 lines, and Aaron created makeshift computer voices from these sentences. Once he had the rudimentary versions of the actors’ computer voices, he programmed each of them to say another 10 sentences. The actor whose synthesized computer voice sounded the best got the job.
“Apparently the computer chose me,” says Eliza Foss, an actor and voice-over narrator whose dulcet tones won Aaron over and became the basis for Debater’s robotic voice. Just from our brief conversation over the phone, I could see why: Her voice is soothing, smooth, and confident. It’s a voice you could listen to for hours.
Foss doesn’t normally lend her vocal cords to tech companies. Over the course of her career, she’s recorded over 100 audiobooks, appeared on Law & Order, and even played a voice in Grand Theft Auto V. Her last gig was as an understudy for a play at Lincoln Center in New York City.
But she’s not totally foreign to text-to-speech either: About 15 years ago, she participated in what she characterized as one of the first text-to-speech projects for a company in Edinburgh, Scotland. “I believe my voice is sold all over Europe to hotels and cars and things like that,” she says.
Becoming an AI’s voice was unlike any job she’d done before. Foss and Aaron spent between 30 and 40 hours in the recording studio over the course of a month. She had to be technically perfect–one mistake, and they’d have to repeat the entire paragraph. The hardest part was keeping her voice at exactly the same tone for hours at a time. “I had to be consistent, have my voice in same range, not move, and really speak for long periods of time without making a mistake,” she says. “That’s unusual in recording, to be that consistent and not too emotional.”
It’s ironic, given the nature of her task: to create a voice that conveys more emotion than most text-to-speech robots do. But we are still talking about a computer, after all. “We don’t want to use the technology to fool people,” Foss says. “We want you to be reminded that you’re not talking to an actual human being.”
Foss wasn’t present for IBM Debater’s unveiling earlier this summer, but she’s watched the videos of the computer and its performance. Luckily, it doesn’t sound too much like her, which she says would have creeped her out. “I thought she sounded great,” Foss says. “I was surprised at how much humanity she had.”
There’s only one point that Foss isn’t happy with: Unlike Siri and Alexa, Debater doesn’t have a name. So Foss gave her one, in homage to her older AI brother: “Watsina.”