The San Francisco-based startup Voicery is only a few months old, but CEO and cofounder Bobby Ullman says he’s already had hundreds of requests from companies that are interested in developing their own branded voices. That’s because Voicery offers something most companies probably didn’t know they needed even just five years ago: a customized digital voice that sounds like an actual human, not a computer.
Ullman is a computer scientist who formerly worked at Palantir, and his cofounder, CTO Andrew Gibiansky, has experience in machine learning and worked on speech recognition at the Chinese company Baidu. The duo, who are childhood friends, applied to Y Combinator with a similar idea and honed it into Voicery in the Silicon Valley accelerator program.
Unlike the canned voices you’re likely to hear on customer service calls today, Voicery’s AI-synthesized voices sound human enough to convey carefully designed emotions that can act as an extension of a company’s brand. As more of our interactions with companies shift away from the visual and toward the verbal–whether thanks to Echo and Google Home or automated customer service systems–the tone, quality, and cadence of a company’s voice are becoming the new face of the brand.
Speech’s Uncanny Valley
Voice can be a powerful branding device–think of the familiar Jack in the Box voice or the rumbling voice of Allstate’s Dennis Haysbert. Yet you’ve probably cringed at how awkward Alexa sounds when she tells a joke. That’s because it’s incredibly hard for synthetic voices–which mimic human speech–to convey believable emotion with their halting, robotic cadence. Most of these computerized voices use an older method of speech synthesis called the concatenative model, which works in three steps: a voice actor records up to 200 hours of speech, that speech is digitally chopped up into small bits of sound, and those bits are reconstituted into whatever the system needs to say.
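For readers curious what "chopping up and reconstituting" means in practice, here is a toy sketch in Python. It is only an illustration of the general idea, not how any particular product works: the unit labels, the sample values, and the `synthesize` function are all hypothetical, and a real system would store a vastly larger inventory of audio snippets and smooth the joins between them.

```python
# Toy illustration of concatenative synthesis.
# The unit names and sample values below are hypothetical stand-ins
# for the small bits of recorded sound described in the text.

# Step 1: recorded speech, already chopped into small labeled units.
UNIT_INVENTORY = {
    "HH": [0.1, 0.2],
    "EH": [0.3, 0.4, 0.5],
    "L": [0.2, 0.1],
    "OW": [0.6, 0.7, 0.6],
}


def synthesize(units):
    """Reconstitute speech by stitching stored units together in order."""
    waveform = []
    for unit in units:
        if unit not in UNIT_INVENTORY:
            # Nothing was ever recorded for this unit, so it cannot be spoken.
            raise KeyError(f"no recording for unit {unit!r}")
        waveform.extend(UNIT_INVENTORY[unit])
    return waveform


# Step 2: build a new utterance out of the stored pieces.
hello = synthesize(["HH", "EH", "L", "OW"])
print(len(hello))  # 10 samples: 2 + 3 + 2 + 3
```

The sketch simply glues the pieces end to end, and it can only say what the inventory covers, which hints at why this approach needs so many hours of recordings.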
Voicery’s model works differently. It only needs a few hours of a voice actor’s speech, on which it trains a deep neural network to imitate that person’s voice. The entire process, from casting an actor, to having them read sets of phrases, to actually training the computer, takes about two weeks. Creating a single synthetic voice’s neural net model takes four days. At the moment, Voicery has three production-ready synthesized voices, drawn from voice actors or from audiobooks that are all in the public domain.
The technology, as it stands right now, is remarkable. On Voicery’s website, you can take a quiz that asks you to determine which voices are human and which are machine. There are very slight giveaways; my results showed that I couldn’t tell the difference between Voicery’s AI and a human speaking a third of the time.
Warmth, Emotion, And Charisma
For companies looking to ensure that their brand is consistent across every interface, the impact of such tech could be great. What if that halting, monotone voice on the other end of the line responded like a person when you called your insurance company about a claim? Better yet, what if it responded with the voice of Allstate’s Dennis Haysbert?
“Toyota is going to have a voice,” Ullman says. “Self-driving cars are going to have a voice, a feeling, and a personality. That’s very important to your interaction with the car. All these things are going to start becoming iconic.”
The believability–and charisma–of a voice is more important than you might expect for companies that want to build relationships with their users. If a health-tracking startup’s Alexa skill sounds more like your friend yelling encouragement than a robotic “you can do it,” perhaps you’ll feel a stronger affinity toward that brand.
Meanwhile, computerized voices of the moment leave very little room for expressing personality or diversity. Even the Google Assistants and Siris of the world share the same neutral female voice that lacks much emotional cadence, making them virtually indistinguishable. There are efforts to imbue chatbots and voice interfaces with personalities, but the synthetic quality of their tone tends to squash any chance of rapport with users.
Because its synthetic voices are so hard to distinguish from the real thing, Voicery’s tech could improve other forms of media beyond advertising–like automating audiobooks, making more media audio-accessible, and even making voice dubbing in films much easier. “The problem with text-to-speech with media use is you can’t listen to it for very long because it’s repetitive and boring,” Ullman says. “With this new technology, it sounds much more realistic and it’s much more enjoyable. It’s creating a new market. It could change the way people consume media.”
Who Gets Synthesized?
As with other AI that can generate fake videos, there are ethical questions about what kinds of voices the startup should synthesize. Researchers have already created videos of Obama giving fake speeches using video and audio clips from his eight years in the Oval Office. Ullman intends to draw a hard line on what kinds of voices Voicery will or will not create. “As these tools get better, you have to care about the ethics, and it’s important that people maintain ownership of themselves and their voice,” Ullman says.
So far, the company has only worked with voice actors, who are told exactly what their voices will be used for, or with audiobooks in the public domain. It won’t pull people’s voices from media or movies (like the Obama researchers did), partly because the quality usually isn’t good enough and partly because the startup views it as unethical. Voicery’s website states that the company will never imitate someone’s voice without their consent.
For the time being, Voicery’s next step will be scaling to meet the kind of demand it’s seeing. Ultimately, Ullman hopes that Voicery can build a library of hundreds to thousands of off-the-shelf voices in a variety of languages–making it something of a platform for anyone who needs to license a synthetic voice. Along with this library, the company will also work with clients to create voices that are exclusive to them; this B2B service will be its primary business.
Like the work of people giving conversational agents humorous vibes and building personalities for chatbots, Voicery’s tech illustrates how speech is embedding itself in branding in an entirely new way. After all, as computers migrate from screens into the spaces we inhabit, we’ll want a way to interact with them that doesn’t feel–or sound–robotic.