The San Francisco-based startup Voicery is only a few months old, but CEO and cofounder Bobby Ullman says he’s already had hundreds of requests from companies that are interested in developing their own branded voices. That’s because Voicery offers something most companies probably didn’t know they needed even just five years ago: a customized digital voice that sounds like an actual human, not a computer.

Ullman is a computer scientist who formerly worked at Palantir, and his cofounder, CTO Andrew Gibiansky, has experience in machine learning and worked on speech recognition at the Chinese company Baidu. The duo, who are childhood friends, applied to Y Combinator with a similar idea and honed it into Voicery in the Silicon Valley accelerator program.

Unlike the canned voices you’re likely to hear on customer service calls today, Voicery’s AI-synthesized voices sound human enough to convey carefully designed emotions that can act as an extension of a company’s brand. As more of our interactions with companies shift away from the visual and toward the verbal–whether thanks to Echo and Google Home or automated customer service systems–the tone, quality, and cadence of a company’s voice are becoming the new face of the brand.

Speech’s Uncanny Valley

Voice can be a powerful branding device–think of the familiar Jack in the Box voice or the rumbling voice of Allstate’s Dennis Haysbert. Yet you’ve probably cringed at how awkward Alexa sounds when she tells a joke. That’s because it’s incredibly hard for synthetic voices–which mimic human speech–to convey believable emotion with their halting, robotic cadence. Most of these computerized voices use an older method of speech synthesis called the concatenative model, in which a voice actor records up to 200 hours of speech, that speech is digitally chopped up into small bits of sound, and the bits are then reassembled into whatever you need the voice to say.
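For the technically curious, the chop-and-reassemble idea can be sketched in a few lines of Python. This is a toy illustration only, not how any shipping voice assistant is built: the sine tones stand in for recorded speech, and the names `UNIT_INVENTORY`, `fake_recording`, `crossfade`, and `synthesize`, along with the sample rate and overlap length, are assumptions made up for this example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this toy example)


def fake_recording(freq_hz, duration_ms=100):
    """Stand-in for a recorded snippet: a sine tone instead of real speech."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical inventory of sound units. A real system would cut these
# from many hours of a voice actor's recordings.
UNIT_INVENTORY = {
    "hel": fake_recording(220),
    "lo": fake_recording(330),
    "world": fake_recording(440),
}


def crossfade(a, b, overlap):
    """Join two snippets, blending the end of `a` into the start of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return a[: len(a) - overlap] + blended + b[overlap:]


def synthesize(units, overlap=80):
    """Look up each unit and stitch the snippets together in order."""
    audio = []
    for unit in units:
        snippet = UNIT_INVENTORY[unit]  # KeyError if the unit was never recorded
        audio = crossfade(audio, snippet, overlap) if audio else list(snippet)
    return audio


audio = synthesize(["hel", "lo", "world"])
```

The sketch also hints at why the approach needs so much recorded material: anything missing from the inventory simply cannot be spoken, and the seams between snippets are where a halting cadence can creep in.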

Voicery’s model works differently. It only needs a few hours of a voice actor’s speech, on which it trains a deep neural network to imitate that person’s voice. The entire process, from casting an actor, to having them read sets of phrases, to actually training the computer, takes about two weeks. Creating a single synthetic voice’s neural net model takes four days. At the moment, Voicery has three production-ready synthesized voices, drawn from voice actors or from audiobooks that are all in the public domain.

The technology, as it stands right now, is remarkable. On Voicery’s website, you can take a quiz that asks you to determine which voices are human and which are machine. There are very slight giveaways, but my results showed that I couldn’t tell the difference between Voicery’s AI and a human speaking a third of the time.

Warmth, Emotion, And Charisma

For companies looking to ensure that their brand is consistent across every interface, the impact of such tech could be great. What if that halting, monotone voice on the other end of the line responded like a person when you called your insurance company about a claim? Better yet, what if it responded with the voice of Allstate’s Dennis Haysbert?