The podcast production startup Descript has launched a private beta test for a feature called “Overdub,” which can use audio samples of a person’s voice to generate new words or phrases. The company is looking for podcasters, YouTubers, audiobook creators, and other audio pros to test the new feature, which is supposed to save time and money on rerecording.
“The idea here is really to save people a trip back to the recording booth, which is such a pain if you’re doing any kind of recording,” says Andrew Mason, Descript’s CEO. “This just really opens it up for people to be able to make editorial corrections on the fly that generally sound really good and usable.”
Typing in audio
Mason, who cofounded Groupon more than a decade ago, created Descript in 2017 as a spinoff from his previous startup, an audio tour app called Detour. In the process of creating audio tours, Detour built its own tools that would let editors modify audio by editing a speech-to-text transcript. Delete a stray word or jumbled sentence from the transcript, for instance, and it will vanish from the audio recording as well. This turned out to be pretty useful for podcast editing, which is now the main application for Descript’s Windows and Mac software.
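The transcript-driven editing described above can be sketched in a few lines of Python. This is only an illustration, not Descript's actual implementation: it assumes the speech-to-text step yields a start and end time for every word, and the names Word, keep_ranges, and apply_cuts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def keep_ranges(words, deleted, duration):
    """Return the (start, end) time ranges that survive after deleting words.

    `words` must be sorted by start time; `deleted` is a set of word indices
    the editor removed from the transcript.
    """
    ranges = []
    cursor = 0.0
    for i, word in enumerate(words):
        if i in deleted:
            # Keep everything between the last cut and this deleted word.
            if word.start > cursor:
                ranges.append((cursor, word.start))
            cursor = max(cursor, word.end)
    if cursor < duration:
        ranges.append((cursor, duration))
    return ranges


def apply_cuts(samples, sample_rate, ranges):
    """Concatenate the audio samples that fall inside the kept ranges."""
    out = []
    for start, end in ranges:
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        out.extend(samples[first:last])
    return out


# Hypothetical aligned transcript: deleting the stray "um" (index 1).
words = [Word("so", 0.0, 0.3), Word("um", 0.4, 0.7), Word("hello", 0.8, 1.2)]
ranges = keep_ranges(words, deleted={1}, duration=1.5)
```

Deleting a word in the transcript then amounts to dropping its time span from the recording. A real editor would presumably also smooth the joins with short crossfades, which this sketch leaves out.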
Overdub is supposed to address the biggest missing piece in Descript’s “word processor for audio” concept, letting users generate new words in addition to just deleting or shuffling existing ones. In a demo, Mason showed me how he could type into a voice actress’s existing transcript to synthesize new audio that matched her voice. When limited to a single word or a short phrase, it sounded just like the real thing.
“It will not only generate speech, but it’ll do it in a way where it’s trying to do a tonal connect-the-dots between the audio that came before and after,” Mason says.
Behind the Overdub feature is another startup called Lyrebird, which Descript is now acquiring for an undisclosed amount and billing as its AI research team. Until now, Lyrebird was letting people clone their own voice with a tool on its website. The process involved recording a series of random sentences so that Lyrebird could train its AI model, and it only took a few minutes. That tool will be shutting down as Lyrebird folds its audio synthesis features into Descript.
You can imagine an array of ways such a technology might be used for nefarious purposes. But Mason says Lyrebird’s setup process inherently prevents bad actors from faking someone else’s voice. Because it requires the user to utter random sentences, and those utterances must match up with the transcript for Lyrebird to process them, whoever’s being sampled would almost certainly have to know that they’re participating.
“It’s a really simple little thing, but if you think about it, there’s really nothing you can do to get around it,” he says.
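The check Mason describes can be approximated as follows. This is a rough sketch under stated assumptions, not Lyrebird's code: the word list, the function names, and the 0.9 similarity threshold are all hypothetical, and the recognized text is assumed to come from a speech recognizer that is not shown here.

```python
import difflib
import random
import re

# Hypothetical vocabulary used to build unpredictable prompts.
VOCAB = ["amber", "river", "seven", "quiet", "ladder", "orange",
         "window", "pencil", "thunder", "velvet", "marble", "garden"]


def make_prompt(rng, n_words=6):
    """Build a random sentence the speaker could not have recorded in advance."""
    return " ".join(rng.choice(VOCAB) for _ in range(n_words))


def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def matches_prompt(prompt, recognized, min_ratio=0.9):
    """Accept a recording only if its transcript closely matches the prompt.

    `recognized` is the text a speech recognizer produced for the recording.
    The threshold is an arbitrary illustrative value.
    """
    matcher = difflib.SequenceMatcher(None, normalize(prompt), normalize(recognized))
    return matcher.ratio() >= min_ratio


prompt = make_prompt(random.Random())
accepted = matches_prompt(prompt, prompt.upper() + "!")
```

Because the prompt is generated fresh for each session, existing recordings of a person are unlikely to contain the required words in the required order, which is the property Mason is relying on.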
Work in progress
While it makes for an impressive demo, Descript’s speech generation still has its limitations.
For one thing, Descript used hours of audio to train the AI model for its demonstration, with special permission from the voice actress. Mason says Descript is still figuring out how much audio it’ll need for Overdub but acknowledges that it’ll be more than the few minutes that Lyrebird was requiring on its demo site.
That explains why Descript is starting with a private beta for audio professionals: If a good speech model requires a marathon session of uttering random voice samples, it’ll only make sense for people who routinely spend hours in a recording studio.
“The type of customers that we’re targeting are people who have their own podcasts or are doing a lot of voice audio work and reaching the audio threshold is not really a concern for them,” Mason says.
Also, even with hours of sample audio, Descript’s speech synthesis becomes more noticeable when it has to string more than a few words together. In the demo I heard, for instance, the clone audio stuttered in the middle of the word “doll” when it was part of a longer synthesized phrase. For now, the tech won’t be useful for generating full sentences, let alone entire podcasts.
“We expect that to change over time, but the use case we’re focused on right now is these smaller editorial corrections that are very common,” Mason says.
Descript isn’t saying how long it’ll keep Overdub private or how broadly it’ll run its beta test. But in the short term, the beta might serve another purpose: drawing attention to the software as a whole. The private beta for Overdub is part of a larger Descript update for all users, adding multitrack editing and the ability to create and edit group recording sessions over the internet. It’s technically version 3.0, but Mason thinks of it more as Descript’s first major release.
“It’s the first time that you will be able to create a podcast soup to nuts in Descript,” he says.
To further build on the app, Descript has raised $15 million from Andreessen Horowitz and Redpoint, and it’s working on new editing features such as postproduction effects and one-click publishing to podcast platforms.
Such additions might not be as technically impressive as Overdub, but they’re as essential to podcast production as a spellchecker is to word processing. Compared to cloning your own voice with AI, they might be a little less unsettling as well.