
Are we ready for talking, emoting AI avatars?

[Image: D-ID]

By Jesus Diaz | 3 minute read

When the Israeli company D-ID launched in 2017, it wanted to give the world tools to stop governments and corporations from recognizing our faces. Today, it wants to create synthetic humans so perfect that everyone would recognize them as real human beings. To do that, the company is merging AI-based image, animation, text, and speech generation technologies. Think Lensa—but moving and talking.

The current version of this technology (released through D-ID’s Creative Reality Studio) is still quite far from reaching that ambitious goal, but it’s a preview of a world in which synthetic humans are everywhere, from social media to corporations’ customer service.

Merging all the pieces

Like Lensa, D-ID’s new web app uses Stable Diffusion to generate synthetic humans—only instead of producing static avatars, it creates animated heads and upper torsos from prompts like, “blonde woman with elvish ears and green skin.”
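D-ID hasn’t published the details of its generation stack, but the same prompt-to-portrait idea can be sketched with the open-source diffusers library, which wraps Stable Diffusion. The checkpoint, prompt, and settings below are illustrative assumptions, not D-ID’s production setup:

```python
# Illustrative only: generate a still portrait from a text prompt using
# open-source Stable Diffusion (via Hugging Face diffusers). The model
# checkpoint and prompt are assumptions, not D-ID's own pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "portrait of a blonde woman with elvish ears and green skin"
image = pipe(prompt).images[0]  # a PIL image of the synthetic face
image.save("avatar.png")
```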

[Image: D-ID]

After you pick your avatar from whatever Stable Diffusion spits out, D-ID’s interface has a field where you can write a script or ask OpenAI’s GPT-3 to write one for you. You can enter “five reasons why you should never put chorizo in paella,” for instance, and the AI will generate a script right then and there. D-ID’s app lets you pick the language, voice, and intonation your avatar should use. After that, it’s just a matter of clicking Create. This is where the system uses Amazon’s text-to-speech AI and D-ID’s proprietary animation algorithm to produce the final output: a Harry Potter-esque animated portrait that can say whatever you want.
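The article doesn’t expose D-ID’s internals, but the middle of that pipeline can be roughed out with public tools. Here is a minimal sketch that assumes Amazon Polly as the text-to-speech engine (the voice choice is an assumption), with a hypothetical animate_portrait placeholder standing in for D-ID’s proprietary, undocumented animation step:

```python
# Illustrative sketch of the script-to-speech stage described above.
# Assumptions: Amazon Polly as the TTS engine; animate_portrait() is a
# hypothetical placeholder for D-ID's proprietary animation algorithm.
import boto3

def synthesize_narration(script: str, voice: str = "Joanna") -> bytes:
    """Turn the generated script into an audio track with Amazon Polly."""
    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=script,
        OutputFormat="mp3",
        VoiceId=voice,
    )
    return response["AudioStream"].read()

def animate_portrait(portrait_path: str, audio: bytes) -> str:
    """Hypothetical stand-in: given a still portrait and an audio track,
    this is where D-ID's animation step would produce the talking video."""
    raise NotImplementedError("Proprietary step; not publicly documented.")

if __name__ == "__main__":
    script = "Five reasons why you should never put chorizo in paella: ..."
    audio = synthesize_narration(script)
    # video = animate_portrait("avatar.png", audio)
```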

[Image: D-ID]

If this looks familiar, it may be because you have already seen an earlier version of that animation algorithm in action: it powered Deep Nostalgia, an app that went viral last year by letting you animate your dead relatives, or anyone else, from a single photo.

A preview of what’s coming

The avatars generated by D-ID are designed to be artistic renditions, not perfect replicas of yourself. And, like Deep Nostalgia, the animation generated from a single photo is only good (albeit spooky) when the avatar is not talking. The speaking part doesn’t pass the uncanny valley test: that unsettling sense that something almost human isn’t quite real.

To get a more realistic avatar, D-ID offers a higher-end option called Premium Presenter that doesn’t use Stable Diffusion or photos to generate an animation. D-ID CEO Gil Perry says corporations are using it for marketing materials, training videos, and other commercial content.

Perry showed me a practical example from a company (he couldn’t disclose the name because it’s under NDA) that plans to send millions of messages during the holidays, each featuring an AI-generated CEO addressing customers by name from a personalized script. The animation technology is basically the same, but this one used a custom biometric model of the CEO’s face, which required capturing several minutes of video, and a cloned voice instead of the standard text-to-speech model. The result is much more realistic, perhaps good enough to pass the uncanny valley test for some people, especially on a small phone screen. Even so, you could tell we are still a year or two away from truly getting to that point.

The other thing D-ID hasn’t achieved yet is making the avatars speak in real time. That will be an essential next step in creating high-fidelity synthetic assistants, customer reps, salespeople, friends, and lovers. There are still roadblocks. The animation, according to Perry, is not one of them. “Right now we are capable of animating in half the time of the video, so that’s faster than real time,” he explains. The barriers seem to be around text generation and text-to-speech, and how all the pieces interact. To create truly believable synthetic beings, all these components will have to work a lot faster and be tightly integrated.
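To put Perry’s figure in perspective (using only the number he gives), a stage whose processing time is half the clip’s duration has a real-time factor of 0.5, comfortably under the 1.0 threshold that real-time operation requires. The one-minute clip in the sketch below is a hypothetical example, not a quoted number:

```python
# Back-of-the-envelope check of Perry's "half the time of the video" figure.
def real_time_factor(processing_seconds: float, clip_seconds: float) -> float:
    """Below 1.0 means the stage runs faster than real time."""
    return processing_seconds / clip_seconds

clip = 60.0             # hypothetical one-minute clip
animation = 0.5 * clip  # Perry: animation takes half the clip's duration
print(real_time_factor(animation, clip))  # 0.5 -> faster than real time
```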

When will that happen? Perry is optimistic that it will be soon. “In five months, it’ll be real time,” he tells me. After that, it’s only a matter of iterating on these technologies until the final output passes the uncanny valley test. He thinks that, “for now, it looks good enough [talking about the CEO’s personalized videos], but by next year it will look better. In two years, it will be perfect,” he says.

And when it does, he thinks it will totally change the way we interact with computers and with the world. The only question after that—one we should ask ourselves before that moment arrives—is how we will react to these beings as a society. Perry thinks this waiting time is good: “The world will need time to get used to this technology.”


ABOUT THE AUTHOR

Jesus Diaz is a screenwriter and producer whose latest work includes the mini-documentary series Control Z: The Future to Undo, the futurist daily Novaceno, and the book The Secrets of Lego House.

