In Hollywood, big money is getting lost in translation.
Sure, the global entertainment business is synced up like never before. Marvel blockbusters captivate audiences in China. Korean directors score one coup after another in the U.S. Streaming development executives now scour foreign markets to bring home the next Squid Game, Lupin, and Money Heist. And Western entertainment companies are pouring money into so-called localization efforts to ensure the sun never sets on Spider-Man. Disney upped its localization spending to $33 billion in 2022, a 32% increase, according to Variety. Streamers now include options for subtitles and audio in multiple languages, even in old and niche entertainment.
But even as companies invest in quality script translations and better performances by voice actors, dubbed entertainment often still looks as cheesy as old kung fu films and Mister Ed, turning audiences off. No matter how good the sound is, the picture seems wrong. Lips don’t lie.
“The lips are always, always the last piece that nobody’s solved for,” says Jonathan Bronfman, cofounder and CEO of the visual effects company Monsters Aliens Robots Zombies (MARZ).
Earlier this year, Bronfman’s company unveiled a technology called LipDub AI, which digitally manipulates actors’ facial expressions to match spoken words in foreign languages. The technology promises to achieve an extraordinary level of realism and fluency, learning to make actors’ lips match the language and the performers. Marlon Brando will mumble in Mandarin, Jim Carrey will gesticulate in German, and Arnold Schwarzenegger’s English . . . well. AI is making more progress every day.
In the beginning, lip-dubbing technology was a crude joke: Schwarzenegger screaming at a late-night TV host through the superimposed lips of another man (“I AM HEE-AH TO SAVE CALIFORNIA!”). But the promise of new AI-driven software means that global audiences may be laughing with such technology, not at it. They may also be crying, cheering, and loving performances in which actors deftly deliver lines in any of hundreds of languages, whether or not the performers have ever uttered a word in those tongues themselves.
LipDub’s technology is an evolution of an open-source AI model known as Wav2Lip, first released in 2020 by researchers at Hyderabad’s International Institute of Information Technology. Designed initially to synchronize lip movements in videos with specific audio tracks, it analyzes the input audio’s phonetic elements to identify different speech sounds. In parallel, it processes the video, focusing on the speaker’s face, especially the lip area. Wav2Lip uses deep learning models to understand the facial structure and predict corresponding lip movements. The technology combines audio analysis with video data to generate accurate lip synchronization. This results in a video where the lip movements match the spoken words in the audio track, enhancing realism for applications like movie dubbing, video conferencing, or animated characters.
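For readers curious what “combining audio analysis with video data” involves at the most basic level, the first step is bookkeeping: deciding which slice of the soundtrack belongs to which video frame. The Python sketch below illustrates only that alignment step. It is not Wav2Lip’s or LipDub’s actual code; the function name, the 200-millisecond window, and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AudioWindow:
    frame_index: int   # which video frame this window belongs to
    start_sample: int  # first audio sample in the window (inclusive)
    end_sample: int    # last audio sample in the window (exclusive)


def audio_windows_for_frames(num_frames, fps, sample_rate, window_ms=200):
    """Pair each video frame with a span of audio samples centred on it.

    Hypothetical helper: a lip-sync model would turn each span into
    audio features and use them to predict the mouth shape for that frame.
    """
    # Half the window length, in samples, using integer arithmetic.
    half = sample_rate * window_ms // 2000
    windows = []
    for i in range(num_frames):
        # Audio sample that plays at the moment frame i is shown.
        centre = round(i * sample_rate / fps)
        start = max(0, centre - half)  # clamp at the start of the track
        end = centre + half
        windows.append(AudioWindow(i, start, end))
    return windows


if __name__ == "__main__":
    # Assumed example values: 25 frames per second, 16 kHz audio.
    for w in audio_windows_for_frames(num_frames=3, fps=25, sample_rate=16000):
        print(w)
```

Everything that makes the result look convincing, namely recognizing speech sounds and redrawing the mouth, happens in the learned models that would consume these windows, which this sketch leaves out.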