From algorithms that can automatically tag you in photos to face recognition embedded in city surveillance systems to voice generators that can put words in people’s mouths, AI is dismantling privacy. A new tool peels back the curtain a little more, with a method to estimate what your face looks like from your voice.
In research published on arXiv, a publishing site for non-peer-reviewed papers, MIT researchers created a way to reconstruct a very rough likeness of some people from a short audio clip. The paper, “Speech2Face: Learning the Face Behind a Voice,” explains how they took a dataset made up of millions of clips from YouTube and built a neural network-based model that learns, from the videos, which vocal attributes are associated with which facial features. Now, when the system hears a new sound bite, the AI can use what it has learned to guess what the speaker’s face might look like.
The researchers, led by MIT postdoctoral student Tae-Hyun Oh, do briefly acknowledge the privacy concerns in the paper, explaining in an “Ethical Consideration” section that Speech2Face was trained to capture visual features like gender and age that are common, and only when there was enough evidence from the voice to do so. In other words, the system is not trying or able to produce images of specific people.
Still, the researchers speculate, the AI “may support useful applications, such as attaching a representative face to phone/video calls based on the speaker’s voice.”
The resulting images are certainly very rough. But while they are not quite the quality of the latest computer-generated images that police departments put out to find missing children or crime suspects, many of the images land in the right ballpark for age, ethnicity, and gender. Previous research has explored methods for predicting age and gender from speech, but in this case, the researchers claim they have also detected correlations with some facial patterns. “Beyond these dominant features, our reconstructions reveal non-negligible correlations between craniofacial features (e.g., nose structure) and voice,” they write.
The system struggled with people of certain identities, however. Under the ethics section, the researchers acknowledge cases where attributes like spoken language or voice pitch caused the model to create highly erroneous associations and approximations of what the speaker looks like. This reflects the limits of machine learning, and the limits of the premise that a voice can be used to predict a face beyond basic stereotypes. With enough data, AI can find insignificant patterns anywhere.
“The training data we use is a collection of educational videos from YouTube, and does not represent equally the entire world population,” the authors write. “Therefore, the model—as is the case with any machine learning model—is affected by this uneven distribution of data.” They recommend “that any further investigation or practical use of this technology will be carefully tested to ensure that the training data is representative of the intended user population.”
The MIT tool hasn’t been released, but sample clips can be played online, each shown alongside a screenshot from the YouTube video it was pulled from and the generated face.
While Speech2Face’s results look more believable than those from similar research at Carnegie Mellon University (CMU), it might have problems that the other papers don’t. Bhiksha Raj, who worked on the CMU research, points out that the MIT paper does not include any code for outside developers to test, and he thinks the paper overstates what the system can accomplish. “In other words, there’s nothing in the paper that shows that they’re performing anything more extraordinary than predicting people’s gender, age, and ethnicity from their voice, with some error, and drawing a face which matches those,” says Raj.
As Slate reported, not everyone who was part of the dataset was thrilled to become an unwitting part of the project. Nick Sullivan, a technology researcher at Cloudflare, tweeted about discovering that his face and voice were in the paper, and about his attempt to learn how he became part of it. Many public and non-public face recognition databases rely on faces scraped from the web. For now, that kind of data harvesting may be protected by law: YouTube content is considered publicly available data, and any claims to copyright could likely be countered with a fair use argument.
Voice privacy has taken a backseat to the push to regulate face recognition, but there are plenty of places where our voices are already being used as a biometric data point, with or without our knowledge. Chase started using technology called “Voice ID” last year to recognize credit card customers when they call the bank, collecting and storing a sample of your voice unless you explicitly opt out. Correctional institutions across the country are building a database of “voiceprints” of thousands of incarcerated people.
Other researchers want AI to perform sentiment analysis on your voice, the next frontier of machines knowing more about us than we might like. Amazon filed for a patent earlier this year that could one day allow Alexa to recognize your emotional state and target ads based on your mood. Amazon said in a statement to The New York Times that it did not use voice recordings for targeted advertising. Of course, that doesn’t mean the company won’t do so in the future, or run all manner of other algorithms on people’s voices.
This story has been updated to include similar research from Carnegie Mellon University.