There are many indignities involved with having an accent. To begin with, many strangers will say they can never understand you. Added to the list, in our modern world, is the inability of those with accents to get the most out of their voice recognition technology. Shortly after Siri launched on the iPhone 4S, Scottish users took to the Internet to protest that she couldn’t understand their accents. A month later, those criticisms persist: The past week has brought complaints that Siri still fumbles with Southern and Indian accents, too.
The good news, according to several experts, is that Siri and other voice recognition software will inevitably get better at understanding accents. Though understanding accents poses a particular problem for voice recognition, research is advancing, and the increasing stores of data on accents mean that Siri will improve with time. “All recognizers get better every year,” and Siri will be no exception, Dan Jurafsky, a professor of linguistics and computer science at Stanford, tells Fast Company.
How exactly does voice recognition software deal with accents? In general, voice recognition works by first collecting vast amounts of data in the standard pronunciation of a language. Researchers then build a “dictionary” mapping words to their corresponding sound components, called “phones.” Once that standard dictionary is constructed, says Jurafsky, there is a multi-step process for ameliorating the accent problem: 1) collect as much accented data as possible, 2) combine the larger standard speech data with the smaller amount of accented data, and 3) create a modified pronunciation dictionary for the accent. There are then two final steps that kick in once a particular user begins interacting with the software: 4) identify the user’s accent, and 5) use adaptation techniques to quickly shift your model toward the user’s accent.
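As a rough illustration of steps 3 and 4, the sketch below layers accent-specific pronunciations over a standard dictionary and then guesses a speaker's accent from a few observed pronunciations. Everything in it is an assumption made for illustration: the words, the phone labels, the accent names, and the helper functions are hypothetical toy data, not real accent models and not how Siri or Nuance actually work.

```python
# Toy sketch of an accent-aware pronunciation dictionary.
# All words, phone sequences, and accent names are illustrative assumptions.

# The "standard" dictionary: each word maps to a sequence of phones.
STANDARD = {
    "car": ("K", "AA", "R"),
    "water": ("W", "AO", "T", "ER"),
    "about": ("AH", "B", "AW", "T"),
}

# Step 3: per-accent overrides for words pronounced differently.
ACCENT_OVERRIDES = {
    "standard": {},
    "scottish": {"about": ("AH", "B", "UW", "T")},
    "boston": {"car": ("K", "AA")},
}


def build_dictionary(accent):
    """Return the standard dictionary with one accent's overrides applied."""
    merged = dict(STANDARD)
    merged.update(ACCENT_OVERRIDES.get(accent, {}))
    return merged


def identify_accent(observations):
    """Step 4: pick the accent whose dictionary matches the most observations.

    `observations` is a list of (word, phones) pairs heard from one speaker.
    Ties go to the accent listed first in ACCENT_OVERRIDES ("standard").
    """
    def score(accent):
        dictionary = build_dictionary(accent)
        return sum(1 for word, phones in observations
                   if dictionary.get(word) == phones)

    return max(ACCENT_OVERRIDES, key=score)


def recognize(phones, dictionary):
    """Return the word whose pronunciation matches `phones`, or None."""
    for word, pronunciation in dictionary.items():
        if pronunciation == phones:
            return word
    return None


heard = [("about", ("AH", "B", "UW", "T")), ("car", ("K", "AA", "R"))]
accent = identify_accent(heard)
word = recognize(("AH", "B", "UW", "T"), build_dictionary(accent))
```

In this toy setup, a pronunciation that the standard dictionary cannot match at all becomes recognizable once the accent-specific dictionary is selected, which is the gap Jurafsky's step three is meant to close. A real recognizer works with probabilities over audio features rather than exact phone matches.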
After watching some of the YouTube videos of Scottish Siri haters, Jurafsky said he suspects “the problem may be caused by problems in the pronunciation dictionary for the very strongly accented speakers.” In his view, companies tend to skimp on step three of the five-step process above: the creation of modified dictionaries. Though Apple declined to comment for this story, its Siri FAQ page offers mixed messages. On the one hand, it says that “Siri uses voice recognition algorithms to categorize your voice into one of the dialects or accents it understands.” On the other hand, it refers to “United Kingdom” and “American” English as though each were a uniform thing, which supports the theory that Apple may not have thorough pronunciation dictionaries for specific regional accents.
Siri surely launched with some data on accents, though, given that it is widely believed to use speech recognition technology provided by Nuance. (“We can’t say much about our relationship with Apple,” Nuance’s Peter Mahoney says. “All we can say is we provide technology for certain Apple products.”) Nuance, for its part, employs distinct models for eight varieties of accented English commonly heard in the U.S. Northeast, Southern, Midlands, Southeast Asian, Indian, U.K., Hispanic, and even “a generic children’s model” are all represented in the Nuance product Dragon NaturallySpeaking, which offers voice control for PCs.
There’s another important feature of Siri that makes it likely to get better at accents quite soon: its mobile, cloud-based nature. Nuance’s Dragon NaturallySpeaking software lives on a hard drive; this makes it faster, but it limits the pool of data from which it can draw. Siri, meanwhile, is constantly collecting data not just on your accent, but on the accents of hundreds or thousands of people who sound like you. “In a cloud-based system like Siri, most of these systems are adaptive systems,” says Mahoney. “The more people use it, the smarter it gets. We’re going to get more data and have the ability to get more refined views of different accents, and over time it’s just going to get more and more accurate.” One emerging area of research in voice recognition, says Mahoney, is “basically inventing ways to use more and more and larger sets of data.”
Data doesn’t exist in a vacuum, though, say Mahoney and Jurafsky: The contributions and insights of researchers are needed to leverage that data to make incremental improvements to the technology. There are many prongs of attack on the accent problem, though it takes a PhD to even begin to understand them: Nuance spokesperson Rebecca Paquette says that key research is ongoing in “every aspect of the recognition and understanding problem: ‘feature extraction,’ statistical acoustic modeling, pronunciation modeling, language modeling, syntax and semantic analysis, conversation management, and more.”
Asked about a paper he co-authored on detecting Shanghai-accented Mandarin, Jurafsky says “it didn’t change the world, let me say that.” But a technique it advocated, isolating the signature sounds that distinguish an accent rather than using all the aggregated data on that accent, has led to incremental improvement. This is a realm not of “revolutionary breakthroughs,” says Jurafsky, but rather of “a half-percent error reduction rate here, a half-percent error reduction rate there.” Enough small victories, and you have a beta product; enough more, and you satisfy even the most frustrated users.
Until then, though, Scottish users may want to depend on voice recognition only for non-mission-critical tasks. Even before Siri launched, the Scottish comic duo of Connell and Florence thought it unwise to rely on Siri-like technology in, for instance, elevators.