A voice-controlled virtual assistant–Siri, Alexa, Cortana, or Google Home–is only as good as the data that powers it. Training these programs to understand what you are saying requires a whole lot of real-world examples of human speech.
That gives existing voice recognition companies a built-in advantage, because they have already amassed giant collections of sample speech data that can be used to train their algorithms. A startup with hopes of competing in this arena would have to acquire its own set of voice audio files, perhaps from an existing archive such as the roughly 300-hour corpus built from TED Talk transcriptions.
Developers generally need access to hundreds or thousands of hours of audio, says Alexander Rudnicky, a research professor at Carnegie Mellon University and director of the Carnegie Mellon Speech Consortium.
Google acknowledged as much on Thursday, in releasing a crowdsourced dataset of global voice recordings. The 65,000 one-second audio clips include people from around the world saying simple command words–yes, no, stop, go and the like. This comes just a couple of weeks after Mozilla, the organization behind the open source Firefox browser, introduced a new project called Common Voice. Its goal is to build a freely available, crowdsourced dataset of voice samples from around the world, covering a wide variety of sample words and sentences.
Google’s recordings were collected as part of the AIY do-it-yourself artificial intelligence program, designed to enable makers to experiment with machine learning. “The infrastructure we used to create the data has been open sourced too, and we hope to see it used by the wider community to create their own versions, especially to cover underserved languages and applications,” wrote software engineer Pete Warden in announcing the release.
In full, it’s more than a gigabyte of sound, but that’s just a tiny fraction of the total amount of voice data Google has collected to train its own AI systems. The company once operated an automated directory assistance service that, it turned out, served primarily as a way to gather human voice data.
Amazon’s Alexa transmits voice queries from its users to a server, where they’re used to further train the tool. Apple teaches Siri new languages and dialects by hiring speakers to read particular passages of known text, and by having humans transcribe snippets of audio from the service’s speech-to-text dictation mode. Microsoft has reportedly set up simulated apartments around the world to grab audio snippets in a homelike setting to train its Cortana digital assistant.
All of that is privately held, and generally unavailable to academics, researchers, or would-be competitors. That’s why Mozilla decided to launch its Common Voice project.
“As we started on building on these systems, we found that we could build on the works of others in terms of algorithms, and do our own innovative work in terms of algorithms, but for all of these, the data curation, creation and aggregation was a challenge,” says Sean White, Mozilla’s SVP of emerging technology. “If you wanted to do a new speech-recognition system, you couldn’t just go out and find a high-quality data set to use.”
Common Voice invites anyone with an internet connection and a microphone to submit brief recordings of themselves reading particular sentences, all through a couple of clicks or taps in a web browser. That’s similar to how Google’s project works, although Common Voice asks people to submit full sentences while Google asked only for particular words and numbers commonly used as commands. The sentences are a mix of conversational phrases submitted by contributors–“She gave me back the charger” is one from the project’s GitHub files–and quotes from classic movies like Charade and It’s a Wonderful Life. Mozilla also asks participants to supply some basic demographic information, like age, gender and dialect of English spoken (such as United States English, Canadian English or English from the West Indies and Bermuda).
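Each contribution is essentially a small record: an audio clip, the sentence it reads, and optional self-reported demographics. A minimal sketch of how such a record might be modeled in Python (the field names here are illustrative, not Common Voice's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceSample:
    """One crowdsourced recording paired with the sentence it reads.

    Field names are illustrative, not Common Voice's actual schema.
    """
    clip_path: str                # path to the submitted audio file
    sentence: str                 # prompt text the contributor read aloud
    age: Optional[str] = None     # demographics are optional and self-reported
    gender: Optional[str] = None
    accent: Optional[str] = None  # e.g. "United States English"

sample = VoiceSample(
    clip_path="clips/sample_0001.mp3",
    sentence="She gave me back the charger",
    accent="United States English",
)
```

Keeping the demographic fields optional mirrors the project's approach: contributors can submit a clip without revealing anything about themselves.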
In roughly its first 57 days, the project collected about 307,000 recordings, each about 3 to 5 seconds long. That makes for between 340 and 520 hours of total audio, says Michael Henretty, a digital strategist working with Mozilla’s open innovation team.
“We’ve already surpassed the TED talks, which is one of the bigger open source data sets that’s out there,” he says.
Mozilla aims to release a version of the dataset later this year and is hoping to have 10,000 hours of audio by that time, the quantity it estimates is enough to train a modern, production-quality system. That’s far bigger than the 18 hours of clips that Google just made available.
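The gap in scale is easy to put in perspective with back-of-the-envelope arithmetic: total hours are just the clip count times the seconds per clip, divided by 3,600. A quick sketch (the 4-second average clip length used for Mozilla's target is an assumption for illustration):

```python
def dataset_hours(num_clips: int, seconds_per_clip: float) -> float:
    """Convert a clip count and per-clip duration into total hours of audio."""
    return num_clips * seconds_per_clip / 3600

# Google's release: 65,000 one-second command words, roughly 18 hours in all
google_hours = dataset_hours(65_000, 1.0)

# Mozilla's 10,000-hour target, assuming clips average about 4 seconds,
# implies on the order of 9 million individual recordings
clips_needed = 10_000 * 3600 / 4
```

That back-of-the-envelope figure shows why Mozilla is courting contributors at scale rather than recording speakers one studio session at a time.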
One key reason to assemble a large and varied set of voice samples is that the algorithms trained on it are less likely to pick up unintended biases. As anyone with a heavy accent who has tried to use a voice assistant can attest, these systems are still much better at understanding some varieties of English than others.
Rachael Tatman, a data preparation analyst at Google-owned data science platform Kaggle, published a paper earlier this year on how gender and dialect affected accuracy in YouTube’s automated captions. She found YouTube’s captioning was less accurate for women and for speakers from Scotland, but different error patterns can appear in different systems depending on what training data gets used.
“If I’ve seen a lot of speech from women from Virginia, I’m gonna be really accurate on women from Virginia, and less accurate on men from California,” says Tatman.
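Tatman's point can be made concrete. Word error rate, the standard accuracy metric for speech recognition, is the word-level edit distance between the reference transcript and the system's output, divided by the reference length; computing it separately per demographic group surfaces exactly this kind of skew. A minimal sketch (the grouping helper and any data fed to it are illustrative, not from Tatman's study):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def wer_by_group(results):
    """Average WER per demographic group.

    `results` is an iterable of (group, reference, hypothesis) tuples.
    """
    totals = {}
    for group, ref, hyp in results:
        totals.setdefault(group, []).append(word_error_rate(ref, hyp))
    return {g: sum(v) / len(v) for g, v in totals.items()}
```

A system that looks accurate on average can still score far worse for one group than another; breaking the metric out this way is how mismatches like the YouTube captioning gaps become visible.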
Existing open source datasets have been found to have their own biases—the so-called Switchboard conversational dataset initially collected by Texas Instruments and now hosted as part of the University of Pennsylvania’s Linguistic Data Consortium has been found to skew Midwestern, for instance. Biased data has been an issue in other areas of artificial intelligence as well–some algorithms have been found to be better at recognizing white faces or to have trouble understanding African-American vernacular English in tweets, for instance–and naturally is a particular concern for tech companies and open source projects looking to serve a diverse audience.
Mozilla also invites everyday users to validate submitted samples by listening to the recordings and verifying that they match the text they were supposed to capture. On a recent day, samples served up by the website for validation included correct recordings in accents from all over the English-speaking world, along with one inaudible sample and one that was, inexplicably, a tinny snippet of an Elvis Presley song.
The reason most of the companies behind popular voice assistants haven’t made their internal recordings available isn’t entirely about hampering the competition, Tatman says. Since so many queries contain personal information, like internet searches run or text messages sent, it would be a privacy breach to release the data. An individual could potentially be identified by their distinctive voice.
Still, companies are willing to use the data internally: Apple has said in the past that it can retain Siri data, with user identifiers like ID numbers and email addresses stripped, for up to two years to help improve its algorithms. The company didn’t immediately respond to requests for comment on its current Siri audio retention policies.
“Your voice is recognizable,” Tatman says. “It’s considered identifiable information.”
Mozilla is also taking steps to protect user privacy as it collects its open source voice data. “We take pains to separate users from recordings such that there’s no personally identifiable information embedded in the clips themselves,” White says.
One advantage of the Mozilla dataset over some existing sets of publicly available recordings, like annotated TED talks, is that, much like sound samples from Siri or Alexa devices, its clips are recorded under conditions similar to those in which people will actually use voice recognition software.
“Basically they’re using a browser to collect the data, which means the data they collect will have various characteristics that are more representative of what their target users are going to be like,” says Rudnicky. “I’m sitting in an office, I have a particular kind of microphone which is likely one that’s to be found in a desktop environment, and so forth.”
Having a deliberately diverse set of speakers and accents, combined with the sheer expected size of the dataset, should make the collected recordings more useful than existing free collections of audio and, perhaps, even competitive with the datasets big companies keep behind closed doors.
“We’re trying to cast as wide a net as possible,” Henretty says.