Ten years ago, YouTube began to auto-caption every video uploaded to the service. With 20 hours of video uploaded every minute at the time, it was a speech-recognition task of unprecedented scale. Artificial intelligence has progressed considerably since then: Starting this year, the new version of Android, dubbed Android Q, will caption anything on your phone. That includes podcasts and videos from Facebook and Twitter. And it doesn’t need servers; it just needs your phone.
Live Caption, as it’s called, doesn’t use the cloud. No data leaves your phone, and it can even work in Airplane Mode. Much like Google’s music identification service (which identifies 70,000 songs) and Night Sight photography (which can basically see in the dark), the technology uses shrunk-down machine-learning algorithms that run right on your device.
Even though almost every service allows creators to manually caption their videos, doing so can be laborious. As a result, many videos aren’t captioned at all. Similarly, podcasts are rarely transcribed, and personal videos that friends share via text never feature closed captioning. With Live Caption, a world of otherwise inaccessible content will be made available to the deaf and hard-of-hearing community.
The project was born out of Google’s Creative Lab, which invited KR Liu, an advocate for the deaf and hard of hearing, to the office. “We didn’t have an idea. We brought her in, said let’s talk about the community, and workshopped things,” says Robert Wong, VP of Google Creative Lab. The lab has since dubbed this wider initiative Start with One. “You start with one person, don’t even try to solve their problem, but get with them, design with them,” Wong explains. “It’s not user testing. It’s more like, ‘You have a different take on the world, a different experience. What’s tough in your life? How do we solve that?’ It’s designing with, not designing for.”
What Wong describes is almost a textbook definition of inclusive design, or bringing in people who are considered edge users of a product to spearhead design and development. Early in the process, the Lab landed on a big idea: “We were thinking, if YouTube could caption every video, why couldn’t we do that for every piece of content on your phone?” says Nicole Bleuel, team lead on the project with the Creative Lab. Captioning would be wonderful for the deaf community. It would also be handy for anyone using their phone somewhere without sound.
Of course, there were reasons why Google couldn’t easily caption every piece of content inside Android. While the Pixel currently has features like call screening, which uses AI on the phone to detect and transcribe what someone on hold is saying, captioning everything on the device required the Android team to recode some fundamental bits of Android’s audio architecture.
Beyond that, there were big questions about what closed captioning on a phone would even look like. On television, where it began in the 1970s, closed captioning is pretty straightforward. There’s only one constant video stream, and it takes up your whole screen, so placing the captions near the bottom generally works. On mobile phones, every app interface is a little bit different. Where could these captions float without getting in the way?
At first, the team mocked up something akin to Chat Head, a late UI from Facebook that is used in some Android functions. It’s a floating button that you could activate in the settings and tap when you needed to convert audio to text. The team shared the idea with designers who were deaf and hard of hearing, and they were remarkably receptive to it. “Even though I don’t consider this to be an accessibility feature, I’d rather start by building it for the people who need and want it the most,” says Bleuel. “That’s how you get to the point to make something universally useful and accessible.”
From that feedback, the design morphed into a relatively simple dark gray box with white text. You activate it not inside accessibility settings or some deep menu, but from a menu that appears when you tap the volume buttons of your phone. Once it’s on, it stays on until you turn it back off. “And any time it detects an audio stream on your phone, a video on a social network–or someone sent a voice message, or a video in Google Photos, it will pull up a caption box and start captioning that in real time,” says Bleuel.
Together, these choices make the feature easy to turn on and off, and never more than a tap or two away.
As for the text window itself, you can drag and drop it anywhere on the screen at any time, basically making the optimal interface for yourself. You can also adjust the box’s size, the font size, and the color for legibility. “You can imagine with a podcast, there isn’t anything on the screen you’re looking at so you just want more captions,” says Bleuel. Indeed, with Live Caption, it seems like you could read a podcast much like an e-book.
Live Caption will launch on the Google Pixel this fall, and reach the wider Android Q ecosystem some time in the future. A caveat: Since the speech recognition engine runs on your phone, the accuracy will be far from perfect. The cloud has effectively limitless power and storage to support a precise AI. Phones, on the other hand, have mobile chips that sip on batteries to operate, so the AI can’t be as powerful. Google actually shrank its cloud model from 2GB to 80MB for this project. But local AI is ultimately better for consumers: It works regardless of your connection, it’s faster to process, it saves your data plan from sending media files back and forth, and it protects your personal privacy.
Bleuel cautions that someone who is deaf shouldn’t rely upon it entirely. It also won’t transcribe phone calls, and it can only transcribe English for now. Even so, with more than 2 billion Android users around the world, Live Caption is the sort of technology that could make life anywhere from a little to a lot better for the 466 million people around the world who have hearing loss of some sort, along with all of us who just want to watch a video on mute. It’s a testament to the sort of inventiveness that can happen when we look to people with disabilities, and other potential edge cases, as the most important users of a product.
“I feel like it’s daunting sometimes at any of the big tech companies, because you are always thinking about, ‘How will I scale this?’ ‘What’s my idea that billions of people will like?’” says Wong. “That’s one way of looking at things. But when we start the other way, there’s new ideas that pop up that we couldn’t imagine.”