Shazam was the must-have app of the late aughts. It seemed to fulfill every promise of the post-iPhone world: With the mere tap of the screen, you could beam information to the cloud to identify a random song playing in a commercial, at the bar, or on the radio. But it required Shazam to build a huge server farm, an entire data center of its own, to handle the load.
For a tangible example of how things have changed in the decade since Shazam’s smartphone app debuted, think about this: On the Pixel 2, with a feature called Now Playing, Google has shrunk the equivalent of Shazam’s countless servers of yore to run entirely on the phone. It can match 70,000 songs, no internet required. And instead of you asking it what song is on, Now Playing listens all the time and tells you before you even ask.
What made this possible? “There’s been a deep learning revolution,” says Matt Sharifi, a software engineer at Google, who first helped bring music identification to Google’s own search bar back in 2010. “When we started working on this problem, the approaches to music recognition were different than in 2017. We did everything with deep learning and machine learning.”
The benefits of running Now Playing on the Pixel were clear: It would be faster for the user, and it would ensure more privacy, too, since the audio snippets didn’t need to be sent to the cloud. But perhaps the biggest reason Google brought Now Playing onto the phone was the simplest: That, suddenly, it could be done at all.
To the uninitiated, the first moment you see Now Playing work feels like semi-creepy omniscience, even if the feature is technically opt-in. The Pixel 2’s lock screen has the feel of an old clock radio. It shows the time along with a few minimal notifications. And then, below all that, you see the title of the song playing in your room, a moment of technological prowess that’s presented as an afterthought.
By design, Google didn’t want Now Playing to ever look like it was breaking a sweat on your behalf. “The Pixel is all about being helpful and useful, but also being kind of playful, too,” says Brandon Barbello, product manager of Now Playing. “Now Playing sits well at that juncture . . . the moment you have the question, ‘What is this song?’ you can look at your phone and the answer is already there.”
But this effortless “ambient awareness,” as Google calls it, was actually years in development. Even though it was built by the same group that created Google’s own music search technology in 2010, the team had to start from scratch to get audio matching working on a phone. That’s because in the cloud, Google has more or less unlimited computational power. Audio matching, in particular, is resource intensive, even for industrial servers. On a phone’s low-power chips, processing is limited, and every request posed to the silicon sucks away precious battery life. Plus there’s the problem of the music samples themselves. To ID a song, the phone must match it against audio fingerprints stored in a database. In the case of Now Playing, that’s 70,000 song samples that need to be squeezed onto the phone’s chips.
To build Now Playing, the first thing researchers did was create their database of 70,000 sound fingerprints, which are essentially complex snapshots of a song’s waveforms. To do that, they used a neural net that could shrink audio fingerprints into the smallest possible files that were still recognizable as unique. That’s tough because not only does the song need to be recognizable, the fingerprint has to hold enough data to be useful even against a murky sample, one that might be overpowered by environmental sounds like conversations or vacuum cleaners.
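Google doesn’t spell out how its fingerprints are shrunk, so the following is only a rough illustration of the tradeoff, not the company’s method. It assumes a fingerprint is a list of numbers between -1 and 1 and stores each one in a single byte instead of the four bytes a 32-bit float would take. The function names `quantize` and `dequantize` are made up for this sketch.

```python
from typing import List, Sequence


def quantize(fingerprint: Sequence[float]) -> bytes:
    """Store each value (assumed to lie in [-1, 1]) as one byte, 0..255."""
    out = bytearray()
    for value in fingerprint:
        clipped = max(-1.0, min(1.0, value))  # out-of-range values are clamped
        out.append(round((clipped + 1.0) / 2.0 * 255))
    return bytes(out)


def dequantize(data: bytes) -> List[float]:
    """Recover approximate values in [-1, 1] from the stored bytes."""
    return [byte / 255 * 2.0 - 1.0 for byte in data]


if __name__ == "__main__":
    original = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical fingerprint values
    packed = quantize(original)
    restored = dequantize(packed)
    print(len(packed), "bytes for", len(original), "values")
    print(max(abs(a - b) for a, b in zip(original, restored)))
```

The restored values are slightly off from the originals, which is the whole tension: the smaller the file, the less detail is left to tell one song from another through background noise.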
Those fingerprints were created on Google servers, but the same neural net was placed right on Pixel phones, too. As sound comes in, it’s filtered by the Pixel’s DSP chip, a super low-power chip that listens for hot words like “Okay Google.” The DSP listens all the time until it believes it hears music. When it does, the DSP has permission to fire up the Pixel’s powerful, power-hungry processor, which runs the Now Playing neural network. With a few seconds of a clean audio sample, the net creates a new song fingerprint on the device. Then, another algorithm tries to pair that new fingerprint with its counterpart in the 70,000-song on-phone library.
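The flow described here, a cheap always-on check that gates an expensive fingerprinting step, followed by a lookup in a local library, can be sketched in a few lines. This is a toy under stated assumptions, not Google’s code: `looks_like_music` stands in for the DSP’s detector, `embed` stands in for the neural net, fingerprints are plain lists of numbers, and the cosine-similarity comparison and the 0.9 threshold are choices made for illustration only.

```python
import math
from typing import Callable, Dict, Optional, Sequence

Fingerprint = Sequence[float]


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Similarity of two fingerprints: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: Fingerprint,
               library: Dict[str, Fingerprint],
               min_similarity: float = 0.9) -> Optional[str]:
    """Return the closest title in the library, or None if nothing is close enough."""
    best_title, best_score = None, min_similarity
    for title, stored in library.items():
        score = cosine_similarity(query, stored)
        if score >= best_score:
            best_title, best_score = title, score
    return best_title


def identify(audio: Sequence[float],
             looks_like_music: Callable[[Sequence[float]], bool],
             embed: Callable[[Sequence[float]], Fingerprint],
             library: Dict[str, Fingerprint]) -> Optional[str]:
    """Run the expensive embed-and-match step only when the cheap gate fires."""
    if not looks_like_music(audio):
        return None  # stay on the low-power path
    return best_match(embed(audio), library)


if __name__ == "__main__":
    # Hypothetical three-number fingerprints for two made-up songs.
    library = {"Song A": [1.0, 0.0, 0.0], "Song B": [0.0, 1.0, 0.0]}
    noisy_clip = [0.9, 0.1, 0.0]
    print(identify(noisy_clip, lambda a: True, lambda a: a, library))
```

The threshold matters as much as the search: without it, every clip of a song outside the library would still be labeled with whichever stored track happened to be nearest.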
All of this back-end technology is absurdly complicated, but to the user, it’s invisible labor. AI is simply lying in wait for its moment to help its user with a very specific task. The same is true across other parts of the Pixel phone that use on-device AI. Its camera uses AI to help with image processing. Its UI uses AI to detect the exact snippet of text that you might be trying to copy and paste. All of these examples raise a question: Could we have phones that run entirely on their own AI in the future, without needing to talk to the cloud at all?
“I do think there are lots more opportunities to do [AI] on-device,” says Sharifi, but he is quick to draw lines in the sand. “There are some things that should be done in the cloud because you need a big database. With the Google Assistant, you want access to the world’s knowledge. It’s quite difficult to fit that onto the device.”