Last week Siri, Apple's voice-commanded digital assistant, got an upgrade that gave her many new powers. But developments in voice recognition tech across all kinds of devices mean that your next-next-gen smartphone will easily surpass Siri's passive listening skills and turn it, and systems like it, into chat-happy, always-on life mates.
Nuance is the company behind many of the innovations in voice recognition, and may or may not have played a part in the latest iteration of Siri, which grew out of SRI International. The recent advances in voice tech are partly due to developments in the core technology of voice recognition and partly due to Nuance's clever choice to make a database of millions of bits of real speech from its users, which it can use to train and optimize its algorithms—even to the point of better understanding different dialects. This week the company's chief technology officer Vlad Sejnoha revealed that Nuance has been working with chip manufacturers to give smartphones an amazing new voice-command power. Nuance wants to give phones the power to listen to you when they're otherwise "asleep."
Think about how often your phone is just lying there, mute and dark-screened on your desk or bedside cabinet. All that billion-dollar tech (at least in terms of R&D) is sitting there idle, passive ... wasted. It'll wake up, sure, when you get an email or someone rings it, but to interact with it yourself you have to wake it up and then tap at its screen or engage with its voice command system—this may or may not be Siri, or Samsung's S-Voice or Google's own rumored Majel project.
The different scenario Nuance's CTO imagines is that in the future your phone will always be listening. Isn't that a bit stalker-y?
So your phone is quietly sitting there, barely sipping at battery power so it doesn't drain that precious resource, until you ask it when your next meeting is, or if it can text your partner, or if it's going to rain later. The benefits are obvious, says Sejnoha: there's less of a barrier to using it because you don't have to turn on the device, and indeed if a strong mic is involved, you won't even have to be near it. Nuance is also working on making its system better at isolating a user's voice from background chatter, so you could drop it into conversations with your friends, throwing a question at your smartphone while talking to other people in a noisy environment.
This isn't far off, Sejnoha explained to Fast Company in an email: "It's hard to predict exactly, but probably somewhere on the order of one to two years." While some current speech systems do allow devices to passively listen for commands, "Future implementations will allow the user to speak a single utterance that will both wake the device up and convey the desired intent, and do so in a manner that conserves power." Quite apart from super-charging the already sizable emotional attachment we have with our smartphones, this is going to have an enormous impact on daily life.
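To make the "single utterance" idea concrete, here is a minimal sketch of how one sentence could both wake a device and carry a request. It is an illustration only, not Nuance's method: it works on text that has already been transcribed, and the wake phrase "hey phone", the `handle_utterance` function, and the keyword table are all hypothetical names invented for this example. The hard part Sejnoha describes, spotting the phrase in raw audio while conserving power, is not shown.

```python
import re
from typing import Optional

# Hypothetical wake phrase; the article does not name one.
WAKE_PHRASE = "hey phone"

# Hypothetical keyword-to-intent table, loosely based on the article's
# examples (next meeting, texting a partner, rain later).
INTENT_KEYWORDS = {
    "next_meeting": ("meeting",),
    "send_text": ("text",),
    "weather": ("rain", "weather"),
}


def handle_utterance(transcript: str) -> Optional[str]:
    """Return an intent name if the utterance opens with the wake phrase.

    Returns None when the wake phrase is absent, meaning the device
    should stay asleep and ignore what it heard.
    """
    # Lowercase the transcript and split it into words, dropping punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    wake_words = WAKE_PHRASE.split()

    # Without the wake phrase at the start, do nothing.
    if words[:len(wake_words)] != wake_words:
        return None

    # Everything after the wake phrase is treated as the request itself.
    request = words[len(wake_words):]
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in request for keyword in keywords):
            return intent
    return "unknown"


if __name__ == "__main__":
    print(handle_utterance("Hey phone, when is my next meeting?"))
    print(handle_utterance("Is it going to rain later?"))
```

The point of the sketch is that the user never issues a separate "wake up" step: the same sentence is checked for the wake phrase and then mined for the request, and anything said without the phrase is dropped.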
Siri 3.0 and her kin will likely be listening all the time, which means you'll use them more, and they will probably volunteer information a bit more often. Nuance's CTO, sensitive to this, added a rider to his discussion about the amazing potential of this tech: it shouldn't be used to make "creepy" apps. What this means is yet to be determined, but the fact he mentioned Microsoft's hated Clippy system is telling.
Nuance is nervous that your phone will sit quietly on the sideboard until it hears you talking about, say, the birds in the garden, at which point it'll chirp up with something like "Hey, it's been a while since you played Angry Birds! Why not play a bit now?" What app maker wouldn't be tempted by a power like that? Of course there's scope here for apps that are far creepier: What if parents load their kids' phones with apps that listen for keywords like sex or alcohol? Or spying apps that secretly snap photos when prompted by keywords?
There's another industry this new approach will benefit: advertising. Picture your phone sitting there quietly listening to what you say, and noticing you mentioned Coca-Cola earlier in the day. Later, while you're shopping in the supermarket, the machine pings you about a Pepsi sale. So far, advertisers haven't penetrated this deeply. But just this week, Linux users revolted when Ubuntu said shopping suggestions would be served up from Amazon based on the search terms that users are putting into their PCs running Unity Dash's "Home Lens" universal search feature. Could voice-activated ads be far behind?
Sejnoha says he thinks "there are application contexts where judicious advertising not only does not interfere with the user's activity or feel intrusive, but can actually provide value from the user's perspective."
By way of example he points to the organic relationship between searches and the products they reveal. But he doesn't think it'll happen any time soon, or that it'll necessarily work in the creepiest way we can imagine—to Sejnoha it seems "extremely unlikely that anyone would (or should!) tolerate advertising based on some sort of eavesdropping."
But that doesn't mean it won't happen. A look through the U.S. patents database, in fact, turns up inventions such as patent 20100086107, which describes exactly this sort of voice-powered ad serving system that's also sensitive to location. As a similar example of how sound recognition is already driving adverts, music-recognizer Shazam recently revealed it could be integrated into all TV shows in the U.S., for social sharing purposes at first, but ad-related spots are almost certainly in Shazam's future. Microsoft has already developed ad tech that listens to the web video you're playing and drives adverts based on the words it recognizes, analogous to a future system that would listen to what you're saying.
Audio eavesdropping ad tech may be a while off then, but what's certain is that the ubiquitous always-on voice recognition tech that will power this future is already on the way.
There's going to be an inevitable privacy backlash if or when it arrives—an exaggerated version of the one that blew up around the fact that today's Siri tech sends anonymized voice data to Apple. When similar tech is popularized by Google as well, will users be concerned that along with the hundreds of other bits of data Google tracks about them it'll add voice samples? What happens when the authorities decide to subpoena the voice files for a smartphone user implicated in a crime? How anonymous is "anonymized"? These issues will influence user attitude—at least at first. For proof, consider the mess Facebook's got itself into in Europe over face recognition technology.
On this matter Sejnoha reassuringly noted that "any application that has an ability to listen for input from the user in a way that doesn't require an explicit user action each time (such as the pressing of a push-to-talk button), must offer an 'opt-out'—it goes without saying that users must be able to control when the app is allowed to listen and when it's not."
Goes without saying. We get it.
Update: As if by magic, a very comprehensive Apple patent relating to an "Intelligent Automated Assistant" has popped up at the USPTO. It's more or less the guts of Siri or, as the formal patent language puts it, "in various embodiments, the intelligent automated assistant engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions." In addition, Apple puts an interesting spin on the matter in the introduction to the patent. It suggests that apps are getting ever more sophisticated and users have to learn to interact with the UI of each one, a task that is onerous, and perhaps problematic for a certain class of user. Speech, Apple thinks, is the unifying interface that can cut through this problem.