“I wish I could touch you,” Theodore says, lying in bed. He’s met with silence. Rejection. Until she speaks up, tentatively. “How would you touch me?”
It’s a famously poignant scene from the movie Her, as the character Theodore is about to make vocal love to an artificial intelligence living in his ear. But according to half a dozen experts I interviewed, ranging from industrial designer Gadi Amit to the usability guru Don Norman, in-ear assistants aren’t science fiction. They’re an imminent reality.
In fact, a notable crop of discreet, wireless earbuds enabling just this idea is coming to market now. Sony recently released its first in-ear assistant, the Xperia Ear. Intel showed off a similar proof-of-concept last year. The talking, bio-monitoring Bragi Dash will be reaching early Kickstarter backers soon, while fellow startup Here has raised $17 million to compete in the smart earbud space. And then there’s Apple: Reportedly, the company is eliminating the headphone jack in future iPhones and replacing it with a pair of wireless Beats. “People have no idea how close we are to Her,” says Mark Stephen Meadows, founder of the conversational interface firm Botanic.io.
But to reach the husky “pherotones” of Scarlett Johansson, we have considerable cultural, ergonomic, and technological design problems to solve first.
Thanks in part to Amazon’s success with voice control (the company released two new versions of Echo last week), talking to a computer in your home is finally feeling downright domestic. But while Amazon may own the mindshare now, according to a study by MindMeld, only 4% of all smartphone users have used Alexa. Meanwhile, 62% of this market has tried mobile-centric voice control AIs such as Siri, Google Now, and Cortana. That’s why Echo’s early success in this space could soon be eclipsed by a new wave of personal devices from Sony, Apple, and a handful of startups, unless Amazon finds a way to sneak into your ear, too.
The product category is new: an in-ear assistant that can hear you and respond with an intimate whisper. It lets Siri or Alexa curl up next to your eardrum. It’s remarkable what experts from across the industry believe this technology could do within just a few years. Imagine a personal assistant that takes notes about your conversations, a helpful researcher who automatically checks IMDb for that actress’s name you can’t recall, a companion who listens to your problems, and even suggests psychiatric treatment, silently consulting the collective knowledge of experts in the industry.
The form factor, a discreet wireless speaker and microphone that lives in your ear, sounds sci-fi, but it’s hitting markets now. “I think the in-ear trend is the trend that could be the one for [voice] to have that ‘iPhone moment,’” says Jason Mars, assistant professor at the University of Michigan and codirector of Clarity-Lab. “With [Amazon] Echo, there are some really interesting ideas in that you can talk to your house. Now, with in-ear technology, you can imagine the assistant is always connected to you.”
The inherent intimacy of these devices will dictate how and where we use them. Anyone walking by can see what’s on your computer screen. Even our phones aren’t completely private. But even if an AI doesn’t know your deepest secrets, it’s still in your ear, the equivalent of someone holding their lips just inches away.
“With the Apple Watch, I’m addressing a machine. I’m talking to a thing on my wrist. While it’s got a Dick Tracy-like schtick to it, it is still a very specific invocation of a function,” says Mark Rolston, former CCO of Frog and founder of Argodesign. “Whereas, talking to myself, and having a ghost, angel, or devil on my shoulder, is a much more, I don’t know—it has deeper psychological implications, the idea that there is someone else in my head.”
Rolston suggests that the nature of this private interface will change your relationship to AI. You’ll naturally rely on it for more secretive topics: Whereas you might not want your Apple Watch reminding you it’s time to take your birth control, a voice that only you can hear telling you the same information becomes perfectly acceptable. And slowly, any task that would be rude or embarrassing to Google on your phone in front of someone else becomes invisibly helpful when managed by an AI in your ear.
“Imagine I’m listening to you with my right ear, and Siri is coaching me in my left,” he says. “And I’m conducting this kick-ass interview because I have a computer feeding me all sorts of parallel questions and concepts.”
At the same time, having a seemingly all-knowing voice whispering in your ear will make it very easy to set unrealistic expectations for that voice’s functional capabilities, and that poses a problem for designers. In everyday life, we constantly set realistic expectations for the people around us based on context. We don’t, for instance, ask our dry cleaner to calculate the 12.98% APR on our credit card, or our accountant to tell us a bedtime story. It’s harder to know what a reasonable expectation is for a technology as nascent as in-ear assistants, which means users may treat these platforms as omniscient gods that understand everything in all contexts rather than single pieces of software, and be sorely disappointed.
That disparity—between what an AI assistant can do for us and what we expect it to do for us—is already a problem for existing AI technology, such as Siri. “If you go too fast too soon, there will be too many corner cases that aren’t working,” says Dan Eisenhardt, GM of Headworn Division for Intel’s New Devices Group. “Like Siri, I keep giving Siri a chance, but it’s the one or two times she doesn’t work a day that I get disappointed . . . so I don’t use Siri.”
At Intel, Eisenhardt is solving this problem by creating audio-based wearables keyed to more specific contexts. During CES, Intel debuted a collaboration with Oakley called Radar, a combination of glasses and earbuds that allows runners and cyclists to ask questions such as “how far have I run?” or “what’s my heart rate?” Because this system knows your limited context, it can be specialized to understand your probable topic of discussion. This raises overall accuracy, and it allows for some small moments of magic: If you ask about your cadence (or running speed) and then follow up later with “how about now?” the system will understand that you’re still curious about cadence.
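The follow-up behavior described above can be sketched in a few lines of Python. This is only an illustration of the general idea of remembering the last topic inside a narrow context, not Intel’s or Oakley’s implementation: the `WorkoutAssistant` class, the keyword table, and the sample readings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical keyword table: phrases a runner might say, mapped to a metric.
METRIC_KEYWORDS = {
    "cadence": "cadence",
    "heart rate": "heart_rate",
    "how far": "distance",
}


@dataclass
class WorkoutAssistant:
    """Toy assistant that only understands workout questions."""

    readings: Dict[str, str]           # made-up sensor values, keyed by metric
    last_metric: Optional[str] = None  # the topic of the previous question

    def ask(self, utterance: str) -> str:
        text = utterance.lower()
        # Look for an explicit metric in the question first.
        metric = next(
            (m for phrase, m in METRIC_KEYWORDS.items() if phrase in text),
            None,
        )
        # A bare follow-up reuses whatever the user asked about last.
        if metric is None and "how about now" in text:
            metric = self.last_metric
        if metric is None:
            return "Sorry, I can only answer workout questions."
        self.last_metric = metric
        return f"{metric}: {self.readings[metric]}"


assistant = WorkoutAssistant(
    {"cadence": "172 spm", "heart_rate": "148 bpm", "distance": "3.2 miles"}
)
print(assistant.ask("What's my cadence?"))
assistant.readings["cadence"] = "176 spm"  # pretend the sensor updated
print(assistant.ask("How about now?"))
```

Because the vocabulary is this small, an ambiguous phrase such as “how about now?” has only one plausible reading, which is the trade-off Eisenhardt describes: less breadth in exchange for fewer misses.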
Another unknown about these in-ear assistants? Whether it will be a single voice speaking to us all the time, or whether individual companies will develop their own voice-based personalities. So far, third parties have been eager to adopt Amazon’s Alexa to control their apps or products. But expect this trend to change as these companies jostle to create their own voice identities.
“I can order Domino’s pizza with Alexa. I can order an Uber. But these are all brands that spend millions of dollars establishing their voice as a brand,” Rolston says. “Now we’re getting to a place where we can do rich things with hundreds to thousands of brands in the world, but those are going to be black boxed in Siri and Alexa. And they are not good representatives of the Pizza Pizza pizza parlor down the street. Right now it’s Alexa. I want the stoned pizza guy.”
“The solution may be that the voice of each brand needs to quite literally be the voice of each brand,” Rolston continues. “So if I have an app within Siri that’s a pizza company, maybe I don’t say, ‘Hey Siri.’ I say, ‘Hey Pizza Pizza.’ Because the pizza company, these guys don’t want to be Siri, they want to be them. We have to solve that.”
At Botanic, Mark Meadows has pioneered a potential solution. He calls them “avatars,” and they’re basically a way to give different chatbots you’re talking to varying personalities. These personalities can be informed by experts. So, for example, psychologists could share their collective knowledge through a single virtual psychologist—or mechanics could contribute to their own collective mechanic. Meadows has actually patented a rating system for these avatars, too, because as he explains, humans are irrationally trusting of machines—and the intimacy quotient gives them incredible power.
Meadows points to a recent McDonald’s promotion that turned Happy Meal boxes into virtual reality headsets. He imagines that the fast food franchise could use such technology to create a Ronald McDonald avatar that talks directly to your child in a brand-to-consumer conversation that you, as a parent, might not even be able to observe. “The [avatar] relationship gives brands this capacity to engage so intimately and so affectively that Ronald McDonald is not just a weird clown you see on TV,” he says, “but a very private friend whispering advice into your child’s face.”
Meadows thinks a rating system could serve as a counterbalance to that power, so he’s actually patented a “license plate” that would identify potential abusers of AI chatbots—basically a cross between being verified on Twitter and star-rated on Amazon.
For an iPhone user, Siri feels like nothing more than a software update. That’s because her real cost is invisible. It’s tucked away in North Carolina, where Apple built the world’s first $1 billion data center before deploying Siri’s AI technology. The cloud, as it turns out, is real. And it’s expensive. The hidden computing cost of these assistants helps to explain why Amazon, which runs one of the biggest server networks on the planet, is so dominant in voice intelligence. Even so, our current servers aren’t nearly enough to enable our Her future. In fact, they couldn’t scale our relatively dumb Siri of today.
“If every person on the planet today wanted to interact continually with Siri, or Cortana, all the time, we simply don’t have the scale of cycles in data centers to support that sort of load,” Mars says. “There is a certain amount of scale that can’t be realized technologically. Much like we can’t have every cell phone continually downloading videos on the planet because cell signals can’t support it. Similarly, we don’t have the computational infrastructure in place to be continuously talking to these intelligent assistants at the scale of millions and billions of people.”
Consider when Siri launched. There were constant outages. Did Apple iron out some of the kinks? Surely. But are people using Siri less than when they first got their new iPhones with Siri? Probably. Mars implies that Siri couldn’t scale today, and “with just a bit of improved quality or more users, the cost goes through the roof.” The smarter AI gets, the more processing it requires, and the solution isn’t merely hooking up a few more big server farms either. We require orders of magnitude more processing than we have now. It’s why, in Mars’s lab, he researches methods to get 10x to 100x improvements out of the way we design servers. If someone’s phone can handle more of the load, for instance, while servers deploy specialized hardware just to run singular pieces of software such as AIs, that may be possible.
So it’s unclear if it’s even possible to supply the necessary computational firepower needed to make these assistants ubiquitous. If the infrastructure can only support a small subsection of users, how will companies choose who gets the tech first? And how much smarter will those people be than the rest of us? Mars believes these rapidly improving in-ear assistants will accelerate the server bottleneck. What happens next is anyone’s guess.
Of course, server farms are just one piece of the hardware problem. Just because smart earbuds are currently on sale doesn’t mean they’re polished for prime-time use. And Gadi Amit, the founder of Silicon Valley design firm NewDealDesign, doesn’t believe that the current in-ear hardware itself is quite as good as Sony and other startups might paint it.
For one, earbuds are notoriously difficult to fit comfortably. Consider, for instance, how some people believe Apple’s earbuds are perfect, while others can’t stand to wear them for even a few seconds. Once designers remove the cords (which actually stabilize earbuds inside your ear with their own weight), technologies from Sony and Here offer no anchor to hold them in place other than your ear canal itself.
“One of the main issues is that they fall out. And they fall and will continue to fall any time you do physical activity,” Amit says. “There is no solution to that. The solution is to get out of the ear, and wrap some kind of feature around your ear.” And once you do wrap the device around the ear, all the subtlety of the device is gone—plus you’re left with a second pain point around your cartilage.
“The comfort problem is there, and it’s going to be very much a personal decision. Some people will find it okay; some people will find it unacceptable,” he says. “It will never have 100% acceptability, especially when running. It’ll hover more around a 30% or 50%.” He compares that to touch screens, which work almost all of the time for almost 100% of the population.
Another issue Amit is quick to point out is audio quality. The consumer audio market already has people chasing higher-quality over-the-ear headphones. And he says that given the slow pace of change in the last 10 years, improvements in micro-audio aren’t likely in the near future. Similarly, there are limitations with microphones and voice recognition systems: While very good, their accuracy in real-life use tends to hover close to 90%.
“That sounds like a lot, but it’s horrific. If you’re having a normal conversation, and 5% is incomprehensible, for you, it’d be horrible to listen to,” Amit says. “It’s pretty nice for some applications. But we’re not going anywhere with your ear that’s eradicating GUI in the next few years.”
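A rough back-of-the-envelope calculation shows why an accuracy figure that sounds high can still feel broken in conversation. The sketch below takes the roughly 90% figure cited above and makes two simplifying assumptions that are not from Amit: that the figure applies per word, and that errors on different words are independent. The function name is hypothetical.

```python
def clean_sentence_probability(word_accuracy: float, words: int) -> float:
    """Chance that every word in a sentence is recognized correctly,
    assuming each word is an independent trial (a simplification)."""
    return word_accuracy ** words


# With 90% per-word accuracy, a 10-word request comes through
# entirely intact only about a third of the time.
print(round(clean_sentence_probability(0.90, 10), 3))
```

Real recognizers do not fail independently word by word, so this is an illustration of how errors compound rather than a measurement of any product.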
Instead, Amit imagines the near future is a “tapestry” of interactions, of which an ear-computer or voice-control system would be just one component. While he believes that graphic user interfaces peaked in 2015, he’s doubtful that any one breakthrough—such as the iPhone’s touch screen—will cannibalize all other UX from here on out. Now we have technologies that can read our hand gestures and facial emotions, we have VR headsets that can immerse us in video content, and we have haptics that can pass along physical feelings, too.
“We have five senses, and we have to use all of them to interact with smart technology,” Amit says. “And the real difficulty, when we’re designing these projects now, is to find the right blend, and allow people enough flexibility to adapt to their comfort level. Hybridization is the challenge we have now. We have all of these technologies; how do you put the right combination together?”
As Meadows points out, when these technologies work together, they grow more syntactically accurate and emotionally aware. They can understand what we are saying and what we are feeling.
That may be why Apple recently made two acquisitions that you probably didn’t hear about: Emotient, a mood-recognition software that can read emotion from a human’s face with split-second speed, and Faceshift, a software that can record and essentially puppet an avatar’s face from a human one. Together, these acquisitions indicate that Siri could be so much more intelligent if she could not just hear you, but see you. And she might be that much more empathetic if you could see her, too.
The biggest challenge with the rise of in-ear assistants, however—larger than the data centers, the ergonomics, and even the potential corporate abuse of intimacy—will be all of the tiny, social considerations that an AI living in your ear will have to get just right.
“Now you have assistants saying your favorite Italian restaurant opened, and it may very well be that you’re delighted,” says Don Norman, director of the design lab at UC San Diego and author of The Design of Everyday Things. “But it may be when driving or crossing a street, or when I’m finally having a deep, difficult discussion with a lover. The hardest parts to get right are the social niceties, the timing, knowing when it’s appropriate or not to provide information.”
In-ear assistants will have to juggle these intuitively social moments often, given that Norman believes some of their biggest potential benefits lie in their ability to make use of 5-, 10-, or 30-second bursts in his day: those moments when he might be able to get ahead on email or check his text messages, which in aggregate could add up to a significant amount of time. But he’s also concerned about the potentially dangerous rudeness of a socially inept computer.
“I’m worried about safety. We already know people injure themselves reading their cell phone while walking. They bump into things, but at least the cell phone is under your control. You can stop whenever you want. You can force yourself,” he says. “I never read the phone when crossing the street. But if it’s an assistant, giving advice, recommending things to me, telling me things it might think are interesting, I don’t have control of when it happens, and it might in a dangerous situation.”
In his lab, Norman is studying some of these complicated social boundaries through the lens of car automation—namely, how does a car without a driver navigate through busy pedestrian intersections? “The cars have to be aggressive, or they’ll never get through a stream of pedestrians,” he says. So this requires the cars actually be programmed to local car-pedestrian culture, which in California means the car inches forward slowly, and the humans will naturally make way. Yet in Asia, it means the car must push its way through a crowd far more forcefully, more or less ramming its way through. Each works, but if you were to swap these approaches, the California car would sit at an intersection in Asia all day, while the Asian car would run Californians over.
So it’s complicated.
But for the doomsdayers who believe Her technology will lead us all to tune each other out, it’s worth noting that we already check our smartphones 150 times a day. If that hasn’t defeated humanity yet, it’s unlikely one new technology will bring society to its knees.
“I walk to my office, but it means I walk past lots of students, and I’m amazed that 90% of them are reading their phones as they walk across campus,” Norman says. “I try to understand what they’re doing, but mostly, they seem happy. They seem engaged. I don’t think they’re doing it because of the technology. I think it keeps them connected.”
Cover Photo: Mark Rolston and Hayes Urban, Argodesign