Imagine going to your doctor to learn you have cancer. You sit down in front of her desk, and she shuffles through some scans that you took the previous week. Finally, she brings up a sheet of paper, looks at it, then looks up and says, “It’s a positive. I’m sorry.” You ask how she can be so sure. You’re thinking: Maybe there’s been a mistake. But instead of the doctor offering any kind of explanation or rationale–any kind of summary of the evidence she’s gathered–she simply says, “Just trust me. It’s cancer.”
Would you trust that doctor’s diagnosis? Of course not.
This example is purely imaginary, but it gets at a problem that sits just over the horizon. Increasingly, artificial intelligence is proving itself to be as good as, or better than, humans at a range of tasks, from reading mammograms to deciding who should get a mortgage. We are well on our way to forming new kinds of partnerships, in which humans have to trust the recommendations of machines. But unless those machines are capable of explaining why their recommendations are the right ones, we won’t trust them.
We hear all the time about AI’s massive promise: everything from sifting through data to find new drugs to making it possible to diagnose cancer at the tap of a button. But it’s not enough that a machine could help us do better in our work. For us to accept the advice a machine gives, we have to trust the machine. Without that trust, the promise of AI can never be realized. This isn’t a challenge just for the computer scientists who build these algorithms. It’s a challenge for the designers, design researchers, and cognitive psychologists building the ways we interact with AI. In fact, making AI comprehensible may be the greatest challenge in AI design.
Not Just One Black Box
If you read about AI at all, you might have read about something called the “black box problem,” which I recently covered extensively. It’s mentioned most often in the context of deep neural nets (DNNs)–the sort of machine learning algorithms that are used for image recognition, natural-language processing, and many other tasks that involve both massive data sets and fuzzy, hard-to-define concepts such as the essence of a word or the characteristics that define a cat. With those massive data sets, and the massive number of correlations that a deep neural net uses to reach a single accurate conclusion, it can be extremely hard to know how a DNN did what it did. Hence, the inner workings of DNNs seem locked away in a black box that we cannot peer into.
The black box as described with DNNs is but one flavor of AI inscrutability. Sometimes, as with DNNs, it may be that a machine has made so many minute inferences that figuring out a single “why” might be impossible. Other times, it may simply be that a data set is hiding its own inherent biases. For example, if you trained an AI to crunch parole data, it might “outperform” a human in predicting rates of recidivism. But it would also almost certainly recreate the racial biases inherent in the way courts today offer parole. This issue of deep-seated bias has become the central concern among researchers, such as Kate Crawford, who are calling for industry-wide ethics in AI.
In other cases, the opacity of machine learning techniques may simply be due to the fact that the statistical ways algorithms reach conclusions aren’t easily understood by people, given how poorly humans deal with statistics. “Imagine that a machine tells you that the reason it concluded something was that it combined 10 observations that were present, while another 10 were absent, for an 80% certainty. That’s considered a top-notch explanation,” says Eric Horvitz, Microsoft Research’s managing director and a long-time leader in the AI community. “My reaction with a wry look is I’m not sure if that’s really a good explanation. I haven’t seen good explanations anywhere, even in non-neural nets.” In other words, what counts as a good explanation to the experts who invent a system might not suffice for the users of that system, who don’t know all its ins and outs. Horvitz points out that to create machines that can truly explain themselves, we’ll have to better understand just what humans regard as a good explanation. That’s a hot topic of research in a few universities, including the University of Oregon and Carnegie Mellon. But it’s also an issue that non-academics are running into as they invent software that makes AI user-friendly.
The Uses Of Mystery
The term “black box,” used to refer to a mysterious, closed technology, likely originated with World War II pilots in Britain’s Royal Air Force. The first airplane radar instruments were installed in boxes painted black to prevent light interference; they allowed pilots to target bombs through clouds. To the swashbuckling fliers who had once been guided by only their senses and a few gauges, the new radars were like magic.
Arthur C. Clarke happened to be an RAF radar operator during World War II. Twenty years later, he would go on to write the script for Stanley Kubrick’s 2001, in which a black obelisk, lacking motive or origin, brings tool-making to mankind’s ancestors. Mankind eventually parlays those talents into making the ultimate tool, an artificial intelligence. In addition to making the black box an enduring image in popular imagination, 2001 also articulated something essential about technology’s march. We’re drawn to black boxes by the promise of being awed by how great things can become. They make oracles out of technology. Even today, the greatest compliment that we give to technology is that it “works like magic.” Of course, magic isn’t exactly a good grounding for trust. The things we cannot understand are liable to disappoint or terrify us. So even if black boxes draw us in, mystery isn’t so appealing in practice. Perhaps no other company is dealing with this delicate balance quite as much as IBM.
More than any other company, IBM has defined the common imagination about AI, via its multibillion-dollar Watson campaign. In dozens of commercials, Watson acts like a self-deprecating smarty-pants who amazes everyone from doctors to golf pros with astounding insights while offering no explanation for them beyond a blinking logo. Watson is a black box as brand. This promise of human-like capabilities in the form of an obelisk has drawn fire from some, who argue that Watson is marketing to people’s imaginations about AI rather than showcasing what it can actually do.
In reality, Watson is both humbler and more powerful than you might expect–it’s not one product, but many, tailored to different industries. Watson Oncology may be the company’s marquee product. It consists of a simple web app that looks like nothing so much as a very fancy, user-friendly database. Which, in a way, it is. For 10 different types of cancer, doctors can enter a number of patient attributes, then Watson spits out treatment recommendations, grouped from “Recommended” to “Not recommended.” Those recommendations were trained from a data set created by a team of cancer doctors at Memorial Sloan Kettering. Watson Oncology represents the cumulative wisdom of some of the world’s leading physicians. To develop Watson Oncology, IBM does constant design research, covering every new feature or improvement. I recently sat in on one such session, in which the design team was verifying that a refreshed interface made intuitive sense to new users.
The woman leading the session asked a doctor who’d never used Watson before to explain what each part of the interface did, and how it should behave. Then, as the session was winding down, she asked, “If Watson were a colleague, how would you describe it?” Questions that probe for underlying metaphors are common in design research. The thinking goes, the right metaphor can illuminate just how far along people think a product is–and, perhaps, what it should become. In describing IBM Watson to me, its inventors were at pains to call it something like a learned colleague, or a trusted advisor. That’s what the researchers would be hoping to hear–it would mean that the product was properly expressing its own power. But this doctor, when asked the question, wasn’t biting. “I’d say Watson is the recipe,” he said, explaining that the art of cooking lies in leaving the recipe behind.
“That was a little deflating,” Lillian Coryn, Watson’s senior design manager, admitted afterwards. A doctor had demoted Watson from a smart second opinion to a rote set of instructions. If the doctor saw Watson in such mundane terms, it was hard to imagine him becoming a devoted user. I’ve seen variations of this problem while reporting on AI: systems for making parole recommendations, for example, that mostly told people what they thought they already knew, and thus didn’t seem obviously useful to humans even while outperforming them. Humans, it seems, need to be sold on the benefits–while the benefits themselves need to be tuned so that they’re just alluring enough to spur curiosity.
As I thought about the doctor’s recipe metaphor, I started to wonder if the design might need to be tweaked to better show off what Watson could do that no doctor could. Perhaps the computer could be crafted to show off just how many thousands of studies it had assembled to reach its recommendations? The deeper problem seemed to be in crafting the computer in such a way that the human was naturally drawn to its expertise. Doing so isn’t just a matter of designing interfaces in ways that we already know. Rather, the greater challenge will be in designing machines that augment our own understanding without threatening it–while simultaneously showing off just how much a computer knows that we cannot.