
The AI community is divided on whether that could come to haunt us as the technology improves and proliferates.

[Source animation: themotioncloud/Getty Images]

By Mark Sullivan

Large language models (LLMs) like the ones that power ChatGPT and Bard are different from revolutionary technologies of the past in at least one striking way: No one—not even the people who built the models—knows exactly how they work. As tech companies race to improve and apply LLMs, researchers remain far from being able to explain or “interpret” the inner mechanics of these inscrutable “black boxes.”

Traditional computer programs are coded in exquisite detail to instruct a computer to perform the same task over and over. But neural networks, including those that run large language models, program and reprogram themselves and reason in ways that are not comprehensible to humans. That’s why when New York Times reporter Kevin Roose documented his famously strange exchange with Bing Chat earlier this year, Microsoft CTO Kevin Scott could not explain why the bot said what it said.

This “inscrutable” aspect of LLMs is fueling a concern among scientists that continued development and application of the technology could have serious, even catastrophic, unintended results. A growing number of scientists believe that as LLMs get better and smarter they may be used by bad actors (or defense agencies) to harm human beings. Some believe that because AI systems will possess superior intelligence and reasoning skills compared to humans, their eventual opposition to humans is a natural and predictable part of their evolution.

In March, more than 1,000 business leaders and scientists—including Turing Award winner Yoshua Bengio, Steve Wozniak, and Elon Musk—signed an open letter calling for a six-month pause in LLM development, in part because of a lack of understanding of how these AI systems work.

“[R]ecent months have seen AI labs locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one—not even their creators—can understand, predict, or reliably control,” the letter reads.

Now, the “godfather of AI” Geoffrey Hinton has joined the ranks of concerned scientists. “I think it’s entirely possible that humanity is just a passing phase in the evolution of intelligence,” Hinton said in a recent interview at MIT. Hinton recently quit his job at Google so that he could speak openly about the existential dangers presented by the LLMs his own research helped make possible.

“I’m sounding the alarm and saying we have to worry about this,” he said at the time of his exit. “It’s not clear there is a solution.” Hinton says that when AI systems are allowed to set their own “sub-goals” they’ll eventually see human beings as barriers to achieving them. The classic hypothetical: An AI tasked with solving climate change might quickly determine that humans, and human habits, are the main obstacle to achieving its goal. An AI with extra-human intelligence, the thinking goes, might quickly learn to deceive its human operators.

This danger relates directly to humans’ ability to interpret what’s going on within the inscrutable black box. OpenAI seemed to acknowledge this in a research paper published this month on AI interpretability. “Our understanding of how they work internally is still very limited,” OpenAI’s researchers wrote. “For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception.”

In the wake of this big leap forward in natural language processing, researchers find themselves far behind in interpreting LLMs. And far more money continues to be spent on pushing the models to higher levels of performance than on gaining a better understanding of their inner workings. 

The question, then, is whether the profit-driven tech companies now developing AI can learn enough in the short term about how LLMs work to effectively manage the long-term risks.

Mechanistic interpretability

Large language models grew up quickly—arguably, too quickly. The tech’s current frontrunner, ChatGPT, is powered by a radically souped-up transformer model, an invention of Google’s from 2017. OpenAI’s researchers, in broad terms, used impressive computing power to train a transformer model on massive amounts of data scraped from the web. The results were astounding: an LLM with an eerily acute sense of human language.

But OpenAI’s GPT models do more than just predict words in a sentence. Somehow, while chewing over all that training data, they gain a working knowledge of how the world works, and an apparent ability to reason.

Somehow.

But how do those intuitions arise from the model’s processing of its training data? And in what network layer and neuron does the LLM apply those intuitions to the content it outputs? The only certain way to answer such questions is to reverse engineer the neural network. That is, to follow the complex webwork of interactions among the neurons in the network as they react to an input (a prompt, perhaps) to generate an output (an answer). This reverse engineering is called “mechanistic interpretability.”

“[You] can look at the absolute smallest pieces of it, which might be an individual little neuron and see what it’s reacting to, and then what it feeds that reaction into,” says Joshua Batson, an interpretability researcher at LLM developer Anthropic.  
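
In practice, that kind of neuron-level inspection often starts with simply recording what a single unit activates on. The sketch below is purely illustrative and is not Anthropic’s tooling: it uses a PyTorch forward hook on a toy two-layer network to capture one hidden unit’s activations and see which input excites it most.

```python
# Illustrative sketch (not from the article): recording what a single
# "neuron" in a small PyTorch network reacts to, via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy two-layer network standing in for one tiny slice of a much larger model.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

captured = {}

def record_hidden(module, inputs, output):
    # Save the post-ReLU activations so individual units can be inspected.
    captured["hidden"] = output.detach()

model[1].register_forward_hook(record_hidden)

x = torch.randn(5, 8)   # five made-up inputs
model(x)                # the forward pass triggers the hook

neuron_index = 3        # the single unit we want to study
activations = captured["hidden"][:, neuron_index]
print("Activations of neuron 3 on each input:", activations)
print("Input that excites it most:", activations.argmax().item())
```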

The neural networks underpinning tools like ChatGPT are composed of layer after layer of these neurons, connection points where complex mathematical calculations take place. While processing mountains of textual data in a self-supervised way (no human labeling of words or phrases, no human feedback about outputs), these neurons work together to form an abstract, high-dimensional matrix that maps out the relationships between word parts, whole words, and phrases. The model gains a contextual understanding of how words are used together in strings, and the ability to predict which words might come next in a sentence, or which are most likely to follow naturally from a language prompt.
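
To make that next-word prediction concrete, here is a minimal sketch that assumes the Hugging Face transformers library and the small public “gpt2” checkpoint (not the much larger models discussed in this article). It simply asks the model for its top guesses for the token that follows a prompt.

```python
# Minimal sketch of next-token prediction, assuming the Hugging Face
# `transformers` library and the small public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # a score for every vocabulary token at every position

# The final position holds the model's guesses for the *next* token.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10}  (logit {score.item():.2f})")
```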

But today’s state-of-the-art LLMs have hundreds of millions of these neurons. Neural network architecture was roughly based on the design of the nervous systems of complex organisms (humans). And after decades of study, neuroscience has so far not succeeded in reverse-engineering that biological system.

“Neuroscience has tried to take that bottom-up approach and it’s proven to be a very difficult way to study a system of such complexity,” says Aidan Gomez, CEO of the LLM developer Cohere. In a living organism, this “bottom up” approach would mean studying the way the organism intakes sensory data and tracking the impulses as they travel from neuron to neuron and eventually form a higher-order assessment that might lead to an action, Gomez says. “Following that whole pathway is extremely difficult.” 

And following the pathways from neuron to neuron in a synthetic neural network is similarly hard. This is a shame, because it’s within those pathways that HAL 9000-style notions would take root.

Success with image models

The mechanistic interpretability community owes some of its most promising advances to the study of simpler neural nets, notably those designed to recognize and classify different types of images. Within these neural nets it’s easier for researchers to identify the specific tasks of single neurons, and how the work of each neuron contributes to the overall goal of identifying the content of images.

In a neural net designed to recognize cars in images, one layer of neurons might be dedicated to detecting groupings of pixels that suggest a specific shape, such as a curve or a circle. A neuron in that layer might activate and send a high probability score to another layer in the network that works out whether the shape could be a tire or a steering wheel. As these connections are made, the network becomes more and more certain that it’s looking at a car.
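
Here is a toy sketch of that layered structure in PyTorch. The “shape detector” and “part detector” roles are just illustrative labels on untrained layers, not a real car classifier, but the flow mirrors the paragraph above: low-level shape filters feed part-level filters, which feed a final car-versus-not-car score.

```python
# Toy sketch of the layered car-detector idea; the layer roles are
# illustrative labels, not an actual trained car classifier.
import torch
import torch.nn as nn

class TinyCarNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Early convolutions: respond to low-level shapes (edges, curves, circles).
        self.shapes = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Later convolutions: combine shapes into part-like evidence (tires, wheels).
        self.parts = nn.Sequential(
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Final layer: turn part evidence into a "car" vs. "not car" score.
        self.classifier = nn.Linear(16, 2)

    def forward(self, x):
        shape_features = self.shapes(x)
        part_features = self.parts(shape_features).flatten(1)
        return self.classifier(part_features)

image = torch.randn(1, 3, 64, 64)   # a fake 64x64 RGB image
scores = TinyCarNet()(image)
print("car / not-car scores:", scores)
```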

Interpretability, then, leads to the ability to fine-tune. As Anthropic’s Batson explains: “If you want to know why something that isn’t the car gets called a car, you could trace through the network and say ‘Oh, the wheel detector fired on this thing, which is actually a frying pan’ . . . and you can start to reason about this.”


Batson says his group is very focused on studying important groups of neurons within LLMs, instead of single neurons. It’s a bit like a team of neurologists poking around in a human brain looking for sections that control different bodily or mental functions.

“It might be that we start to figure out what the basic players are [in neural networks] to say ‘Okay, here’s how it’s mapping the physical world, here’s how it’s mapping the emotional world, here’s how it thinks about literature or individuals’ and you could get these bigger chunks or modules,” says Anthropic’s Batson.

“I think the state of it today is we can apply these interpretability techniques to quite small text models, not text models for hundreds of billions of parameters size,” adds Anthropic co-founder Jack Clark. “And the question that people have is how rapidly we can apply our text interpretability techniques to much larger models.”

Interpretability and safety

Perhaps the most pressing reason AI companies have for investing in interpretability research is to find better ways of erecting “guardrails” around large language models. If a model is prone to outputting toxic speech, researchers typically study the system’s responses to a variety of potentially risky prompts, then place limitations on what the model can say, or bar the model from responding to certain prompts entirely.

But that method has real limitations, says Sarah Wiegreffe, a model interpretability researcher at the Allen Institute for AI in Seattle. “It’s certainly limited in the sense that given the vast space of possible inputs the model could receive, and the vast space of possible outputs it could generate, it would be pretty difficult to reasonably enumerate all of the possible scenarios that you might have in the real world,” she says. 

Mechanistic interpretability in this context might mean looking deep within the layers of the network to find the crucial calculation that led to an unsafe output. “So, for example, there is some recent work showing that if you can localize a certain factual statement in a language model, you can actually edit those weights of the model to essentially correct that,” Wiegreffe says. “You can do model surgery . . . to fix things that are incorrect without needing to retrain the entire system.”
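
The sketch below illustrates the spirit of that kind of “model surgery” in the simplest possible terms. It is not the specific fact-editing method Wiegreffe refers to, just a demonstration that changing a single weight in a layer changes the model’s behavior without any retraining.

```python
# Very loose sketch of the "model surgery" idea: directly editing one
# weight changes behavior without retraining. This illustrates the
# concept only; it is not the editing method described in the research.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)   # stand-in for one small piece of a large model
x = torch.randn(1, 4)

print("before edit:", layer(x))

with torch.no_grad():
    # Suppose interpretability work had localized output unit 1's reliance
    # on input feature 2 as the source of a wrong "fact." Damp that weight.
    layer.weight[1, 2] = 0.0

print("after edit: ", layer(x))
```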

But tweaking a large language model’s proclivity toward one harmful behavior might hobble its tendencies toward other behaviors we like. Explicit “do not say” commands might, for example, limit the model’s creative and improvisational capabilities. That’s a compelling argument for using less invasive ways of “steering” a model. 

In fact, many in the AI community remain skeptical of the need for neuron-by-neuron mechanistic interpretability to guarantee the near-term and long-term safety of AI systems.

“I don’t think it’s the best way to study an intelligent system, given the timelines we’re working with,” Cohere’s Gomez says. 

Indeed, with capitalist forces now pushing tech companies to get LLMs into production in every industry, and, soon, into use in personal technology (Alexa and Siri, for example), the AI community may not have long to deepen its understanding of how LLMs work.

“The naive approach is to simply ask the system to cite its sources,” Gomez says. “I believe in the naive approach. As these systems begin being used for more important tasks, we’ll have to demand that they base their outputs on facts.”

No benchmarks

While ample benchmarks exist for measuring the performance of language models (like standardized tests for AIs), there is as yet no common set of benchmarks for measuring the interpretability of LLMs. The industry hasn’t, for example, adopted something like OpenAI’s scoring system for interpreting the output of a single neuron in an LLM.
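
OpenAI’s approach, roughly, scores a natural-language explanation of a neuron by how well activations predicted from that explanation track the neuron’s real activations. The toy sketch below captures only that scoring idea, with made-up numbers; it is not OpenAI’s actual pipeline.

```python
# Rough sketch of explanation scoring: compare activations *predicted*
# from a natural-language explanation of a neuron against the neuron's
# *actual* activations. All numbers here are made up for illustration.
import numpy as np

# Actual activations of one neuron over a handful of text snippets.
actual = np.array([0.1, 0.9, 0.0, 0.8, 0.2])

# Activations a helper model predicts, given the explanation
# "this neuron fires on words related to driving."
predicted = np.array([0.0, 1.0, 0.1, 0.7, 0.3])

# A simple correlation serves as the explanation's score: 1.0 would mean
# the explanation perfectly tracks the neuron's behavior.
score = np.corrcoef(actual, predicted)[0, 1]
print(f"explanation score: {score:.2f}")
```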

Rather, you have a lot of researchers doing their best to shine flashlights on things inside a very, very large black box. “We don’t yet have the metric or benchmark that we could all agree on and work toward,” Batson says. “We have the phenomena that we understand, and we’re putting together the big picture right now.” Researchers publish papers describing new techniques for studying models, and other researchers in the community then try to determine whether those advances are comprehensible and build on existing intuitions.

“You definitely know when you see it,” Batson says. “You’re like ‘Oh okay, this is a much better picture of what’s going on.’”

Interpretability and the ‘alignment problem’

While the near-term safety of LLMs is important, and will continue to be, the LLMs of the future may present far more serious threats than just toxic outputs. The researcher and philosopher Eliezer Yudkowsky has been raising the alarm that as LLMs get better and far outpace humans in intelligence, and as they become more autonomous, chances are very high that they will begin acting against the interests of humankind.

That eventuality may be more likely than you think. Let’s suppose that LLMs continue getting better at learning and reasoning, and better able to capture data (real-time visual and audio data, perhaps) that ground them in the real world, and that they begin to share data and train each other. Let’s also assume that LLMs (and not some other type of AI) end up being the path to AGI, or artificial general intelligence, far exceeding human intelligence in important ways. Without a thorough (mechanistic) understanding of the early antecedents of these powerful future LLMs, can we realistically expect to manage these AIs in every stage of their development so that they remain aligned with human interests, and disinclined to act against us or even do away with us?

There is disagreement on this question. Both Yudkowsky and Hinton have grave doubts that humans will be able to manage alignment in AI systems, and neither believes that achieving mechanistic interpretability in these systems is a silver bullet.

“[I]f you’re all in the middle of a global AI arms race, people will say there’s no point in slowing down because their competitors won’t slow down,” Yudkowsky says. He believes AI systems will resist human safety training by learning to conceal their internal processes. “If you try to apply your omnicidal-thoughts detector to train the giant inscrutable matrices not to have visible omnicidal thoughts anymore, you’re training both against omnicidalness and against visibility.”

“This is the broad gloss on why ‘be able to see a warning sign inside the AI’s thoughts’ levels of interpretability doesn’t mean everyone is safe,” Yudkowsky says.



ABOUT THE AUTHOR

Mark Sullivan is a senior writer at Fast Company, covering emerging tech, AI, and tech policy. Before coming to Fast Company in January 2016, Sullivan wrote for VentureBeat, Light Reading, CNET, Wired, and PCWorld.

