March 26, 2024
By Saul Perlmutter, John Campbell, and Robert MacCoun

Expert overconfidence can have grave consequences. After the Challenger space shuttle exploded in 1986, an investigation found that NASA had officially predicted that there would be one failure in 100,000 launches—about the same odds we considered for getting hit by a car crossing Hearst Ave in the last chapter. Yet other evidence revealed that NASA had strong evidence against this rosy projection. NASA experts had, only five years prior, noted in a report that the historical failure rate for solid‑fuel rockets—which were used to boost the Challenger into orbit—was one in every 57 firings. Since the space shuttle launches used two such rockets every time, a failure could be expected once in every 28 or 29 launches, assuming the historical failure rate continued unchanged. Challenger was the twenty‑fifth shuttle launch, so it was almost right on schedule to fail. So something happened within the organization to transform a pretty pessimistic risk estimate into a much more optimistic—and, apparently, unrealistic—one.
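The arithmetic behind the "once in every 28 or 29 launches" figure can be checked in a few lines. This is only a sketch: it assumes the two boosters fail independently, each at the historical rate of one in 57, and the function and variable names are ours, not NASA's.

```python
# Historical per-firing failure rate for solid-fuel rockets quoted in the text.
P_ROCKET_FAILURE = 1 / 57
ROCKETS_PER_LAUNCH = 2


def launch_failure_probability(p_rocket: float, n_rockets: int) -> float:
    """Probability that at least one of n rockets fails (assumes independence)."""
    return 1 - (1 - p_rocket) ** n_rockets


p_launch = launch_failure_probability(P_ROCKET_FAILURE, ROCKETS_PER_LAUNCH)
launches_per_failure = 1 / p_launch

# Chance that the first 24 launches all succeed at this per-launch rate.
p_24_clean = (1 - p_launch) ** 24

print(f"failure probability per launch: {p_launch:.4f}")
print(f"about one failure per {launches_per_failure:.1f} launches")
print(f"chance of 24 clean launches in a row: {p_24_clean:.2f}")
```

Under these assumptions the per-launch failure probability is roughly 3.5%, which works out to one failure in a little under 29 launches. That is a very long way from one in 100,000.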

In the mid‑twentieth century, Lev Landau, a theoretical physicist, offered a pithy description of expert overconfidence among scientists: “Cosmologists are often in error, but never in doubt.” This is perhaps an overstatement; after all, scientists do sometimes retract their findings. For example, in 2010, twenty‑three experts published an open letter to the Federal Reserve’s then‑chairman, Ben Bernanke, arguing that his quantitative easing policies would produce “currency debasement and inflation.” In 2014, when it was clear that this prediction was wrong, two journalists contacted the twenty‑three signatories. Fourteen refused to comment, but those who did said their views hadn’t changed. Having earlier mocked these experts, Paul Krugman, a New York Times columnist (and Nobel Prize–winning economist), was more open in early 2022 in acknowledging his own error in dismissing predictions that President Biden’s 2021 stimulus package would produce high inflation. “I don’t want to be like those guys. So I’m currently spending a fair bit of time trying to understand why my relaxed view of inflation early last year has been refuted by events.” However, he did go on to argue that his original analysis was correct in its fundamentals, and that the Covid pandemic was responsible for upending the usual patterns in the economy. (Economics is a tough game to play, and even to write about; while Krugman underestimated inflation, it is still not clear who is right about the specific role played by the stimulus package.)

THE IMPORTANCE OF INTELLECTUAL HUMILITY

With these examples in mind, a challenge for expert authority in the Third Millennium is to cultivate what is often called intellectual (or epistemic) humility.
Psychologist Mark Leary has been studying this trait for many years, finding that people high in intellectual humility are “more attentive to the strength of evidence regarding factual claims” and “more interested in understanding the reasons that people disagree with them.” He notes that “[c]ultures vary in the degree to which they value openness and flexibility and tolerate uncertainty and ambiguity.” At its best, Silicon Valley is a culture that has fostered an openness about errors, as exemplified by the popular slogan “Fail fast, fail often.” Of course, the saying is not a celebration of failure for its own sake, but a claim that failures are an inevitable by‑product of cutting‑edge technological entrepreneurship. There’s a similar notion among many scientists who contend that every graduate student is guaranteed to make certain experimental errors, and so the best course is to get them over with by gaining a lot of research experience as early as possible.

Recently, a community of younger scholars in psychology has begun to promote a culture in which researchers admit their mistakes. In their “Loss‑of‑Confidence” project, the scholars documented findings they had reported but now doubt. And they reported an anonymous survey of 315 scientists that found that 44% reported questioning at least one of their published findings. In the majority of these cases, researchers had not publicly acknowledged their loss of confidence, or had done so only in a forum other than the journal that published their study.

CALIBRATING OUR CONFIDENCE LEVELS

As we argued earlier, scientific evidence can provide probabilities but not absolute certainties. This implies that it is both foolish and unfair to expect our experts to be infallible. Even if they are doing their job perfectly, they will get things wrong some of the time. But it is quite sensible to expect our experts to be calibrated.

What do we mean by “calibration”? If the expert is offering the probability of an event, we can look across many different situations and see if the prediction matches the event’s frequency. If the expert is making a categorical assertion—“It is a brain tumor”—we can ask them to quantify the probability that they are correct. And if they are estimating a quantity, we can ask them to describe the range of low‑to‑high estimates that they are 95% confident will contain the correct value.
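The first kind of check, comparing stated probabilities with how often the predictions come true, can be sketched as a simple table. The function name and the answers below are made up for illustration; they are not data from any study.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Group (stated_confidence, was_correct) pairs by stated confidence.

    Returns {confidence: (number_of_answers, accuracy)}, ordered by confidence.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in predictions:
        buckets[confidence].append(was_correct)
    return {
        conf: (len(hits), sum(hits) / len(hits))
        for conf, hits in sorted(buckets.items())
    }


# Made-up answers for illustration only (not data from the text).
answers = (
    [(0.5, True)] * 11 + [(0.5, False)] * 9      # 55% right at 50% confidence
    + [(0.8, True)] * 13 + [(0.8, False)] * 7    # 65% right at 80% confidence
    + [(1.0, True)] * 16 + [(1.0, False)] * 4    # 80% right at 100% confidence
)

for conf, (n, accuracy) in calibration_table(answers).items():
    gap = conf - accuracy  # a positive gap means overconfidence
    print(f"stated {conf:.0%}  n={n:2d}  accuracy {accuracy:.0%}  gap {gap:+.0%}")
```

A perfectly calibrated person would show a gap of zero in every row. A gap that grows with stated confidence is the signature of overconfidence.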

A person is well calibrated when their stated confidence level at the moment of prediction matches their accuracy rate once we find out about the true outcome. We demonstrate what this means for our students by giving them dichotomous questions like, “Which is longer—the Panama Canal or the Suez Canal?” Now, most people haven’t looked up and memorized the answer to this question. But we aren’t studying our students’ mastery of “useless trivia.” Instead, we want to know how they rate their confidence in their response to each item. When their expressed confidence level corresponds to the odds of being correct, they are perfectly calibrated. For example, across all the times you give a confidence level of 50%, you should be right half the time and wrong half the time. Across all the times you say you are 100% confident, you should always be right. If your accuracy rate falls below your expressed confidence, then you are overconfident. You are underestimating your ignorance.

In the figure below, we show the results from many years of giving students the calibration exercise. When students report their confidence level as 50%, which essentially means they are guessing, they get things right a little more than 50% of the time, perhaps because they know more than they realize. But as people express increasingly high confidence in their answers, they are consistently less accurate than they believed they would be. This “classic” calibration pattern, showing a clear tendency toward overconfidence, has been derived again and again in many different studies for many different populations.

Evidence for this “overconfident” miscalibration is also seen in expert judgments. In the early 2000s, several researchers studied overconfidence among stock market forecasters in Germany. They asked 350 financial experts to give their predictions for the level of the DAX index (the German equivalent of the Dow Jones average) six months ahead of time on a rolling monthly basis.
Importantly, they asked each expert to specify a 90% confidence interval for each prediction—the range of values within which they thought the actual DAX value would fall nine times out of ten. Here’s what happened: Every month, the actual DAX value fell completely outside the confidence intervals that many of the experts had provided six months earlier. In fact, more than half the time during the twenty‑six‑month‑long study, it turned out that fewer than half of the experts had provided confidence intervals broad enough to encompass the DAX value that month. Not only were many of these experts quite wrong about the future direction of the German stock market, but they were also very poor at estimating how wrong they might be.
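The test applied to the forecasters is easy to state in code: count how many of the stated intervals actually contain the outcome. The sketch below uses invented intervals and an invented outcome, not the study's data, and the function name is ours.

```python
def interval_coverage(intervals, actual):
    """Fraction of (low, high) intervals that contain the actual value."""
    hits = sum(1 for low, high in intervals if low <= actual <= high)
    return hits / len(intervals)


# Hypothetical 90% intervals from five forecasters for one month's index level.
forecasts = [(6900, 7300), (7000, 7200), (6500, 7600), (7100, 7400), (6800, 7050)]
actual_level = 6700  # hypothetical outcome

coverage = interval_coverage(forecasts, actual_level)
print(f"stated confidence: 90%, observed coverage: {coverage:.0%}")
```

A single month says little on its own. For well‑calibrated forecasters, coverage averaged over many months should come out close to the stated 90%.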

That previous sentence actually holds the key to the concept of calibration. It contains the idea that in addition to knowledge (what you use, for example, to predict the DAX index six months out), there is metaknowledge, or knowledge about your knowledge. The German financial experts who provided confidence intervals that proved too tight were demonstrating poor metaknowledge. They didn’t know how much they didn’t know. They could have done better by improving their metaknowledge—that is, by calibrating their confidence levels.

Another example comes from Phil Tetlock’s research on professional foreign policy experts. What these experts predict can have a profound influence on public policy. Partly on the basis of expert predictions and forecasts, the US Congress makes allocations to the military budget, and the president develops diplomatic, economic, and military strategies and negotiates treaties. The more confident the experts are in their predictions, the more likely members of Congress and the president are to be swayed one way or the other. Tetlock’s research suggests that we need to be wary of such predictions. He asked several hundred foreign policy experts to make yes‑or‑no predictions for events five and ten years out. He asked, for example, “Will Vladimir Putin still be president of Russia in 2016?” For each prediction, he also had his subjects provide a level of confidence on a scale from 1 to 9. He found two pieces of bad news. First, the predictions were barely more accurate than would have been obtained by tossing a coin. Second, there was essentially no relationship between accuracy and expressed confidence. Predictions that turned out to be right had an average self‑reported confidence between 6.5 and 7.6; those that were wrong had an average confidence level between 6.3 and 7.1. These were not significantly different. The experts who were wrong were just as confident as the ones who were right.
That means the confidence with which a foreign policy expert stated their prediction was a very poor guide for deciding whether you should believe them.

We might expect physicists and other natural scientists to be better at calibrating their confidence than social scientists, especially when they are investigating features of the natural world that have nothing to do with politics. After all, natural scientists have at their disposal data, with frequency distributions and multiple measures, and advanced statistical formulas into which they input reams of data and get back precise confidence intervals. But it appears that those in the “hard” sciences have often in the past had just as much trouble judging the appropriate level of confidence to attach to their findings as financial and foreign policy experts.

Interestingly, part of the reason that we know this about natural scientists is that physicists have been particularly interested in understanding how well calibrated their confidence statements are, so they have been studying and tracking this question for decades. Physics was one of the first fields of science to work with extremely large datasets, and physicists have a long tradition of collaboration amid competition among teams around the world, so in the late 1950s and early 1960s they started to collect, compare, and combine their competing measurements—and their confidence estimates. They soon discovered indications of misplaced confidence in results.

For example, in trying to pin down the exact values of physical constants like the speed of light and the mass of an electron, physicists would be expected to report great uncertainty in the initial measurements, followed by progressively more certain estimates over time. In other words, the error bars should start out very broad and get tighter and tighter with each new study, and each new measurement of the constant should be likely to lie within the margin of error of the previous one. But this is not what happened. When the physicists plotted their historical estimates for the value of c, the speed of light, from 1870 to the 1960s along with their error bars, they found that the estimates bounce all over the place, and frequently the value estimated in one study lies entirely outside the margin of error given by the previous study. The same inconsistent and seemingly incoherent pattern occurs for the estimates of such physical constants as the inverse fine‑structure constant, Planck’s constant, the charge of an electron, the mass of an electron, and Avogadro’s number. Of course, during this whole history of the measurements, the scientists believed that their work finally represented a close approximation of the truth.
For example, in 1941, the physicist Raymond Birge wrote, “After a long and, at times, hectic history, the value for c has at last settled down into a fairly satisfactory ‘steady’ state.” Not long afterward, most estimates for the value of c began coming in much higher than Birge’s estimate, quite a bit outside his stated confidence interval, and the present‑day estimate, known to a high degree of confidence, is similarly well outside Birge’s “steady state.”

After the physicists saw these failures in estimating confidence levels, they became much more cautious about trusting simple internal estimates and began demanding much more cross‑comparison of results to gauge the uncertainties, and much tougher standards for accepting a claimed scientific discovery. Yet even so, one of the biggest lessons that physics experimentalists teach their students is that they will still be overconfident in their measurements!
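The consistency check described above, whether each new measurement lies within the margin of error of the previous one, can be sketched as follows. The numbers are hypothetical values in arbitrary units, not historical measurements of c or any other constant, and the function name is ours.

```python
def outside_previous_error_bar(measurements):
    """For each measurement after the first, report whether its value falls
    outside the previous measurement's value +/- stated uncertainty."""
    flags = []
    for (prev_value, prev_sigma), (value, _) in zip(measurements, measurements[1:]):
        flags.append(abs(value - prev_value) > prev_sigma)
    return flags


# Hypothetical (value, stated uncertainty) pairs, in arbitrary units.
series = [(100.0, 4.0), (107.0, 2.0), (103.5, 1.0), (103.9, 0.5), (102.8, 0.3)]

flags = outside_previous_error_bar(series)
print(flags)
print(f"{sum(flags)} of {len(flags)} new values fall outside the previous error bar")
```

In this made‑up series the error bars shrink steadily, yet three of the four later values land outside the preceding error bar. That is the kind of pattern that suggests the stated uncertainties were too small.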

Even if overconfidence is a feature of human psychology, improving our calibration is possible. Under certain circumstances, we can calibrate our confidence quite well. When you study the calibration of confidence levels among different professions in which predictions are integral to the work, you find that meteorologists’ short‑term forecasts, for example, are remarkably well calibrated. If you look at all the times a weather forecaster says the chances of rain the next day are 80%, you find that about 80% of the time the day turned out to be rainy. Why such good calibration? The key might be that meteorologists are constantly getting immediate feedback about their predictions. In addition, meteorologists’ professional prestige depends on their metaknowledge (that is, being calibrated) at least as much as it depends on their knowledge (being accurate). In any profession or domain, professional demands and social and cultural forces influence people’s judgments about the state of their knowledge. Becoming aware of the forces that affect how confidence levels are calibrated in yours may help you identify and resist the forces that subtly push you toward overconfidence.

We should strive, in a sense, to emulate IBM’s supercomputer, Watson. Watson famously defeated the best human Jeopardy! players not just because of its vast, Wikipedia‑like knowledge, but because of its astute metaknowledge. In Jeopardy!, metaknowledge is hugely important because only one contestant gets the chance to provide the right “question” to each “answer,” and that’s the contestant who presses the buzzer first. Since incorrect responses are penalized, there’s a great disincentive to simply “buzz in” as rapidly as possible. You want to press the buzzer only if you know, or think you know, the correct response. The contestants who win are those able to rapidly determine if they know the correct “question” or not. Watson is programmed to do this in real time and to do it very well.
It knows its own state of ignorance. Watson basically tells you, “In this case you should believe me; in this other case there’s little reason to believe me,” and that’s a very valuable thing in an expert.
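Why metaknowledge pays when wrong answers are penalized can be seen in a toy expected‑value rule. This is not Watson's actual algorithm, which the text does not describe. It is a minimal sketch that assumes a wrong response costs as much as a right one earns, and the function names and point values are hypothetical.

```python
def expected_gain(confidence: float, clue_value: float, penalty: float) -> float:
    """Expected score change from buzzing in, given the chance of being right."""
    return confidence * clue_value - (1.0 - confidence) * penalty


def should_buzz(confidence: float, clue_value: float, penalty: float) -> bool:
    """Buzz only when the expected score change is positive."""
    return expected_gain(confidence, clue_value, penalty) > 0.0


# Hypothetical clue worth 400 points, with an equal penalty for a wrong response.
for confidence in (0.3, 0.5, 0.9):
    gain = expected_gain(confidence, 400, 400)
    decision = should_buzz(confidence, 400, 400)
    print(f"confidence {confidence:.0%}: expected gain {gain:+.0f}, buzz={decision}")
```

The rule is only as good as the confidence fed into it. An overconfident contestant who reports 90% but is right only half the time would buzz in on clues that, on average, gain nothing.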

Excerpted with permission from THIRD MILLENNIUM THINKING by Saul Perlmutter, John Campbell and Robert MacCoun. Copyright © 2024 by Saul Perlmutter, John Campbell, and Robert MacCoun. Used with permission of Little, Brown Spark, an imprint of Little, Brown and Company. New York, NY. All rights reserved.

Saul Perlmutter, a Nobel Laureate in Physics, is a professor at the University of California, Berkeley and Lawrence Berkeley National Laboratory. John Campbell, a former president of the European Society for Philosophy and Psychology, is a professor of philosophy at the University of California, Berkeley.