Why We’re So Bad At Measuring Impact, And How To Fix It

Data scientists and information economists in particular are beginning to pair with social innovators to understand the dynamics of interventions, and separate what works from what doesn’t.


This piece is from PopTech Editions III–Made to Measure: The new science of impact, which explores the evolving techniques to accurately gauge the real impact of initiatives and programs designed to do social good. Visit PopTech for more interviews, essays, and videos with leading thinkers on this subject.


How often has some version of this story happened:

A group of young, eager innovators come together to develop a new, promising approach to one of today’s “wicked problems” in an area like climate change, poverty alleviation, food security, or off-grid energy.

With a mix of design and engineering prowess, good intentions and no small amount of luck, they develop a laudable prototype. This wins them breathless media attention, speaking invitations to conferences and perhaps a prize or two, followed by sufficient seed capital for a pilot.

The pilot shows promise; after the intervention, the relevant critical indicator (which might be a measure of market access, public health, etc.) shows marked improvement. On the strength of this happy outcome, more capital is raised. The intervention moves out of the pilot stage and is rolled out to the community. The press is breathless. Hopes are high.

And then, much to everyone’s chagrin: almost nothing changes. The new social innovation barely makes a dent in the problem, which appears more pernicious than ever.

What happened?


If you recognize elements of this story (or if you wince in self-recognition) you are not alone. This is the common fate of most social innovations, and it’s the field’s dirty little secret: many of the most promising new approaches to tough problems fail, in ways that surprise and frustrate their creators, funders, and constituents alike.

The reasons behind such failures are complex. The most common culprit is a kind of cultural blindness on the part of would-be change agents, who fail to design “with, not for” the communities they serve, and end up trying to impose a solution from without, rather than encourage its adoption from within. More generally, it’s important to remember that wicked problems have earned that moniker for a reason–they are generally immune to “elegant hacks” and quick fixes that can be a hallmark of other endeavors, such as software development.

But there are other, deeper reasons why social innovations unexpectedly fail. They involve the many ways we unintentionally mismeasure the impact we’re having, and fool ourselves that a social intervention is working when it really isn’t.

The most common pitfall we encounter in measuring the impact of a social innovation is failing to establish a control group. Without assessing a matched cohort that is not receiving the intervention, it is impossible to know what precise effect a social innovation is having.

For example, let’s say you develop an innovative literacy-improving program for children. You test a community of low-literacy subjects, then provide the intervention, and test them again. Their measured rates of literacy jump dramatically. Time to pop the champagne corks, right?

Wait a moment. Why exactly did rates of literacy improve? Was it your program? Or was it a natural byproduct of the maturation of the subjects? (Between the first and second tests, the children you tested got older–their independent cognitive development may account for the increase.) Or was it a practice effect of the test? After all, we tend to do better on tasks we’ve tried before. It might be the case that subjects simply got better because they’d seen this kind of test before.
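To make the role of a comparison group concrete, here is a minimal sketch in Python. The scores are invented for illustration and the function names are hypothetical. The calculation is a simple difference of average gains (sometimes called difference-in-differences), and it assumes the two cohorts are well matched and would have changed at the same rate without the program.

```python
import statistics

def average_gain(pre_scores, post_scores):
    """Average change between the first and second test for one cohort."""
    return statistics.mean(post_scores) - statistics.mean(pre_scores)

def gain_beyond_control(treated_pre, treated_post, control_pre, control_post):
    """Gain of the treated cohort minus the gain of the untreated cohort.

    Maturation and practice effects show up in both cohorts, so they
    cancel out here (assuming the cohorts are genuinely comparable).
    """
    return (average_gain(treated_pre, treated_post)
            - average_gain(control_pre, control_post))

# Hypothetical literacy scores, made up for this example.
treated_pre = [40, 45, 50, 55]
treated_post = [52, 58, 61, 69]
control_pre = [41, 44, 51, 54]
control_post = [50, 55, 60, 65]

naive = average_gain(treated_pre, treated_post)
controlled = gain_beyond_control(treated_pre, treated_post,
                                 control_pre, control_post)
print(f"Naive before/after gain: {naive}")
print(f"Gain beyond the control group: {controlled}")
```

With these made-up numbers, the naive before-and-after gain is 12.5 points, but the control group gained 10 points with no program at all. That leaves only 2.5 points that could plausibly be credited to the intervention.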


Then again, perhaps we have run into a regression effect. This requires a bit of additional explanation.

Many phenomena, like the temperature in a given month, or your bowling score, will cluster around an average. On some days, it may be moderately higher, on others moderately lower. But on average, these indicators will cluster around a central number, a “mean.”

Now, let’s imagine we take a group of subjects and give them a test, such as the baseline literacy test mentioned above. As with the examples above, most will score close to the mean, while a few will be outliers, scoring dramatically higher or lower. Given the same test again, with no additional intervention, it’s likely that the subjects who were outliers in the first test will “migrate” closer to the mean, while some who were at the mean in the first test will “migrate” to the extreme high or low of the range in the second. This is a purely natural statistical artifact.

Now let’s temporarily assume, for the sake of argument, that the hypothetical literacy program we devised had an astonishing 0% effectiveness. We measure the baseline of the population; then we deliver this (useless) intervention; and then measure again, paying careful attention to those who did the worst on the first test. Amazingly, many will show marked improvement, “migrating” to the middle of the pack, though for reasons that have nothing to do with our literacy program.
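The thought experiment above can be sketched as a small simulation. This is an illustration under stated assumptions, not a model of any real program: each score is assumed to be a stable underlying skill plus independent test-day noise, the means and spreads are arbitrary, and the function name is hypothetical. The "program" between the two tests adds exactly nothing.

```python
import random
import statistics

def simulate_regression_to_mean(n_subjects=10_000, seed=0):
    """Two noisy tests of an unchanging skill, with a zero-effect program between them."""
    rng = random.Random(seed)
    # Assumed model: score = stable skill + independent test-day noise.
    skill = [rng.gauss(50, 10) for _ in range(n_subjects)]
    test1 = [s + rng.gauss(0, 10) for s in skill]
    test2 = [s + rng.gauss(0, 10) for s in skill]  # no program effect added

    # Pay "careful attention" to roughly the worst 10% on the first test.
    cutoff = sorted(test1)[n_subjects // 10]
    worst = [i for i in range(n_subjects) if test1[i] <= cutoff]

    worst_before = statistics.mean(test1[i] for i in worst)
    worst_after = statistics.mean(test2[i] for i in worst)
    return worst_before, worst_after, statistics.mean(test1), statistics.mean(test2)

worst_before, worst_after, mean1, mean2 = simulate_regression_to_mean()
print(f"Lowest scorers, first test:  {worst_before:.1f}")
print(f"Lowest scorers, second test: {worst_after:.1f}")
print(f"Everyone, first vs. second:  {mean1:.1f} vs. {mean2:.1f}")
```

In this sketch the lowest scorers' average rises noticeably on the retest while the group-wide average barely moves, even though nothing was done for anyone.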

Even controlling for regression effects, there may be other phantoms lurking in our measurement. Placebo effects happen in social interventions just as they do in medicine. Some people who believe they’ve received an effective intervention may do better whether the intervention is actually effective or not.

Much more common, particularly in measuring social innovation initiatives, is the problem of selective dropout. This occurs when the “users” of a particular intervention find it either too easy or too difficult, and stop participating. When that happens, the results of any subsequent analysis can be markedly skewed. Perhaps it’s true that the average literacy rates of a particular classroom of students improved by 20% after the administration of our program, but it’s meaningless if 20% of the students found it too difficult and left the class altogether: the average may have risen simply because the lowest scorers are no longer being counted.
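A bit of arithmetic shows how dropout alone can manufacture an improvement. The scores below are invented, and the function name is hypothetical; the sketch assumes the students who leave are the lowest scorers and that nobody who stays changes their score at all.

```python
import statistics

def average_before_and_after_dropout(scores, dropout_fraction):
    """Class average with everyone, then with the lowest scorers removed.

    No individual score changes; only who gets counted changes.
    """
    ranked = sorted(scores)
    n_dropped = round(len(ranked) * dropout_fraction)
    remaining = ranked[n_dropped:]
    return statistics.mean(scores), statistics.mean(remaining)

# Hypothetical class of ten; the two weakest students (20%) leave.
scores = [5, 15, 50, 55, 55, 60, 60, 65, 65, 70]
before, after = average_before_and_after_dropout(scores, 0.20)
improvement = (after - before) / before * 100
print(f"Average before: {before}, after dropout: {after}")
print(f"Apparent improvement: {improvement:.0f}%")
```

Here the class average climbs from 50 to 60, an apparent 20% improvement, although not a single student actually read any better.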


The inverse problem, a form of priming, is particularly common in social innovation and makes measurement difficult. This occurs when the measurement of an intervention suggests, often subconsciously, what the “right” answers should be.

Finally, there are compensation effects that can occur when we change a social system. When we make cars safer, people may drive more dangerously, precisely because we made driving less dangerous. When we make cookstoves more efficient (and therefore more healthy and less polluting to use) people may use them more, offsetting the benefits of the efficiency.

All of these biases–sample maturation, practice effects, regression artifacts, placebo and compensation effects, and countless others–can dramatically distort the perceived success of a particular intervention, often making it look much more effective than it actually is.

Does this mean we should just throw in the towel? Hardly. Social science and fields like medical research are replete with tools for designing effective impact measurement. Data scientists and information economists in particular are beginning to pair with social innovators to understand the dynamics of interventions, and separate what works from what doesn’t. Technologists are uncovering new ways to aggregate core impact data and make it open. Yet this work has little bearing on the kind of impact statements demanded by many funders today.

What we need now is a revolution in both the practice and culture of social innovation, one that recognizes that meaningful measurement is every bit as essential–and artful–as the interventions themselves, and bakes it in as a core component of the work. Otherwise, we may very well be wasting everyone’s time.