Why is data visualization so important in statistics, anyway? Graphs and other kinds of visualizations might seem superfluous if you’re using statistical analysis to look for patterns in a data set, right? Short answer: wrong.
A new research paper presented this week at the human-computer interaction conference ACM CHI shows just how important it is to visualize your data. In it, two Autodesk researchers show how 12 data sets that share the same basic summary statistics, such as mean, standard deviation, and Pearson’s correlation, can look radically different as graphs. The data sets might have a lot in common on paper, but as visualizations they form stars, circles, and other shapes. The point? To show that data visualization isn’t just aesthetic: it’s a crucial part of analysis that can reveal surprising things about your data.
“There’s still the impression that creating graphics or visualizations is really just making pretty pictures and the real stuff you need to do can be done through analysis,” says Autodesk researcher Justin Matejka, who wrote the paper with fellow researcher George Fitzmaurice. “Even if you’re very good at statistics, you might miss something.”
The paper builds on a classic idea in statistics called Anscombe’s Quartet. The “quartet” is a group of four data sets, created by the statistician F.J. Anscombe in 1973, that have the same “summary statistics,” or mean, standard deviation, and Pearson’s correlation. Yet they each produce wildly different graphs. It’s a famed demonstration of just how vital it can be to visualize data rather than relying on statistics alone, and Matejka and Fitzmaurice wanted to update it for data-rich 2017.
“[Anscombe’s Quartet] is 45 years old at this point, so maybe it’s time for a slightly more exciting tool to teach the same lesson,” Matejka says.
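To see the quartet’s point in numbers, here is a minimal Python sketch that computes the summary statistics for Anscombe’s four published data sets. The helper names `pearson` and `summarize` are our own, and rounding to two decimal places is an assumption made for the comparison.

```python
import math
import statistics

# Anscombe's quartet (1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarize(xs, ys):
    """Mean, sample standard deviation and correlation, rounded to 2 places."""
    return {
        "mean_x": round(statistics.mean(xs), 2),
        "mean_y": round(statistics.mean(ys), 2),
        "sd_x": round(statistics.stdev(xs), 2),
        "sd_y": round(statistics.stdev(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, stats)
```

At two decimal places the four summaries come out identical. Only a scatter plot of each set shows how different they are.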
They were also inspired by an image from the data viz expert Alberto Cairo, who last year tweeted a visualization of a data set that formed the shape of a T. rex (he called it “the datasaurus”). The numbers in the data set itself looked totally normal; it wasn’t until they were visualized that the dinosaur emerged. No matter how well you think you know your data, visualizing it can reveal something surprising.
— Alberto Cairo (@albertocairo) August 15, 2016
Matejka and Fitzmaurice took his point even further. Their work shows how 12 different data sets that have the same summary statistics as the Datasaurus can have 12 vastly different graphic representations. Each of the 12 data sets began as the data set that Cairo used to make the Datasaurus, and yet the resulting graphics form a series of shapes that Matejka chose specifically because they’re so different.
To achieve what they call the “Datasaurus dozen,” Matejka and Fitzmaurice made 200,000 incremental changes to the Datasaurus data set, slightly shifting points so that the summary statistics stayed within one-hundredth of the originals. GIFs that show the slowly shifting points next to the summary statistics hammer their point home.
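The core idea can be sketched in a few lines of Python. This is a simplified illustration, not the researchers’ actual code: it nudges one random point at a time and keeps the move only if every summary statistic still rounds to the same two decimal places. It moves points at random and makes no attempt to steer them toward a chosen shape. The function names, the step size `scale`, and the random starting data are all assumptions for the example.

```python
import math
import random


def summary(xs, ys, digits=2):
    """Means, sample standard deviations and Pearson's r, rounded."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sxx / (n - 1))
    sd_y = math.sqrt(syy / (n - 1))
    r = sxy / math.sqrt(sxx * syy)
    return tuple(round(v, digits) for v in (mx, my, sd_x, sd_y, r))


def nudge_points(xs, ys, steps=10_000, scale=0.1, seed=0):
    """Randomly shift points, rejecting any move that changes the summary."""
    rng = random.Random(seed)
    xs, ys = list(xs), list(ys)      # work on copies
    target = summary(xs, ys)         # statistics we must preserve
    for _ in range(steps):
        i = rng.randrange(len(xs))
        old_x, old_y = xs[i], ys[i]
        xs[i] += rng.gauss(0, scale)
        ys[i] += rng.gauss(0, scale)
        if summary(xs, ys) != target:
            xs[i], ys[i] = old_x, old_y   # undo: statistics drifted too far
    return xs, ys


# Hypothetical starting data: 50 random points (not the Datasaurus itself).
_rng = random.Random(1)
start_x = [_rng.uniform(0, 100) for _ in range(50)]
start_y = [_rng.uniform(0, 100) for _ in range(50)]

new_x, new_y = nudge_points(start_x, start_y, steps=2000)
print("before:", summary(start_x, start_y))
print("after: ", summary(new_x, new_y))
```

Because every rejected move is undone, the rounded statistics after the run match the starting ones by construction, even though individual points have drifted.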
Matejka explains why this is important in practice through what’s called Simpson’s Paradox, where groups of data within a set might show one trend, but the entire data set might show something completely contradictory. For example, Matejka points to one set of data that appears to show crime increasing. Yet when that data is broken down by location, each area shows a strong downward trend in crime. It’s another example of how an aggregate statistic can hide a pattern that becomes obvious once the data is graphed.
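A toy example makes the paradox concrete. The numbers below are invented for illustration and are not the crime data Matejka refers to: two hypothetical areas each show a falling trend, but pooling them produces a rising one. The `slope` helper is our own least-squares fit.

```python
def slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up (x, y) values for two hypothetical areas.
area_a = ([1, 2, 3, 4], [10, 9, 8, 7])      # y falls as x rises
area_b = ([5, 6, 7, 8], [20, 19, 18, 17])   # y falls as x rises

# Pool both areas into a single data set.
pooled_x = area_a[0] + area_b[0]
pooled_y = area_a[1] + area_b[1]

slope_a = slope(*area_a)
slope_b = slope(*area_b)
slope_pooled = slope(pooled_x, pooled_y)

print("area A slope:", slope_a)
print("area B slope:", slope_b)
print("pooled slope:", slope_pooled)
```

Each area’s slope is negative while the pooled slope is positive, because area B sits higher on both axes. A scatter plot colored by area would show the two downward-sloping clusters at a glance.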
Matejka hopes that the research can be used for educational purposes, but he also believes that the researchers’ iterative approach to shifting data sets could have more commercial applications. Take a data set from a study that includes identifying information, for instance. Their approach could preserve the data’s summary statistics while truly anonymizing it.
One unintended consequence of Matejka’s research is the implication that data is also easily manipulated, even when its summary statistics are kept constant. If you can make a data set that has the same statistical characteristics as another, it could be possible to mess with data without detection, though only if no one ever graphs it.
So data designers, rest assured: Your work is very, very important.