The Datasaurus - Why Data Visualization is Critical

Miles Porter
May 13, 2020

One of the fundamental pillars of data science and analytics is statistics. Statistics typically involves sampling data and using those samples to make statements, with varying degrees of confidence, about the population from which they came. Descriptive statistics, which many people consider to be ALL of statistics, involves computing numerical summaries of a set of data. We learned about one basic descriptive statistic, the mean, in elementary school. Other descriptive statistics like mode, variance and standard deviation come later. Essentially, they all do the same thing… they reduce a dataset to a few basic numerical values.
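
As a small illustration of how these statistics collapse a dataset into a few numbers, the sketch below uses a made-up vector (the values are an assumption for illustration, not taken from the Datasaurus data). Base R has no built-in function for the statistical mode, so a frequency table stands in for one here.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x_mean <- mean(x)   # arithmetic mean
x_var  <- var(x)    # sample variance (divides by n - 1)
x_sd   <- sd(x)     # sample standard deviation, i.e. sqrt(x_var)

# Mode: the most frequent value, read off a frequency table
x_mode <- as.numeric(names(which.max(table(x))))

print(c(mean = x_mean, variance = x_var, sd = x_sd, mode = x_mode))
```

Eight values go in and four numbers come out, which is exactly the kind of reduction that can hide structure.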

Descriptive statistics also provide a way to compare different sets of data. ANOVA (aka Analysis of Variance) is one technique that builds on them. In its most basic form, ANOVA can tell us whether the means of two or more datasets are significantly different.

Care must be taken, however, when using descriptive statistics as a way to compare datasets. To see why, let’s take a look at “The Datasaurus Dozen”, a collection of datasets built around the Datasaurus created by Alberto Cairo.

Load the data

Let’s begin by reading the dataset into R and taking a look at the data…

rm(list=ls())
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
dino <- read_tsv('DatasaurusDozen.tsv')
## Parsed with column specification:
## cols(
##   dataset = col_character(),
##   x = col_double(),
##   y = col_double()
## )
dino$dataset <- as.factor(dino$dataset)
head(dino)
## # A tibble: 6 x 3
##   dataset     x     y
##   <fct>   <dbl> <dbl>
## 1 dino     55.4  97.2
## 2 dino     51.5  96.0
## 3 dino     46.2  94.5
## 4 dino     42.8  91.4
## 5 dino     40.8  88.3
## 6 dino     38.7  84.9

The Datasaurus Dozen contains 13 datasets (the original dino plus a dozen others), each with x and y values. The datasets are named:

  • away
  • bullseye
  • circle
  • dino
  • dots
  • h_lines
  • high_lines
  • slant_down
  • slant_up
  • star
  • v_lines
  • wide_lines
  • x_shape

Visualize boxplots

Let’s begin by looking at side-by-side boxplots of the x and y values for each of these datasets.

ggplot(dino, aes(x = dataset, y = x)) + geom_boxplot()

ggplot(dino, aes(x = dataset, y = y)) + geom_boxplot()

In the above, the boxplots look very similar.

ANOVA

One test we can use to determine if the means of the data are statistically different is ANOVA. Running ANOVA on the Datasaurus datasets we get the following:

print("ANOVA Test for Datasaurus X Values")
## [1] "ANOVA Test for Datasaurus X Values"
a <- aov(x~dataset, data=dino)
summary(a)
##               Df Sum Sq Mean Sq F value Pr(>F)
## dataset       12      0     0.0       0      1
## Residuals   1833 515354   281.1
print("ANOVA Test for Datasaurus Y Values")
## [1] "ANOVA Test for Datasaurus Y Values"
a <- aov(y~dataset, data=dino)
summary(a)
##               Df  Sum Sq Mean Sq F value Pr(>F)
## dataset       12       0     0.0       0      1
## Residuals   1833 1329881   725.5

In the two tests above, the null hypothesis \(H_0\) is that the group means are equal. In other words, the mean of the x’s is the same across all 13 datasets, and likewise for the mean of the y’s.

Since the p-value (Pr(>F)) is 1 for both tests, we cannot reject this null hypothesis for the x’s or the y’s.

It is critical to keep in mind exactly what that last statement says. It does NOT say that the datasets are equal. It says that we cannot reject the hypothesis that the datasets have the same mean.

It is a subtle difference, and it turns out that these datasets are very different. ANOVA just doesn’t capture the difference.
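
To see how this can happen, here is a minimal sketch with two made-up groups (hypothetical values, not part of the Datasaurus data) that share a mean and a variance but have very different shapes: one is spread evenly, the other sits in two tight clusters. ANOVA only compares means, so it has nothing to detect.

```r
# Group a: 50 values spread evenly between -3 and 3 (mean 0)
a <- seq(-3, 3, length.out = 50)

# Group b: two clusters at -s and +s, with s chosen so var(b) equals var(a)
s <- sd(a) * sqrt(49 / 50)
b <- rep(c(-s, s), each = 25)

toy <- data.frame(
  value = c(a, b),
  group = factor(rep(c("even", "clustered"), each = 50))
)

# One-way ANOVA on the two groups, then pull out the p-value
fit <- aov(value ~ group, data = toy)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

print(c(mean_a = mean(a), mean_b = mean(b), var_a = var(a), var_b = var(b)))
print(p_value)  # close to 1: no evidence that the means differ
```

A histogram of each group would show the difference immediately, even though the test and the summary numbers do not.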

We can also look at the variances and standard deviations of the different sets:

nms = c("away","bullseye","circle","dots","h_lines","high_lines","slant_down",
        "slant_up","star","v_lines","wide_lines","x_shape","dino")

# Start empty; one row per dataset is appended in the loop below
var_sum <- data.frame()

for (nm in nms){
 d = filter(dino, dataset==nm)   # rows for this dataset only
 varx = var(d$x)
 vary = var(d$y)
 sdx = sd(d$x)
 sdy = sd(d$y)
 var_sum <- rbind(var_sum, data.frame(nm, varx, vary, sdx, sdy))
}
names(var_sum) <- c('dataset','variance_x','variance_y','stdev_x','stdev_y')
print(var_sum)
##       dataset variance_x variance_y  stdev_x  stdev_y
## 1        away   281.2270   725.7498 16.76982 26.93974
## 2    bullseye   281.2074   725.5334 16.76924 26.93573
## 3      circle   280.8980   725.2268 16.76001 26.93004
## 4        dots   281.1570   725.2352 16.76774 26.93019
## 5     h_lines   281.0953   725.7569 16.76590 26.93988
## 6  high_lines   281.1224   725.7635 16.76670 26.94000
## 7  slant_down   281.1242   725.5537 16.76676 26.93610
## 8    slant_up   281.1944   725.6886 16.76885 26.93861
## 9        star   281.1980   725.2397 16.76896 26.93027
## 10    v_lines   281.2315   725.6388 16.76996 26.93768
## 11 wide_lines   281.2329   725.6506 16.77000 26.93790
## 12    x_shape   281.2315   725.2250 16.76996 26.93000
## 13       dino   281.0700   725.5160 16.76514 26.93540

Looking at the table above, we can see that the variances and standard deviations of the x values, and likewise of the y values, are nearly identical across the datasets.

So, if the means and the variances of the data sets are nearly identical, are they nearly the same data? It turns out that they are not even close to being the same.

Seeing is Believing!

As you probably have already guessed, there is more going on here. Let’s take a look at the different plots of the raw data…

plot_ds <- function(nm){
  g = ggplot(filter(dino, dataset==nm), aes(x = x, y = y)) + geom_point()
  return(g)
}

nms = c("away","bullseye","circle","dots","h_lines","high_lines","slant_down",
        "slant_up","star","v_lines","wide_lines","x_shape","dino")

for (nm in nms){
  print(plot_ds(nm))
}

And in the last graph, it is pretty clear why this is called “The Datasaurus Dozen”!
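
Printing one plot per dataset works, but a single faceted figure can make the shapes easier to compare side by side. The sketch below is one way to do that with `facet_wrap()`. The helper name `plot_all` and the tiny stand-in data frame are assumptions for illustration; with the data loaded as above, you would pass `dino` instead.

```r
library(ggplot2)

# Hypothetical helper: one scatterplot panel per level of `dataset`.
# Expects a data frame with columns `dataset`, `x`, and `y`.
plot_all <- function(df) {
  ggplot(df, aes(x = x, y = y)) +
    geom_point(size = 0.8) +
    facet_wrap(~ dataset)
}

# Stand-in data so the example runs on its own (not the real Datasaurus values)
toy <- data.frame(
  dataset = rep(c("line", "vee"), each = 11),
  x = rep(0:10, times = 2),
  y = c(0:10, abs(0:10 - 5) * 2)
)

p <- plot_all(toy)
print(p)
```

By default `facet_wrap()` keeps the same axis scales in every panel, which suits this comparison because the datasets share nearly the same ranges.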

Conclusion

As The Datasaurus Dozen points out, just looking at descriptive statistics is not necessarily enough to get a sense of what is going on with datasets. One of the most powerful tools we have in our toolkit is our eyes.

What we are seeing here is also illustrated by another famous dataset that has been around since the 1970s: Anscombe’s Quartet. One of the biggest mysteries with Anscombe’s Quartet is that nobody really knows how Anscombe came up with the dataset. The Datasaurus Dozen has no such mystery: researchers at Autodesk used a technique called simulated annealing to generate the other datasets from Alberto Cairo’s original Datasaurus while keeping the summary statistics nearly unchanged. More information about that can be found in this post:

https://www.autodeskresearch.com/publications/samestats
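
Anscombe’s Quartet ships with base R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`), so the same effect is easy to check without downloading anything. The following sketch computes summary statistics for each of the four sets and then plots them. It is an illustration of the idea, not part of the Datasaurus analysis above, and the name `quartet` is just a label chosen here.

```r
# Built-in dataset: columns x1..x4 and y1..y4
data(anscombe)

# One row of summary statistics per x/y pair
quartet <- data.frame(
  set    = 1:4,
  mean_x = sapply(anscombe[1:4], mean),
  mean_y = sapply(anscombe[5:8], mean),
  var_x  = sapply(anscombe[1:4], var),
  var_y  = sapply(anscombe[5:8], var),
  cor_xy = mapply(cor, anscombe[1:4], anscombe[5:8])
)
print(round(quartet, 3), row.names = FALSE)

# The summaries nearly match; the scatterplots do not
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[i]], anscombe[[i + 4]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```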