Miles Porter
May 13, 2020
One of the fundamental pillars of data science and analytics is statistics. Statistics typically involves sampling data and using those samples to make statements, with varying degrees of confidence, about the population from which they came. Descriptive statistics, what many people consider to be ALL of statistics, involves developing numerical measurements about a set of data. We learned about one basic descriptive statistic when we were in elementary school: the mean. Other descriptive statistics like the mode, variance, and standard deviation come later. Essentially, they all do the same thing… they reduce a dataset to a few basic numerical values.
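As a quick illustration on a small made-up vector (nothing to do with the Datasaurus data yet), R computes these one-number summaries directly:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)
## [1] 5
var(x)
## [1] 4.571429
sd(x)
## [1] 2.13809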
Descriptive statistics also provide a way to compare different sets of data. ANOVA (aka Analysis of Variance) is one technique that does this. In its most basic form, ANOVA can tell us whether the means of two or more datasets are significantly different.
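To make that concrete, here is a small hypothetical sketch on simulated data, where the two groups really do have different true means (0 and 1):
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 50)),
  value = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)
# with a genuine difference in means, Pr(>F) should come out very small
summary(aov(value ~ group, data = toy))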
Care must be taken, however, when using descriptive statistics as a way to compare datasets. To see why, let’s take a look at “The Datasaurus Dozen”, a collection of datasets that grew out of Alberto Cairo’s original Datasaurus.
Let’s begin by reading the dataset into R and taking a look at the data…
rm(list=ls())
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dino <- read_tsv('DatasaurusDozen.tsv')
## Parsed with column specification:
## cols(
## dataset = col_character(),
## x = col_double(),
## y = col_double()
## )
dino$dataset <- as.factor(dino$dataset)
head(dino)
## # A tibble: 6 x 3
## dataset x y
## <fct> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
The file actually contains 13 datasets (Alberto Cairo’s original dino plus the twelve variations that make up the Dozen), each with x and y values. The datasets are named away, bullseye, circle, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, x_shape, and dino.
First, let’s look at side-by-side boxplots of the x and y values for each of these datasets.
ggplot(dino, aes(x = dataset, y = x)) + geom_boxplot()
ggplot(dino, aes(x = dataset, y = y)) + geom_boxplot()
In the plots above, the boxplots for both x and y look nearly identical across all of the datasets.
One test we can use to determine whether the means of the datasets are statistically different is ANOVA. Running ANOVA on the Datasaurus datasets we get the following:
print("ANOVA Test for Datasaurus X Values")
## [1] "ANOVA Test for Datasaurus X Values"
a <- aov(x~dataset, data=dino)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## dataset 12 0 0.0 0 1
## Residuals 1833 515354 281.1
print("ANOVA Test for Datasaurus Y Values")
## [1] "ANOVA Test for Datasaurus Y Values"
a <- aov(y~dataset, data=dino)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## dataset 12 0 0.0 0 1
## Residuals 1833 1329881 725.5
In the two tests above, the null hypothesis \(H_0\) is that the differences between the group means are all 0. In other words, the means of the x’s (and, in the second test, the means of the y’s) are the same for all 13 datasets.
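Stated a bit more formally for the x values (and analogously for the y values):
\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_{13} \qquad \text{vs.} \qquad H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]
where \(\mu_i\) is the mean of the x values in the \(i\)-th dataset.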
Since the p-value (Pr(>F)) is 1 for both tests, we cannot reject this null hypothesis for either the x’s or the y’s.
It is critical to keep in mind exactly what that last statement says. It does NOT say that the datasets are equal; it says that we cannot reject the hypothesis that the datasets have the same mean.
It is a subtle difference, and it turns out that these datasets are very different. ANOVA just doesn’t capture the difference.
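As a quick sanity check (an addition to the ANOVA, not a replacement for it), we can compute the group means directly with dplyr; every dataset should come out with a mean x of roughly 54.26 and a mean y of roughly 47.83:
dino %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y))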
We can also look at the variances and standard deviations of the different sets:
nms = c("away","bullseye","circle","dots","h_lines","high_lines","slant_down",
"slant_up","star","v_lines","wide_lines","x_shape","dino")
var_sum <- data.frame()  # results table, built up one dataset at a time
for (nm in nms){
  ds <- filter(dino, dataset == nm)  # rows for this dataset only
  varx <- var(ds$x)  # variance of x
  vary <- var(ds$y)  # variance of y
  sdx <- sd(ds$x)    # standard deviation of x
  sdy <- sd(ds$y)    # standard deviation of y
  var_sum <- rbind(var_sum, data.frame(nm, varx, vary, sdx, sdy))
}
names(var_sum) <- c('dataset','variance_x','variance_y','stdev_x','stdev_y')
print(var_sum)
##       dataset variance_x variance_y  stdev_x  stdev_y
## 1        away   281.2270   725.7498 16.76982 26.93974
## 2    bullseye   281.2074   725.5334 16.76924 26.93573
## 3      circle   280.8980   725.2268 16.76001 26.93004
## 4        dots   281.1570   725.2352 16.76774 26.93019
## 5     h_lines   281.0953   725.7569 16.76590 26.93988
## 6  high_lines   281.1224   725.7635 16.76670 26.94000
## 7  slant_down   281.1242   725.5537 16.76676 26.93610
## 8    slant_up   281.1944   725.6886 16.76885 26.93861
## 9        star   281.1980   725.2397 16.76896 26.93027
## 10    v_lines   281.2315   725.6388 16.76996 26.93768
## 11 wide_lines   281.2329   725.6506 16.77000 26.93790
## 12    x_shape   281.2315   725.2250 16.76996 26.93000
## 13       dino   281.0700   725.5160 16.76514 26.93540
Looking at the table above, we can see that the variances and standard deviations of the x and y values are very close across all of the datasets.
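We can go one step further (this check is not in the table above): even the Pearson correlation between x and y is nearly identical in every dataset, at roughly -0.06:
dino %>%
  group_by(dataset) %>%
  summarise(corr = cor(x, y))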
So, if the means, the variances, and even the correlations of the datasets are nearly identical, are they nearly the same data? It turns out that they are not even close to being the same.
As you probably have already guessed, there is more going on here. Let’s take a look at the different plots of the raw data…
# helper: scatterplot of a single dataset, selected by name
plot_ds <- function(nm){
  g <- ggplot(filter(dino, dataset == nm), aes(x = x, y = y)) + geom_point()
  return(g)
}
nms = c("away","bullseye","circle","dots","h_lines","high_lines","slant_down",
"slant_up","star","v_lines","wide_lines","x_shape","dino")
for (nm in nms){
print(plot_ds(nm))
}
And in the last graph, it is pretty clear why this is called “The Datasaurus Dozen”!
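As an aside, the same thirteen scatterplots can be drawn as a single faceted figure (a ggplot2 convenience, not something the loop above requires), which makes them easier to compare at a glance:
ggplot(dino, aes(x = x, y = y)) +
  geom_point(size = 0.8) +
  facet_wrap(~dataset, ncol = 4)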
As The Datasaurus Dozen points out, looking at descriptive statistics alone is not necessarily enough to understand what is going on in a dataset. One of the most powerful tools we have in our toolkit is our eyes.
What we are seeing here is also illustrated by another famous dataset that has been around since the 1970s: Anscombe’s Quartet. One of the enduring mysteries of Anscombe’s Quartet is that nobody really knows how Francis Anscombe came up with the data. The Datasaurus Dozen has no such mystery: researchers at Autodesk (Justin Matejka and George Fitzmaurice) used a technique called simulated annealing to generate the twelve variations from Alberto Cairo’s original Datasaurus, in work titled “Same Stats, Different Graphs.” More information about that can be found in this post:
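To get a rough feel for the idea, here is a toy sketch of the core principle (not the Autodesk implementation): randomly jitter one point at a time, and only keep a jitter when the summary statistics stay essentially unchanged. The real method additionally uses simulated annealing to nudge the accepted points toward a target shape, such as a dinosaur.
# toy sketch: perturb a dataset while (approximately) preserving its stats
same_stats <- function(a, b, tol = 0.01) {
  abs(mean(a$x) - mean(b$x)) < tol && abs(mean(a$y) - mean(b$y)) < tol &&
    abs(sd(a$x) - sd(b$x)) < tol && abs(sd(a$y) - sd(b$y)) < tol
}
set.seed(42)
start <- data.frame(x = rnorm(100, 50, 10), y = rnorm(100, 50, 10))
current <- start
for (i in 1:20000) {
  candidate <- current
  j <- sample(nrow(candidate), 1)                      # pick a random point
  candidate$x[j] <- candidate$x[j] + rnorm(1, 0, 0.5)  # jitter it slightly
  candidate$y[j] <- candidate$y[j] + rnorm(1, 0, 0.5)
  if (same_stats(start, candidate)) current <- candidate  # keep only if stats held
}
# 'current' now looks different from 'start', yet its means and standard
# deviations agree with the originals to within the tolerance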