When Are Numbers Not Enough?

It can be difficult to demonstrate the importance of data visualization. Yet summary statistics alone are often insufficient to tell a complete (and true) story.

Developed by F.J. Anscombe in 1973, Anscombe’s Quartet consists of four datasets with nearly identical summary statistics (mean, standard deviation, and correlation). Looking at those numbers alone, one might conclude that the four datasets are similar. Once the data are plotted, however, it becomes clear that they are very different.
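The quartet ships with base R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`), so the claim is easy to check. The following is a minimal sketch; the name `quartet_stats` is chosen here for illustration and is not part of any package.

```r
# `anscombe` is part of base R's datasets package: columns x1..x4 and y1..y4.
# Build one row of summary statistics per x/y pair.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set       = i,
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
}))

# Round for display; the four rows should look almost interchangeable.
print(round(quartet_stats, 2))
```

Plotting each pair (for example, `plot(anscombe$x1, anscombe$y1)`) shows how different the four sets really are.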

Never Trust Summary Statistics Alone

In this section, we examine the datasaurus_dozen dataset from the datasauRus package, a larger collection built in the same spirit as Anscombe’s Quartet (Matejka & Fitzmaurice, 2017).

library(datasauRus)
library(dplyr)
library(ggplot2)

We begin by examining the basic structure of the data. Each row holds a dataset label and a single pair of \(x\) and \(y\) values.

head(datasaurus_dozen)
## # A tibble: 6 x 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 dino     55.4  97.2
## 2 dino     51.5  96.0
## 3 dino     46.2  94.5
## 4 dino     42.8  91.4
## 5 dino     40.8  88.3
## 6 dino     38.7  84.9

In all, the data contain 13 distinct datasets, each with its own \(x\) and \(y\) values.

unique(datasaurus_dozen$dataset)
##  [1] "dino"       "away"       "h_lines"    "v_lines"    "x_shape"   
##  [6] "star"       "high_lines" "dots"       "circle"     "bullseye"  
## [11] "slant_up"   "slant_down" "wide_lines"
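It can also help to see how many points sit under each label before summarizing. The sketch below assumes the datasauRus and dplyr packages are installed; `points_per_dataset` is a name introduced here for illustration.

```r
library(datasauRus)
library(dplyr)

# One row per dataset label; `n` is the number of (x, y) points under it.
points_per_dataset <- datasaurus_dozen %>%
  count(dataset)

points_per_dataset
```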

We now compute summary statistics for each dataset.

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
      mean_x    = mean(x),
      mean_y    = mean(y),
      std_dev_x = sd(x),
      std_dev_y = sd(y),
      corr_x_y  = cor(x, y)
  )
## # A tibble: 13 x 6
##    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
##    <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 away         54.3   47.8      16.8      26.9  -0.0641
##  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
##  3 circle       54.3   47.8      16.8      26.9  -0.0683
##  4 dino         54.3   47.8      16.8      26.9  -0.0645
##  5 dots         54.3   47.8      16.8      26.9  -0.0603
##  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
##  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
##  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
##  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
## 10 star         54.3   47.8      16.8      26.9  -0.0630
## 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
## 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
## 13 x_shape      54.3   47.8      16.8      26.9  -0.0656

We quickly notice that, at the displayed precision, every dataset has the same mean of \(x\), mean of \(y\), standard deviation of \(x\), and standard deviation of \(y\). The correlations between \(x\) and \(y\) are nearly identical too: all fall between -0.07 and -0.06.
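Rather than comparing the rounded table by eye, we can measure how far apart the statistics actually are. This is a minimal sketch, assuming dplyr 1.0.0 or later (for `across()`); `stats` and `stat_spread` are names introduced here for illustration.

```r
library(datasauRus)
library(dplyr)

# Per-dataset statistics, computed the same way as in the summary above.
stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )

# For each statistic, the gap between its largest and smallest value
# across the 13 datasets. Small gaps mean close agreement.
stat_spread <- stats %>%
  summarize(across(-dataset, ~ max(.x) - min(.x)))

stat_spread
```

A spread near zero in every column confirms that these five numbers cannot tell the datasets apart.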

Basic Plots to the Rescue

We plot each dataset in its own panel below. The plots look nothing alike: they are far more dissimilar than the summary statistics would suggest.

ggplot(datasaurus_dozen, aes(x=x, y=y, colour=dataset))+
  geom_point()+
  theme_void()+
  theme(legend.position = "none")+
  facet_wrap(~dataset, ncol=3)

Citations

Matejka, J. & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM Press.