Data 621 Blog 1

Alexander Ng

02/11/2020

Introduction

This blog post demonstrates why graphical visualization matters alongside statistical summaries when trying to understand a data set.

Begin With the Familiar

The most famous example of how different data sets can have the same statistical summaries but different graphical structure is Anscombe’s quartet. The quartet was published by statistician Frank Anscombe in 1973.

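library(dplyr); library(ggplot2); library(knitr); library(kableExtra)  # packages used throughout this post
anscombe %>% knitr::kable() %>% kable_styling(bootstrap_options = c("striped"), font_size = 17)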
x1 x2 x3 x4 y1 y2 y3 y4
10 10 10 8 8.04 9.14 7.46 6.58
8 8 8 8 6.95 8.14 6.77 5.76
13 13 13 8 7.58 8.74 12.74 7.71
9 9 9 8 8.81 8.77 7.11 8.84
11 11 11 8 8.33 9.26 7.81 8.47
14 14 14 8 9.96 8.10 8.84 7.04
6 6 6 8 7.24 6.13 6.08 5.25
4 4 4 19 4.26 3.10 5.39 12.50
12 12 12 8 10.84 9.13 8.15 5.56
7 7 7 8 4.82 7.26 6.42 7.91
5 5 5 8 5.68 4.74 5.73 6.89

Next, we reorganize the named columns into a long format grouped by plot. Observe that x1 is paired with y1, x2 with y2, and so on, so each y column sits four columns to the right of its paired x column.

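# Stack the four (x_j, y_j) pairs into one long data frame, labelled by data set number.
ds <- do.call(rbind, lapply(1:4, function(j) {
  data.frame(number = j, x = anscombe[, j], y = anscombe[, j + 4])
}))
# Draw one scatterplot panel per data set (shape 17 is a solid triangle and takes no fill).
ggplot(ds, aes(x, y)) + geom_point(color = "red", shape = 17) + facet_wrap(~number)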

Now we show that the statistical summaries of these four data sets are nearly identical: every mean, variance, and correlation differs by less than 0.01 across the data sets.

ds %>% group_by(number) %>% summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
  kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size = 17 )
number mean(x) mean(y) var(x) var(y) cor(x, y)
1 9 7.501 11 4.127 0.816
2 9 7.501 11 4.128 0.816
3 9 7.500 11 4.123 0.816
4 9 7.501 11 4.123 0.817
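To make the comparison above less dependent on eyeballing a table, the spread of each statistic across the four data sets can be computed directly. The following is a minimal sketch in base R; the names quartet, stats, and spread are my own and are not part of the analysis above.

```r
# Stack the quartet into long form (same layout as ds above).
quartet <- do.call(rbind, lapply(1:4, function(j) {
  data.frame(number = j, x = anscombe[, j], y = anscombe[, j + 4])
}))

# One row of summaries per data set, one column per statistic.
stats <- t(sapply(split(quartet, quartet$number), function(d) {
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    var_x = var(d$x), var_y = var(d$y), cor_xy = cor(d$x, d$y))
}))

# Largest gap between any two data sets, per statistic.
spread <- apply(stats, 2, function(v) max(v) - min(v))
print(round(spread, 4))
```

A small spread in every column confirms that the summaries alone cannot tell the four data sets apart.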

DataSaurus

I learned that researchers have found a way to generalize Anscombe's quartet to much more complex shapes.

This collection of datasets, called the Datasaurus, is available as an R package called datasauRus. We begin by loading it and showing that its 13 datasets have nearly the same statistical summaries.

library(datasauRus)

datasaurus_dozen %>% group_by(dataset) %>%
  summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
  kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size=17)
dataset mean(x) mean(y) var(x) var(y) cor(x, y)
away 54.266 47.835 281.227 725.750 -0.064
bullseye 54.269 47.831 281.207 725.533 -0.069
circle 54.267 47.838 280.898 725.227 -0.068
dino 54.263 47.832 281.070 725.516 -0.064
dots 54.260 47.840 281.157 725.235 -0.060
h_lines 54.261 47.830 281.095 725.757 -0.062
high_lines 54.269 47.835 281.122 725.763 -0.069
slant_down 54.268 47.836 281.124 725.554 -0.069
slant_up 54.266 47.831 281.194 725.689 -0.069
star 54.267 47.840 281.198 725.240 -0.063
v_lines 54.270 47.837 281.232 725.639 -0.069
wide_lines 54.267 47.832 281.233 725.651 -0.067
x_shape 54.260 47.840 281.231 725.225 -0.066

However, when plotted, the datasets look entirely different and unrelated. We illustrate this with scatterplots.

ggplot(datasaurus_dozen, aes(x=x, y=y, color=dataset)) +
  geom_point() + theme_void()+ facet_wrap(~dataset, ncol = 4)

The trick to creating these datasets is described in the paper by Justin Matejka and George Fitzmaurice, “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing”. They use simulated annealing, a Monte Carlo-based technique, to perturb the points of an existing dataset slightly at each step while keeping its summary statistics unchanged. The perturbations are steered toward a target shape, so the intermediate datasets morph gradually from the starting shape to the final one (as shown).
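The perturbation idea can be sketched in a few lines of base R. This is only a rough illustration under my own assumptions, not the authors' implementation: the helper names (summary_stats, circle_fit, perturb), the circular target, the step size, and the linear cooling schedule are all hypothetical choices made for the example.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Statistics to preserve, rounded to two decimal places.
summary_stats <- function(d) {
  round(c(mean(d$x), mean(d$y), sd(d$x), sd(d$y), cor(d$x, d$y)), 2)
}

# Hypothetical target shape: a circle of radius r centred at (cx, cy).
# Returns the mean distance of the points from the circle (lower is better).
circle_fit <- function(d, cx = 54, cy = 48, r = 30) {
  mean(abs(sqrt((d$x - cx)^2 + (d$y - cy)^2) - r))
}

perturb <- function(d, n_iter = 5000, step = 0.3,
                    temp_start = 0.4, temp_end = 0.01) {
  target <- summary_stats(d)
  for (i in seq_len(n_iter)) {
    # Assumed linear cooling from temp_start down to temp_end.
    temp <- temp_start + (temp_end - temp_start) * i / n_iter
    # Nudge one randomly chosen point by a small Gaussian step.
    k <- sample(nrow(d), 1)
    cand <- d
    cand$x[k] <- cand$x[k] + rnorm(1, sd = step)
    cand$y[k] <- cand$y[k] + rnorm(1, sd = step)
    same_stats <- all(summary_stats(cand) == target)
    improves <- circle_fit(cand) < circle_fit(d)
    # Keep the move only if the rounded statistics are unchanged and the
    # fit improves, or the current temperature lets a worse move through.
    if (same_stats && (improves || runif(1) < temp)) d <- cand
  }
  d
}

# Start from a random cloud and push it toward the circle.
start <- data.frame(x = rnorm(100, 54, 16), y = rnorm(100, 48, 26))
result <- perturb(start)
print(rbind(start = summary_stats(start), result = summary_stats(result)))
```

Because a move is rejected whenever it changes a rounded statistic, the two printed rows match by construction, while the points themselves drift toward the target shape. Plotting start and result side by side with ggplot shows how far the cloud has moved.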