This blog post demonstrates why graphical visualization matters, alongside statistical summaries, for understanding a data set.
The most famous example of how different data sets can have the same statistical summaries but different graphical structure is Anscombe’s quartet. The quartet was published by statistician Frank Anscombe in 1973.
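library(dplyr); library(ggplot2); library(knitr); library(kableExtra)  # packages used throughout this post
anscombe %>% knitr::kable() %>% kable_styling(bootstrap_options = c("striped"), font_size = 17)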
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
Next, we reorganize the named columns into a long format grouped by plot. Note that x1 is paired with y1, x2 with y2, and so on, so the paired columns sit 4 positions apart in the data frame.
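# Stack the four (x_j, y_j) pairs into one long data frame, tagged by set number
ds <- data.frame()
for (j in 1:4) {
  ds <- rbind(ds, data.frame(number = j, x = anscombe[, j], y = anscombe[, j + 4]))
}
# One scatterplot panel per set (shape 17 is a solid triangle, so only color applies)
ggplot(ds, aes(x, y)) + geom_point(color = "red", shape = 17) + facet_wrap(~number)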
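Now we show that the statistical summaries of these 4 data sets are nearly identical: the means and correlations agree to 2 decimal places, and the variances of y differ only in the third decimal place.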
ds %>% group_by(number) %>% summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size = 17 )
| number | mean(x) | mean(y) | var(x) | var(y) | cor(x, y) |
|---|---|---|---|---|---|
| 1 | 9 | 7.501 | 11 | 4.127 | 0.816 |
| 2 | 9 | 7.501 | 11 | 4.128 | 0.816 |
| 3 | 9 | 7.500 | 11 | 4.123 | 0.816 |
| 4 | 9 | 7.501 | 11 | 4.123 | 0.817 |
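The similarity extends beyond the summaries in the table. As a minimal sketch using base R only (the names `fits` and `coefs` are mine, not part of any package), we can fit a separate least-squares line to each of the four sets and compare the coefficients. Rounded to 2 decimal places, every set gives roughly the same line, y = 3.00 + 0.50x.

```r
# Fit y ~ x separately for each of the four Anscombe sets
fits <- lapply(1:4, function(j) {
  lm(anscombe[[paste0("y", j)]] ~ anscombe[[paste0("x", j)]])
})

# Collect intercept and slope into a 4 x 2 matrix, one row per set
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
coefs_rounded <- round(coefs, 2)
print(coefs_rounded)
```

So a regression line alone would not distinguish the four sets either; only the plots do.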
I learned that some researchers have discovered a way to generalize the Anscombe quartet to much more complex shapes.
The resulting sample data, called the Datasaurus, is available as an R package called datasauRus. We begin by loading the package and showing that its datasets have nearly the same statistical summaries.
library(datasauRus)
datasaurus_dozen %>% group_by(dataset) %>%
summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size=17)
| dataset | mean(x) | mean(y) | var(x) | var(y) | cor(x, y) |
|---|---|---|---|---|---|
| away | 54.266 | 47.835 | 281.227 | 725.750 | -0.064 |
| bullseye | 54.269 | 47.831 | 281.207 | 725.533 | -0.069 |
| circle | 54.267 | 47.838 | 280.898 | 725.227 | -0.068 |
| dino | 54.263 | 47.832 | 281.070 | 725.516 | -0.064 |
| dots | 54.260 | 47.840 | 281.157 | 725.235 | -0.060 |
| h_lines | 54.261 | 47.830 | 281.095 | 725.757 | -0.062 |
| high_lines | 54.269 | 47.835 | 281.122 | 725.763 | -0.069 |
| slant_down | 54.268 | 47.836 | 281.124 | 725.554 | -0.069 |
| slant_up | 54.266 | 47.831 | 281.194 | 725.689 | -0.069 |
| star | 54.267 | 47.840 | 281.198 | 725.240 | -0.063 |
| v_lines | 54.270 | 47.837 | 281.232 | 725.639 | -0.069 |
| wide_lines | 54.267 | 47.832 | 281.233 | 725.651 | -0.067 |
| x_shape | 54.260 | 47.840 | 281.231 | 725.225 | -0.066 |
However, when plotted, the datasets look completely different and unrelated. We illustrate this with the scatterplots.
ggplot(datasaurus_dozen, aes(x=x, y=y, color=dataset)) +
geom_point() + theme_void()+ facet_wrap(~dataset, ncol = 4)
The trick to creating these datasets is described in the paper by Justin Matejka and George Fitzmaurice, “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing”. They used simulated annealing, a Monte Carlo based technique, to perturb an existing graphical shape in small random steps while keeping its summary statistics unchanged. The intermediate shapes are guided from a starting object to a terminal shape (as shown).
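To make the idea concrete, here is a toy sketch of this kind of perturbation loop in base R. It is not the authors' code. The function names (`stats2`, `dist_to_circle`), the circular target, the random starting cloud, the step size and the linear cooling schedule are all assumptions chosen for illustration. A move is kept only if the summary statistics, rounded to 2 decimal places, stay unchanged, and if the point either moves closer to the target shape or the current "temperature" permits a random move.

```r
set.seed(42)  # reproducible toy run

# Summary statistics to preserve, rounded to 2 decimal places
stats2 <- function(d) {
  round(c(mean(d$x), mean(d$y), sd(d$x), sd(d$y), cor(d$x, d$y)), 2)
}

# Distance from a point to the target shape; the circle is an assumed target
dist_to_circle <- function(x, y, cx = 54, cy = 48, r = 30) {
  abs(sqrt((x - cx)^2 + (y - cy)^2) - r)
}

# Starting cloud: random points (an assumption; any scatterplot would do)
d <- data.frame(x = rnorm(100, mean = 54, sd = 16),
                y = rnorm(100, mean = 48, sd = 27))
start_stats <- stats2(d)
start_dist  <- mean(dist_to_circle(d$x, d$y))

n_iter <- 20000
for (k in seq_len(n_iter)) {
  temp <- 0.4 * (1 - k / n_iter)           # cooling schedule (assumed linear)
  i <- sample(nrow(d), 1)                  # pick one point at random
  new <- d
  new$x[i] <- d$x[i] + rnorm(1, sd = 0.1)  # nudge it slightly
  new$y[i] <- d$y[i] + rnorm(1, sd = 0.1)
  closer <- dist_to_circle(new$x[i], new$y[i]) < dist_to_circle(d$x[i], d$y[i])
  # Keep the move only if it helps (or the temperature allows a random move)
  # AND the rounded statistics are unchanged
  if ((closer || runif(1) < temp) && identical(stats2(new), start_stats)) {
    d <- new
  }
}

end_stats <- stats2(d)
end_dist  <- mean(dist_to_circle(d$x, d$y))
```

By construction, `end_stats` equals `start_stats`, since every accepted move passes the same check. Comparing `start_dist` with `end_dist`, or plotting `d` before and after, shows how far the cloud has drifted toward the circle in a given run.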