Introduction

This blog post demonstrates why graphical visualization matters alongside statistical summaries when trying to understand a data set.

Begin With the Familiar

The most famous example of how different data sets can have the same statistical summaries but different graphical structure is Anscombe’s quartet. The quartet was published by statistician Frank Anscombe in 1973.

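library(dplyr)       # %>%, group_by(), summarize()
library(ggplot2)     # ggplot(), geom_point(), facet_wrap()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

anscombe %>% knitr::kable() %>% kable_styling(bootstrap_options = c("striped"), font_size = 17)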

x1	x2	x3	x4	y1	y2	y3	y4
10	10	10	8	8.04	9.14	7.46	6.58
8	8	8	8	6.95	8.14	6.77	5.76
13	13	13	8	7.58	8.74	12.74	7.71
9	9	9	8	8.81	8.77	7.11	8.84
11	11	11	8	8.33	9.26	7.81	8.47
14	14	14	8	9.96	8.10	8.84	7.04
6	6	6	8	7.24	6.13	6.08	5.25
4	4	4	19	4.26	3.10	5.39	12.50
12	12	12	8	10.84	9.13	8.15	5.56
7	7	7	8	4.82	7.26	6.42	7.91
5	5	5	8	5.68	4.74	5.73	6.89

Next, we reshape the eight named columns into a long format with one row per point, labeled by data set, so that each data set can be plotted separately. Column x1 pairs with y1, x2 with y2, and so on, which means each y column sits 4 positions after its matching x column.

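# Stack the four (x_j, y_j) pairs into one long data frame, labeled by data set number
ds <- data.frame()
for (j in 1:4) {
  ds <- rbind(ds, data.frame(number = j, x = anscombe[, j], y = anscombe[, j + 4]))
}
# One scatterplot panel per data set (shape 17 is a solid triangle, colored by `color`)
ggplot(ds, aes(x, y)) + geom_point(color = "red", shape = 17) + facet_wrap(~number)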

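Now we show that the statistical summaries of these 4 data sets nearly coincide: the means, the variance of x, and the correlation agree to 2 decimal places, while the variance of y differs only in the second decimal place (4.12 versus 4.13).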

ds %>% group_by(number) %>% summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
  kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size = 17 )

number	mean(x)	mean(y)	var(x)	var(y)	cor(x, y)
1	9	7.501	11	4.127	0.816
2	9	7.501	11	4.128	0.816
3	9	7.500	11	4.123	0.816
4	9	7.501	11	4.123	0.817
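
The quartet is also known for sharing nearly the same least-squares regression line. As a rough check, the sketch below fits lm(y ~ x) to each pair of columns of the built-in anscombe data frame using base R only. The helper name fit_pair is introduced here for illustration and is not part of the original analysis.

```r
# Fit y_j ~ x_j for each of the four Anscombe data sets (base R only).
# fit_pair is a hypothetical helper name used only in this sketch.
fit_pair <- function(j) {
  x <- anscombe[[paste0("x", j)]]
  y <- anscombe[[paste0("y", j)]]
  coef(lm(y ~ x))  # named vector: (Intercept), x
}

# One row per data set, one column per coefficient
coefs <- t(sapply(1:4, fit_pair))
rownames(coefs) <- paste("data set", 1:4)
print(round(coefs, 2))
```

Each fit should come out close to an intercept of 3 and a slope of 0.5, so the fitted line, like the other summaries, cannot tell the four data sets apart.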

DataSaurus

I learned that some researchers have discovered a way to generalize the Anscombe quartet to much more complex shapes.

The resulting collection, the Datasaurus Dozen, is available in the R package datasauRus. We begin by loading it and showing that its 13 data sets have nearly identical statistical summaries.

library(datasauRus)

datasaurus_dozen %>% group_by(dataset) %>%
  summarize( mean(x), mean(y), var(x), var(y), cor(x,y)) %>%
  kable(digits=3) %>% kable_styling(bootstrap_options = c("striped"), font_size=17)

dataset	mean(x)	mean(y)	var(x)	var(y)	cor(x, y)
away	54.266	47.835	281.227	725.750	-0.064
bullseye	54.269	47.831	281.207	725.533	-0.069
circle	54.267	47.838	280.898	725.227	-0.068
dino	54.263	47.832	281.070	725.516	-0.064
dots	54.260	47.840	281.157	725.235	-0.060
h_lines	54.261	47.830	281.095	725.757	-0.062
high_lines	54.269	47.835	281.122	725.763	-0.069
slant_down	54.268	47.836	281.124	725.554	-0.069
slant_up	54.266	47.831	281.194	725.689	-0.069
star	54.267	47.840	281.198	725.240	-0.063
v_lines	54.270	47.837	281.232	725.639	-0.069
wide_lines	54.267	47.832	281.233	725.651	-0.067
x_shape	54.260	47.840	281.231	725.225	-0.066

However, when plotted, the data sets look completely different and unrelated. We illustrate this with scatterplots.

ggplot(datasaurus_dozen, aes(x=x, y=y, color=dataset)) +
  geom_point() + theme_void()+ facet_wrap(~dataset, ncol = 4)

The trick to creating these datasets is described in the paper by Justin Matejka and George Fitzmaurice, “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing”. They use simulated annealing, a Monte Carlo technique, to repeatedly nudge the points of an existing data set by small random amounts, keeping a nudge only if the summary statistics remain unchanged to two decimal places. The intermediate shapes are guided from a starting data set toward a target shape (as shown).
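
To make the perturb-and-check idea concrete, here is a minimal sketch in base R. It is not the authors' implementation and rests on several assumptions of my own: a random normal cloud as the starting data (rather than the dino), a circle as the target shape, a linear cooling schedule, and the names stats2 and dist_to_shape.

```r
# Simplified sketch of a perturb-and-check loop in base R; NOT the authors' code.
set.seed(1)

# Statistics to preserve, rounded to 2 decimal places
stats2 <- function(x, y) round(c(mean(x), mean(y), sd(x), sd(y), cor(x, y)), 2)

n <- 100
x0 <- rnorm(n); y0 <- rnorm(n)           # assumed starting cloud
target <- stats2(x0, y0)

# Assumed target shape: a circle centered on the starting means, with a radius
# chosen to be compatible with the starting variances
cx <- mean(x0); cy <- mean(y0)
r <- sqrt((var(x0) + var(y0)) * (n - 1) / n)
dist_to_shape <- function(x, y) abs(sqrt((x - cx)^2 + (y - cy)^2) - r)

x <- x0; y <- y0
n_iter <- 20000
for (i in seq_len(n_iter)) {
  temp <- 0.4 * (1 - i / n_iter)          # assumed linear cooling schedule
  k <- sample(n, 1)                       # pick one point at random
  nx <- x; ny <- y
  nx[k] <- x[k] + rnorm(1, sd = 0.05)     # small random nudge
  ny[k] <- y[k] + rnorm(1, sd = 0.05)
  closer <- dist_to_shape(nx[k], ny[k]) < dist_to_shape(x[k], y[k])
  # Accept only if the rounded statistics are unchanged AND the point moved
  # toward the shape (or, with probability temp, whatever the direction)
  if ((closer || runif(1) < temp) && all(stats2(nx, ny) == target)) {
    x <- nx; y <- ny
  }
}

# Rounded statistics before and after the annealing loop
print(rbind(before = target, after = stats2(x, y)))
```

Because a move is rejected whenever it changes any rounded statistic, the final points share the starting summaries by construction. Plotting x against y before and after the loop is a quick way to see how far the cloud has drifted toward the circle.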

Data 621 Blog 1

Alexander Ng

02/11/2020
