It can be difficult to demonstrate the importance of data visualization. The truth is that summary statistics alone are often insufficient to tell a complete (and true) story.
Developed by F.J. Anscombe in 1973, Anscombe’s Quartet consists of four datasets with nearly identical summary statistics (mean, standard deviation, and correlation). Looking only at these numbers, one might wrongly conclude that the four datasets are similar. Once the data are plotted, however, it becomes clear that they are very different.
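As a quick illustration of the claim above, the quartet can be summarized with base R alone. This is a minimal sketch that assumes the built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`) from R's `datasets` package; the helper `pair_stat` and the name `quartet_stats` are ours, chosen only for this example.

```r
# `anscombe` ships with base R (datasets package): columns x1..x4 and y1..y4.
# Hypothetical helper: apply a two-argument function to the i-th (x, y) pair.
pair_stat <- function(f) {
  sapply(1:4, function(i) {
    f(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
}

quartet_stats <- data.frame(
  set      = 1:4,
  mean_x   = pair_stat(function(x, y) mean(x)),
  mean_y   = pair_stat(function(x, y) mean(y)),
  sd_x     = pair_stat(function(x, y) sd(x)),
  sd_y     = pair_stat(function(x, y) sd(y)),
  corr_x_y = pair_stat(function(x, y) cor(x, y))
)

# Rounded to two decimals, the four rows should look essentially the same.
round(quartet_stats, 2)
```

Plotting each pair, for example with `plot(anscombe$x1, anscombe$y1)`, is what reveals how different the four sets really are.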
In this section, we examine the datasaurus_dozen data set, available in the datasauRus package, which makes the same point on a larger scale.
library(datasauRus)
library(dplyr)
library(ggplot2)
We begin by examining the basic structure of the data. Each row holds a data set label (dataset) and one pair of \(x\) and \(y\) values belonging to that data set.
head(datasaurus_dozen)
## # A tibble: 6 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
In all, the data contain 13 distinct data sets, each with its own \(x\) and \(y\) values.
unique(datasaurus_dozen$dataset)
## [1] "dino" "away" "h_lines" "v_lines" "x_shape"
## [6] "star" "high_lines" "dots" "circle" "bullseye"
## [11] "slant_up" "slant_down" "wide_lines"
We will now summarize the data for each data set.
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y = cor(x, y)
  )
## # A tibble: 13 x 6
## dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 47.8 16.8 26.9 -0.0641
## 2 bullseye 54.3 47.8 16.8 26.9 -0.0686
## 3 circle 54.3 47.8 16.8 26.9 -0.0683
## 4 dino 54.3 47.8 16.8 26.9 -0.0645
## 5 dots 54.3 47.8 16.8 26.9 -0.0603
## 6 h_lines 54.3 47.8 16.8 26.9 -0.0617
## 7 high_lines 54.3 47.8 16.8 26.9 -0.0685
## 8 slant_down 54.3 47.8 16.8 26.9 -0.0690
## 9 slant_up 54.3 47.8 16.8 26.9 -0.0686
## 10 star 54.3 47.8 16.8 26.9 -0.0630
## 11 v_lines 54.3 47.8 16.8 26.9 -0.0694
## 12 wide_lines 54.3 47.8 16.8 26.9 -0.0666
## 13 x_shape 54.3 47.8 16.8 26.9 -0.0656
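We quickly notice that, at the precision displayed, every data set has the same mean \(x\), mean \(y\), standard deviation of \(x\), and standard deviation of \(y\). The correlations between \(x\) and \(y\) are not exactly equal, but they are all weakly negative and fall within a narrow band, from about \(-0.060\) to \(-0.069\).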
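To put a number on how close these statistics are, one option is to compute the spread (maximum minus minimum) of each statistic across the data sets. The sketch below assumes dplyr 1.0 or later for `across()`; the names `dataset_stats` and `stat_spread` are ours and do not come from the datasauRus package.

```r
library(datasauRus)
library(dplyr)

# Recompute the per-data-set summary so this block runs on its own.
dataset_stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y = cor(x, y)
  )

# Spread of each statistic across the data sets: max minus min, per column.
stat_spread <- dataset_stats %>%
  summarize(across(-dataset, ~ max(.x) - min(.x)))

stat_spread
```

Small values in every column of `stat_spread` confirm that the summary table alone gives no hint of any difference between the data sets.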
We plot the data below. Notice that each plot shows a distinct pattern, and that the data sets are far more dissimilar than the summary statistics would suggest.
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(~dataset, ncol = 3)
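To compare individual data sets more closely, the same plot can be restricted to a few labels and drawn with its axes visible. This is only a sketch: the pair chosen here ("dino" and "star") is an arbitrary illustration, and `pair` and `pair_plot` are names we introduce for the example.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

# Illustrative choice; any labels from unique(datasaurus_dozen$dataset) work.
pair <- c("dino", "star")

pair_plot <- datasaurus_dozen %>%
  filter(dataset %in% pair) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  coord_fixed() +                 # same scale on both axes
  facet_wrap(~dataset, ncol = 2)

pair_plot
```

Keeping the axes (instead of `theme_void()`) makes it easier to see that the panels cover a similar range of \(x\) and \(y\) values even though their shapes differ.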
Matejka, J. & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM Press.