Humans are good at understanding patterns, given some assistance. Statistics that summarize a data set can hide differences. Francis Anscombe, an English statistician, designed a data set, now known as Anscombe's quartet, to illustrate how data visualization reveals differences between data sets whose summary statistics match.
In R, the Anscombe data is included in the base datasets package as `anscombe`. Looking at the raw data, the differences between the four sets are not easy to detect.
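| | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 6 | 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 8 | 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 11 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |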
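The listing above can be reproduced with a couple of lines. This is a minimal sketch that relies only on the `anscombe` data frame from the datasets package, which R attaches by default:

```r
# 'anscombe' ships with the 'datasets' package, which is attached by default
data(anscombe)

# 11 observations of 8 numeric variables: x1..x4 and y1..y4
str(anscombe)

# Print the full data frame
print(anscombe)
```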
To make the subsets more manageable, we can create separate rows for each set, add a column to identify the set and combine the rows. A few sample rows are printed to show the data structure:
| | x | y | group |
|---|---|---|---|
| 1 | 10 | 8.04 | x1_y1 |
| 12 | 10 | 9.14 | x2_y2 |
| 23 | 10 | 7.46 | x3_y3 |
| 34 | 8 | 6.58 | x4_y4 |
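One way to build this long format is sketched below using base R only. The name `anscombe_long` is illustrative and not taken from the original code, which may well use a different approach (for example a tidyverse reshape):

```r
# Stack the four (x, y) pairs into one long data frame with a group label.
# 'anscombe_long' is a hypothetical name chosen for this sketch.
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]],
    group = paste0("x", i, "_y", i),
    stringsAsFactors = FALSE
  )
}))

# Show the first row of each subset (rows 1, 12, 23 and 34 of 44)
anscombe_long[c(1, 12, 23, 34), ]
```

Each subset contributes 11 rows, so the combined data frame has 44 rows and the `group` column says which set a row came from.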
The summary statistics match across the subsets, to the precision shown: the mean and variance of x and of y, and the correlation between x and y.
| group | Mean x | Sample variance x | Mean y | Sample variance y | Correlation between x and y |
|---|---|---|---|---|---|
| x1_y1 | 9 | 11 | 7.5 | 4.1 | 0.82 |
| x2_y2 | 9 | 11 | 7.5 | 4.1 | 0.82 |
| x3_y3 | 9 | 11 | 7.5 | 4.1 | 0.82 |
| x4_y4 | 9 | 11 | 7.5 | 4.1 | 0.82 |
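A table like this can be computed directly from the wide `anscombe` data frame. The following is a sketch in base R; `summary_by_set` and its column names are illustrative choices, not the names used in the original code:

```r
# Compute the same five summary statistics for each of the four sets.
# 'summary_by_set' and its column names are hypothetical names for this sketch.
summary_by_set <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    group  = paste0("x", i, "_y", i),
    mean_x = mean(x),
    var_x  = var(x),      # var() returns the sample variance (n - 1 denominator)
    mean_y = mean(y),
    var_y  = var(y),
    cor_xy = cor(x, y),   # Pearson correlation by default
    stringsAsFactors = FALSE
  )
}))

# Print with reduced precision for display
print(summary_by_set, digits = 2)
```

The unrounded values differ slightly between sets in the third decimal place or beyond, which is why the match is stated at the rounded level.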
The similarity at a summary level goes even further. If we fit a linear model to each subset, the regression line is the same for each, to the rounding shown:
| Data set | Linear regression |
|---|---|
| x1_y1 | y = 3 + 0.5x |
| x2_y2 | y = 3 + 0.5x |
| x3_y3 | y = 3 + 0.5x |
| x4_y4 | y = 3 + 0.5x |
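The fits can be checked with `lm()`. This sketch builds each formula with `reformulate()` and collects the coefficients into a matrix; the names `fits` and `coefs` are illustrative:

```r
# Fit y_i ~ x_i for each of the four sets.
# 'fits' and 'coefs' are hypothetical names for this sketch.
fits <- lapply(1:4, function(i) {
  f <- reformulate(paste0("x", i), response = paste0("y", i))  # e.g. y1 ~ x1
  lm(f, data = anscombe)
})

# One row per set, with the intercept and slope as columns
coefs <- t(sapply(fits, coef))
dimnames(coefs) <- list(paste0("x", 1:4, "_y", 1:4), c("intercept", "slope"))

# Rounded for display
round(coefs, 2)
```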
But when we plot the data points, it’s immediately clear that each subset has a substantially different pattern!
(The blue line is the linear regression.)
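A chart of this kind can be sketched with base R graphics: one scatter plot per set, each with its fitted line in blue. This is an assumption about the layout, not the original plotting code, which may use a different package; the axis limits are chosen here only so that the four panels share a scale:

```r
# Draw a 2 x 2 grid of scatter plots, one per set, with the fitted line in blue.
# par() returns the previous settings so they can be restored afterwards.
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y,
       main = paste0("x", i, "_y", i),
       xlim = c(3, 20), ylim = c(2, 14),  # shared axes across panels
       pch = 19)
  abline(lm(y ~ x), col = "blue")          # linear regression line
}

# Restore the previous graphics settings
par(op)
```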
The Anscombe data is cleverly designed to make a point about the variation that can hide behind summary statistics. Looking at the data is an important part of data analysis.
The example worked here is highly derivative. The calculations have been prepared and charts produced by many people since Anscombe published his work. Code is on GitHub.