Simple analysis of Anscombe’s quartet data

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

The data

First, load and view the data.

library(Tmisc)
data(quartet)
str(quartet)
## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...
set x y
I 10 8.04
I 8 6.95
I 13 7.58
I 9 8.81
I 11 8.33
I 14 9.96
I 6 7.24
I 4 4.26
I 12 10.84
I 7 4.82
I 5 5.68
II 10 9.14
II 8 8.14
II 13 8.74
II 9 8.77
II 11 9.26
II 14 8.10
II 6 6.13
II 4 3.10
II 12 9.13
II 7 7.26
II 5 4.74
III 10 7.46
III 8 6.77
III 13 12.74
III 9 7.11
III 11 7.81
III 14 8.84
III 6 6.08
III 4 5.39
III 12 8.15
III 7 6.42
III 5 5.73
IV 8 6.58
IV 8 5.76
IV 8 7.71
IV 8 8.84
IV 8 8.47
IV 8 7.04
IV 8 5.25
IV 19 12.50
IV 8 5.56
IV 8 7.91
IV 8 6.89

Simple statistics

Now, compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.

library(dplyr)
quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, 
    y))
## Source: local data frame [4 x 6]
## 
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817

Visualization

Finally, plot y versus x for each set with a linear regression trendline displayed on each plot:

library(ggplot2)
ggplot(quartet, aes(x, y)) + geom_point() + geom_smooth(method = lm, se = FALSE) + 
    facet_wrap(~set)