Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics (mean, standard deviation, and correlation), yet look very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
First, load and view the data.
library(Tmisc)
data(quartet)
str(quartet)
## 'data.frame': 44 obs. of 3 variables:
## $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ x : int 10 8 13 9 11 14 6 4 12 7 ...
## $ y : num 8.04 6.95 7.58 8.81 8.33 ...
set | x | y |
---|---|---|
I | 10 | 8.04 |
I | 8 | 6.95 |
I | 13 | 7.58 |
I | 9 | 8.81 |
I | 11 | 8.33 |
I | 14 | 9.96 |
I | 6 | 7.24 |
I | 4 | 4.26 |
I | 12 | 10.84 |
I | 7 | 4.82 |
I | 5 | 5.68 |
II | 10 | 9.14 |
II | 8 | 8.14 |
II | 13 | 8.74 |
II | 9 | 8.77 |
II | 11 | 9.26 |
II | 14 | 8.10 |
II | 6 | 6.13 |
II | 4 | 3.10 |
II | 12 | 9.13 |
II | 7 | 7.26 |
II | 5 | 4.74 |
III | 10 | 7.46 |
III | 8 | 6.77 |
III | 13 | 12.74 |
III | 9 | 7.11 |
III | 11 | 7.81 |
III | 14 | 8.84 |
III | 6 | 6.08 |
III | 4 | 5.39 |
III | 12 | 8.15 |
III | 7 | 6.42 |
III | 5 | 5.73 |
IV | 8 | 6.58 |
IV | 8 | 5.76 |
IV | 8 | 7.71 |
IV | 8 | 8.84 |
IV | 8 | 8.47 |
IV | 8 | 7.04 |
IV | 8 | 5.25 |
IV | 19 | 12.50 |
IV | 8 | 5.56 |
IV | 8 | 7.91 |
IV | 8 | 6.89 |
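If Tmisc is not installed, the same values ship with base R as the `anscombe` data frame, stored in wide format (columns `x1` to `x4` and `y1` to `y4`). The following is a minimal sketch of reshaping it into the long layout shown above; the name `quartet_base` is introduced here purely for illustration and is not part of Tmisc.

```r
# Assumption: `anscombe` is the built-in data frame from R's datasets package.
# `quartet_base` is a hypothetical name chosen for this example.
sets <- c("I", "II", "III", "IV")

# Stack the four (x_i, y_i) column pairs into one long data frame.
quartet_base <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = sets[i],
    x   = anscombe[[paste0("x", i)]],
    y   = anscombe[[paste0("y", i)]]
  )
}))

# Store the set label as a factor with levels in the order I, II, III, IV.
quartet_base$set <- factor(quartet_base$set, levels = sets)

str(quartet_base)
```

One small difference from the Tmisc version: here `x` is stored as numeric rather than integer, which does not affect any of the summaries or plots.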
Now, compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.
library(dplyr)
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y))
## Source: local data frame [4 x 6]
##
## set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1 I 9 3.32 7.5 2.03 0.816
## 2 II 9 3.32 7.5 2.03 0.816
## 3 III 9 3.32 7.5 2.03 0.816
## 4 IV 9 3.32 7.5 2.03 0.817
Finally, plot y versus x for each set with a linear regression trendline displayed on each plot:
library(ggplot2)
ggplot(quartet, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_wrap(~set)
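The trendlines in the four panels can also be checked numerically. As a rough sketch, the code below fits a separate least-squares line to each set and prints the coefficients. To stay runnable without extra packages it reads the values from base R's built-in `anscombe` data frame (the same numbers in wide format) rather than from `quartet`; the name `fits` is an illustrative choice.

```r
# Assumption: `anscombe` is the built-in wide-format data frame from R's
# datasets package, with columns x1..x4 and y1..y4.
fits <- sapply(1:4, function(i) {
  d <- data.frame(
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
  coef(lm(y ~ x, data = d))  # intercept and slope for set i
})
colnames(fits) <- c("I", "II", "III", "IV")

# Rows are the intercept and the slope; columns are the four sets.
round(fits, 2)
```

All four fits come out at roughly y = 3.00 + 0.50x, so the fitted lines are as indistinguishable as the summary statistics, even though the point clouds they describe are clearly not.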