November 13, 2016
If you’re looking at this, I suppose you must’ve heard Anscombe’s Quartet before.
But just in case, if you don’t know what that is, you can find it out here.
- Original Data Set
The original data from Anscombe’s Quartet looks like this:
AnscombesQuartet <- read.csv("Anscombe's Quartet.csv")| Anscombe’s Quartet Original Data | ||||||||||
| group I | group II | group III | group IV | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| x_1 | y_1 | x_2 | y_2 | x_3 | y_3 | x_4 | y_4 | |||
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 | |||
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 | |||
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 | |||
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 | |||
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 | |||
| 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 | |||
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 | |||
| 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 8 | 5.56 | |||
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 19 | 12.5 | |||
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 | |||
| 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 | |||
- Mean & Variance
AnscombesQuartet.sum <- AnscombesQuartet %>%
summarise_each(funs(mean, variance = var)) %>%
gather(key = sum_type, value = value) %>%
separate(col = sum_type, into = c("variable", "group","sum_type")) %>%
spread(key = sum_type, value = value) %>%
unite(variable, variable, group)| Mean & Variance of Each Variable in Different Groups | |||
| variable | mean | variance | |
|---|---|---|---|
| x_1 | 9 | 11 | |
| x_2 | 9 | 11 | |
| x_3 | 9 | 11 | |
| x_4 | 9 | 11 | |
| y_1 | 7.5 | 4.1273 | |
| y_2 | 7.5 | 4.1276 | |
| y_3 | 7.5 | 4.1226 | |
| y_4 | 7.5 | 4.1232 | |
The mean and variance of x and y in different groups are either the same or very close.
- Correlation
cor.fun <- function(n) {
x <- paste("x", n, sep = "_")
y <- paste("y", n, sep = "_")
x_col <- AnscombesQuartet[, x]
y_col <- AnscombesQuartet[, y]
correl <- cor(x_col, y_col)
return(correl)
}
corr <- sapply(1:4, cor.fun)| Correlation of x and y within Each Group | |||
| group I | group II | group III | group IV |
|---|---|---|---|
| 0.8164 | 0.8162 | 0.8163 | 0.8165 |
The correlation of x and y within each group are very close too.
- Linear Regression
lm.fun <- function(n) {
x <- paste("x", n, sep = "_")
y <- paste("y", n, sep = "_")
x_col <- AnscombesQuartet[, x]
y_col <- AnscombesQuartet[, y]
lm_model <- lm(y_col ~ x_col)
intercept <- round(lm_model$coef[1], 0)
slope <- round(lm_model$coef[2], 1)
return(paste0("y = ", intercept, " + ", slope, "*x"))
}
reg <- sapply(1:4, lm.fun)| Linear Regression Equation of Each Group | |||
| group I | group II | group III | group IV |
|---|---|---|---|
| y = 3 + 0.5*x | y = 3 + 0.5*x | y = 3 + 0.5*x | y = 3 + 0.5*x |
Linear models show no difference between groups.
Finally, let’s make some plots!
But before that, remember to transform the data into tidy format.
AnscombesQuartet.x <- AnscombesQuartet %>%
select(contains("x")) %>%
gather(key = key, value = x) %>%
separate(col = key, into = c("variable", "group")) %>%
select(-variable)
AnscombesQuartet.y <- AnscombesQuartet %>%
select(contains("y")) %>%
gather(key = key, value = y) %>%
select(-key)
AnscombesQuartet.tidy <- cbind(AnscombesQuartet.x, AnscombesQuartet.y)g <- ggplot(aes(x = x, y = y), data = AnscombesQuartet.tidy) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, size = 0.5) +
facet_wrap(~group, ncol = 2) +
ggtitle("Anscombe's Quartet")
ggplotly(g)Despite the summary statistics of the data are pretty much the same, the plots here are telling us a different story: