Anscombe’s Quartet

Prologue

If you’re looking at this, I suppose you must’ve heard Anscombe’s Quartet before.

But just in case, if you don’t know what that is, you can find it out here.

Summary Statistics

Original Data Set

The original data from Anscombe’s Quartet looks like this:

AnscombesQuartet <- read.csv("Anscombe's Quartet.csv")

group I		group II		group III		group IV
Anscombe’s Quartet Original Data
x_1	y_1	x_2	y_2	x_3	y_3	x_4	y_4
10	8.04	10	9.14	10	7.46	8	6.58
8	6.95	8	8.14	8	6.77	8	5.76
13	7.58	13	8.74	13	12.74	8	7.71
9	8.81	9	8.77	9	7.11	8	8.84
11	8.33	11	9.26	11	7.81	8	8.47
14	9.96	14	8.1	14	8.84	8	7.04
6	7.24	6	6.13	6	6.08	8	5.25
4	4.26	4	3.1	4	5.39	8	5.56
12	10.84	12	9.13	12	8.15	19	12.5
7	4.82	7	7.26	7	6.42	8	7.91
5	5.68	5	4.74	5	5.73	8	6.89

Mean & Variance

AnscombesQuartet.sum <- AnscombesQuartet %>%
        summarise_each(funs(mean, variance = var)) %>%
        gather(key = sum_type, value = value) %>%
        separate(col = sum_type, into = c("variable", "group","sum_type")) %>%
        spread(key = sum_type, value = value) %>%
        unite(variable, variable, group)

	variable	mean	variance
Mean & Variance of Each Variable in Different Groups
	x_1	9	11
	x_2	9	11
	x_3	9	11
	x_4	9	11
	y_1	7.5	4.1273
	y_2	7.5	4.1276
	y_3	7.5	4.1226
	y_4	7.5	4.1232

The mean and variance of x and y in different groups are either the same or very close.

Correlation

cor.fun <- function(n) {
        x <- paste("x", n, sep = "_")
        y <- paste("y", n, sep = "_")
        x_col <- AnscombesQuartet[, x]
        y_col <- AnscombesQuartet[, y]
        correl <- cor(x_col, y_col)
        return(correl)
}
corr <- sapply(1:4, cor.fun)

group I	group II	group III	group IV
Correlation of x and y within Each Group
0.8164	0.8162	0.8163	0.8165

The correlation of x and y within each group are very close too.

Linear Regression

lm.fun <- function(n) {
        x <- paste("x", n, sep = "_")
        y <- paste("y", n, sep = "_")
        x_col <- AnscombesQuartet[, x]
        y_col <- AnscombesQuartet[, y]
        lm_model <- lm(y_col ~ x_col)
        intercept <- round(lm_model$coef[1], 0)
        slope <- round(lm_model$coef[2], 1)
        return(paste0("y = ", intercept, " + ", slope, "*x"))
}
reg <- sapply(1:4, lm.fun)

group I	group II	group III	group IV
Linear Regression Equation of Each Group
y = 3 + 0.5*x	y = 3 + 0.5*x	y = 3 + 0.5*x	y = 3 + 0.5*x

Linear models show no difference between groups.

Visualization

Finally, let’s make some plots!
But before that, remember to transform the data into tidy format.

AnscombesQuartet.x <- AnscombesQuartet %>%
        select(contains("x")) %>%
        gather(key = key, value = x) %>%
        separate(col = key, into = c("variable", "group")) %>%
        select(-variable)
AnscombesQuartet.y <- AnscombesQuartet %>%
        select(contains("y")) %>%
        gather(key = key, value = y) %>%
        select(-key)
AnscombesQuartet.tidy <- cbind(AnscombesQuartet.x, AnscombesQuartet.y)

g <- ggplot(aes(x = x, y = y), data = AnscombesQuartet.tidy) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE, size = 0.5) +
        facet_wrap(~group, ncol = 2) +
        ggtitle("Anscombe's Quartet")
ggplotly(g)

Despite the summary statistics of the data are pretty much the same, the plots here are telling us a different story:

Data in group 1 has a linear relationship.
Data in group 2 has a non-linear relationshp.
The majority of the data in group 3 has a linear relationship, but there’s an outlier.
In group 4, except for the outlier, the two variables have no relationship at all, since x is the same no matter what the y value is.

Anscombe’s Quartet

Yao Zhang

Prologue

Summary Statistics

Visualization