November 13, 2016


Prologue

If you’re looking at this, I suppose you must’ve heard Anscombe’s Quartet before.

But just in case, if you don’t know what that is, you can find it out here.


Summary Statistics

  1. Original Data Set

The original data from Anscombe’s Quartet looks like this:

AnscombesQuartet <- read.csv("Anscombe's Quartet.csv")
Anscombe’s Quartet Original Data
group I   group II   group III   group IV
x_1 y_1   x_2 y_2   x_3 y_3   x_4 y_4
10 8.04   10 9.14   10 7.46   8 6.58
8 6.95   8 8.14   8 6.77   8 5.76
13 7.58   13 8.74   13 12.74   8 7.71
9 8.81   9 8.77   9 7.11   8 8.84
11 8.33   11 9.26   11 7.81   8 8.47
14 9.96   14 8.1   14 8.84   8 7.04
6 7.24   6 6.13   6 6.08   8 5.25
4 4.26   4 3.1   4 5.39   8 5.56
12 10.84   12 9.13   12 8.15   19 12.5
7 4.82   7 7.26   7 6.42   8 7.91
5 5.68   5 4.74   5 5.73   8 6.89


  1. Mean & Variance
AnscombesQuartet.sum <- AnscombesQuartet %>%
        summarise_each(funs(mean, variance = var)) %>%
        gather(key = sum_type, value = value) %>%
        separate(col = sum_type, into = c("variable", "group","sum_type")) %>%
        spread(key = sum_type, value = value) %>%
        unite(variable, variable, group)
Mean & Variance of Each Variable in Different Groups
variable mean variance
x_1 9 11
x_2 9 11
x_3 9 11
x_4 9 11
y_1 7.5 4.1273
y_2 7.5 4.1276
y_3 7.5 4.1226
y_4 7.5 4.1232

The mean and variance of x and y in different groups are either the same or very close.


  1. Correlation
cor.fun <- function(n) {
        x <- paste("x", n, sep = "_")
        y <- paste("y", n, sep = "_")
        x_col <- AnscombesQuartet[, x]
        y_col <- AnscombesQuartet[, y]
        correl <- cor(x_col, y_col)
        return(correl)
}
corr <- sapply(1:4, cor.fun)
Correlation of x and y within Each Group
group I group II group III group IV
0.8164 0.8162 0.8163 0.8165

The correlation of x and y within each group are very close too.


  1. Linear Regression
lm.fun <- function(n) {
        x <- paste("x", n, sep = "_")
        y <- paste("y", n, sep = "_")
        x_col <- AnscombesQuartet[, x]
        y_col <- AnscombesQuartet[, y]
        lm_model <- lm(y_col ~ x_col)
        intercept <- round(lm_model$coef[1], 0)
        slope <- round(lm_model$coef[2], 1)
        return(paste0("y = ", intercept, " + ", slope, "*x"))
}
reg <- sapply(1:4, lm.fun)
Linear Regression Equation of Each Group
group I group II group III group IV
y = 3 + 0.5*x y = 3 + 0.5*x y = 3 + 0.5*x y = 3 + 0.5*x

Linear models show no difference between groups.


Visualization

Finally, let’s make some plots!
But before that, remember to transform the data into tidy format.

AnscombesQuartet.x <- AnscombesQuartet %>%
        select(contains("x")) %>%
        gather(key = key, value = x) %>%
        separate(col = key, into = c("variable", "group")) %>%
        select(-variable)
AnscombesQuartet.y <- AnscombesQuartet %>%
        select(contains("y")) %>%
        gather(key = key, value = y) %>%
        select(-key)
AnscombesQuartet.tidy <- cbind(AnscombesQuartet.x, AnscombesQuartet.y)
g <- ggplot(aes(x = x, y = y), data = AnscombesQuartet.tidy) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE, size = 0.5) +
        facet_wrap(~group, ncol = 2) +
        ggtitle("Anscombe's Quartet")
ggplotly(g)

Despite the summary statistics of the data are pretty much the same, the plots here are telling us a different story:

  1. Data in group 1 has a linear relationship.
  2. Data in group 2 has a non-linear relationshp.
  3. The majority of the data in group 3 has a linear relationship, but there’s an outlier.
  4. In group 4, except for the outlier, the two variables have no relationship at all, since x is the same no matter what the y value is.