Understand the principle of ANOVA
Understand what between-group and within-group variance is
Be able to conduct and interpret a simple one-way ANOVA
Create a data frame with two continuous variables of any length. If you do not know what a continuous variable is, look it up. Changes in one variable should cause changes in the other. Explain the context of the data. Plot the data. Run a regression analysis based on the data set. Assign the model output to the object name ‘m1’. Interpret the R output of summary(m1). Use the function abline(m1) to add the regression line to the plot. Plot the 4 diagnostic plots on a multipanel plot (using par(mfrow = c(2, 2))) and interpret them.
## Create two vectors of normally distributed
## continuous random variables and sort them in increasing fashion,
## so that we see a correlation.
set.seed(5)
x <- sort(rnorm(n = 30, mean = 10))
y <- sort(rnorm(n = 30, mean = 10))
dat <- data.frame(x, y)
plot(y ~ x, data = dat)
m1 <- lm(y ~ x, data = dat)
summary(m1)
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59803 -0.06977 -0.00385 0.07178 0.33423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.29580 0.33619 -0.88 0.386
## x 1.04427 0.03342 31.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1785 on 28 degrees of freedom
## Multiple R-squared: 0.9721, Adjusted R-squared: 0.9711
## F-statistic: 976.2 on 1 and 28 DF, p-value: < 2.2e-16
## Add model fit
abline(m1)
## Check the underlying model assumptions
par(mfrow = c(2, 2)) # split the graphics window into 4 panels
plot(m1)
## Observation 2 is highlighted as an influential point
## as judged by the Cook's distance (bottom right hand plot)
dat[2, ]
## Run an outlier test
library(car)
outlierTest(m1)
## rstudent unadjusted p-value Bonferonni p
## 2 -4.61105 8.662e-05 0.0025986
## Indeed, the outlier test flags observation 2.
## In a real-life scenario, you need to use your expert knowledge and common sense
## to judge whether such an observation truly is an outlier.
## If you conclude that it is, rerun the analysis without the influential
## observation no. 2 and check whether that greatly changes the results.
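## For example, a quick sketch (the object name 'm1.red' is just illustrative):
m1.red <- lm(y ~ x, data = dat[-2, ])  # refit the model without observation 2
summary(m1.red)  # compare these coefficients with those from summary(m1)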
## Alternatively, you can resort to a regression model that is robust against outliers
## and model the data without removing the outlier(s):
library(MASS)
m2 <- rlm(y ~ x, data = dat)
summary(m2)
##
## Call: rlm(formula = y ~ x, data = dat)
## Residuals:
## Min 1Q Median 3Q Max
## -0.63831 -0.07722 0.01020 0.05833 0.34061
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.0379 0.2649 0.1432
## x 1.0107 0.0263 38.3716
##
## Residual standard error: 0.1052 on 28 degrees of freedom
## Stupidly, this function does not provide P-values...
## These can be obtained using robust F-tests (so-called Wald tests)
## from the sfsmisc package
library(sfsmisc)
f.robftest(m2, var = "x") # note that you have to provide the quoted coefficient name
##
## robust F-test (as if non-random weights)
##
## data: from rlm(formula = y ~ x, data = dat)
## F = 1399.7, p-value < 2.2e-16
## alternative hypothesis: true x is not equal to 0
Analysis of variance (ANOVA) is used when more than two groups are to be compared (recall that a t-test compares exactly two groups). ANOVA is used so widely that you are likely to come across it many times in your scientific career. There are many different ways of conducting an ANOVA; here we will treat the most basic one, again with the emphasis on understanding the fundamental mechanism rather than on seeing all the different ways of performing ANOVA. The latter will be relatively easy once you have a solid understanding of the concept behind it.
ANOVA tests for differences between groups. For example, if you apply two different fertiliser treatments to a crop, you may want to find out whether there are differences in yield between the two treatments and the control. If you have only one factor, such as ‘fertiliser’, this is called a one-way ANOVA; if you have two factors (e.g. two sites, each containing the three treatments), it is a two-way ANOVA, and so on. The idea is to weigh the within-group variance in the data (the variance that is left once the group means have been accounted for, also called the error variance; bottom left panel in the figure below) against the variance between the groups (the between-group variance; bottom right panel in the figure). This intuitively makes sense: if the variance between groups is far larger than the variance within groups, the grouping factor (here the fertiliser treatment) has some sort of effect on the data. On the other hand, if the within-group variance is similar to the between-group variance, this hints at the grouping factor not explaining much of the variance in the data. Now all we need is a quantitative way of assessing what ‘larger’ and ‘similar’ mean in the last two sentences. As with the t-test and linear regression, we are looking for a statistic that we can use to assess the ‘rareness’ of our dataset. For ANOVA, we use the variance ratio (between-group/within-group variance), given the numbers of degrees of freedom for the within- and between-group variances. In other words, you ask ‘how much of the variance in the data can be explained by the grouping factor?’, or ‘what is the signal-to-noise ratio?’. Lastly, note that the total variation in the data (the total sum of squares) is obtained by adding the within- and between-group sums of squares.
This was the theory; now let us work through an example. We will make up some data to illustrate the way ANOVA works. There is no need to understand the following plotting code, but it may be helpful to see how the figure was drawn. All you need to understand is the first two lines, i.e. how the variable called \(y\) and the factor called \(fac\) are produced. If you want, you can imagine the crop yield vs. fertiliser treatment scenario outlined above, where the yield of a crop is a function of two fertiliser applications. Including the obligatory control group, which receives no fertiliser, our explanatory variable has three levels: Control, fertiliser A and fertiliser B. At the end of the experiment, we are certainly interested in whether there is a significant fertiliser effect on crop yield. Alright, let’s go:
y <- c(3, 4, 2, 1, 9, 7, 6, 8, 2, 3, 3, 1) # response variable
## We can use the gl() function to create factor levels: n specifies the number of factor levels (categories) and k the number of replicates for each factor level
fac <- gl(n = 3, k = 4, lab = c("Control", "A", "B")) # fac is short for factor
dat <- data.frame(y, fac)
## We can use the group_by() and summarise() functions to compute all three group means simultaneously
library(dplyr)
means <- group_by(.data = dat, fac) %>% summarise(y = mean(y))
## Draw explanatory plots
par(mfrow = c(2, 2), mar = c(4.5, 4, 2, 1), mgp = c(2, 0.8, 0)) # set up 2 x 2 plotting area and reduce the panel margins
boxplot(y ~ fac, ylim = c(0, 10), main = "Boxplot", cex.main = 1.1) # exploratory plot
plot(y, main = "Total variance", ylim = c(0, 10), cex.main = 1.1) # plot the total variance
abline(a = mean(y), b = 0) # draw mean, b = 0 means a slope of zero
for (i in 1:12) {lines(c(i, i), c(mean(y), y[i]))}
groupmeans <- rep(means$y, each = 4) # extend to vector of length 12
plot(y, main = "Within-group variance\n(Noise)", ylim = c(0, 10), cex.main = 1.1) #within-group var
lines(c(1, 4), c(means[1, 2], means[1, 2])) # draw group means
lines(c(5, 8), c(means[2, 2], means[2, 2]))
lines(c(9, 12), c(means[3, 2], means[3, 2]))
for (i in 1:4) {lines(c(i, i), c(means[1, 2], y[i]))} # draw residuals
for (i in 5:8) {lines(c(i, i), c(means[2, 2], y[i]))}
for (i in 9:12) {lines(c(i,i), c(means[3, 2], y[i]))}
plot(y, main = "Between-group variance\n(Signal)", ylim = c(0, 10), cex.main = 1.1) # between-group variance
abline(mean(y), 0) # draw mean
points(groupmeans, pch = 16)
for (i in 1:12) {lines(c(i, i), c(groupmeans[i], mean(y)))}
So now we understand visually what is going on. How do we calculate the appropriate variances? First, let us compute the between-group variance. This is the sum of the squared differences between the group means and the overall mean (one term per observation, as shown in the bottom right panel of the figure), divided by the degrees of freedom (the number of groups minus one, i.e. 2):
## Between-group variance
between <- sum((mean(y) - groupmeans)^2)/2 # for calculation of groupmeans see above
between
[1] 35.08333
Next, we compute the within-group variance, as shown in the bottom left panel of the figure:
## Within-group variance
within <- sum((y - groupmeans)^2)/9
within
[1] 1.416667
Here we have 9 degrees of freedom, three within each group (four observations minus one per group). Now all we need to do is divide the between-group variance by the within-group variance (this is the signal-to-noise ratio). Some clever statisticians have found out that, under the null hypothesis, this ratio follows a so-called F-distribution (variance ratio distribution), here with 2 and 9 degrees of freedom:
f.value <- between/within # our F-value
f.value
[1] 24.76471
pf(q = f.value, df1 = 2, df2 = 9, lower.tail = F) # compare to F-distribution with 2 and 9 degrees of freedom
[1] 0.0002192338
The second command gives us the probability of obtaining such a variance ratio (or a larger one) by chance, given 2 and 9 degrees of freedom, and assuming the data are normally distributed. This probability is our well-known P-value, and as expected (from looking at the figure), it is very low, i.e. we can assume that the grouping factor explains a lot of the variation in the data. We thus reject the null hypothesis that there are no differences between groups. Note that pf() by default gives us the probability to the left of the specified F-value (here 24.76), which is why we set lower.tail = F (equivalent to subtracting the default probability from 1). By now, you should know why we need to apply lower.tail = F. If you need a reminder of why, plot a histogram of 100000 random numbers that follow an F-distribution with 2 and 9 degrees of freedom (using rf()). Then draw a vertical line where our computed F-value sits (at 24.76) using the function abline(), specifying the argument ‘v’, which stands for a vertical line.
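If you want to check your plot, a minimal sketch could look like this (the object name f.random and the number of histogram bars are arbitrary choices):
set.seed(1)
f.random <- rf(n = 100000, df1 = 2, df2 = 9)  # random draws from an F-distribution
hist(f.random, breaks = 100, main = "F-distribution with 2 and 9 df")
abline(v = f.value)  # our F-value of 24.76 lies far out in the right tail
## Note: 1 - pf(f.value, df1 = 2, df2 = 9) gives the same P-value as lower.tail = F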
Of course, you can get to the P-value much more easily and quickly now that you know exactly what is calculated:
m1 <- aov(y ~ fac, data = dat)
summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
fac 2 70.17 35.08 24.77 0.000219 ***
Residuals 9 12.75 1.42
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
between
[1] 35.08333
within
[1] 1.416667
You can see the degrees of freedom, the sums of squares, the mean squares (the variances), the F-value and the P-value, as computed by hand. Notice that the between-group variance is listed as the mean square associated with the explanatory variable (here simply called ‘fac’). The within-group variance is referred to as ‘Residuals’, because it is the leftover variance that remains once the grouping (under the explanatory variable) has been accounted for. Again, our test statistic is the variance ratio (between-group variance/within-group variance), which follows an F-distribution. The model diagnostic plots can be obtained with the following lines:
par(mfrow = c(2, 2), mar = c(4.5, 4, 1.5, 1), mgp = c(2.5, 0.9, 0))
plot(m1)
and are interpreted as in linear regression. (In the earlier regression example we knew the data were normal because we created them with rnorm(); for real-world data, there is a good chance that they are not. Further below are some directions on what you can do if this is the case.) Next, if your ANOVA result is significant, you may want to know between which groups the significant differences occur. The ANOVA result only tells us that there are significant differences somewhere, not where they occur. To find out, we have to follow up with a so-called post-hoc test. Perhaps the most popular post-hoc test is Tukey’s HSD (honest significant difference) test, which performs pairwise comparisons (i.e. it compares all groups with each other, which is usually what we want). Essentially, this test uses multiple t-tests to look at specific differences between groups, with the important feature that it corrects for multiple comparisons. This matters: if you simply ran separate t-tests between all groups, the chance of finding a significant difference purely by chance would increase with every comparison, i.e. you would be inflating your type I error probability. The Tukey HSD test corrects for this on the fly and is easy to apply, which makes it superior to running a series of t-tests. R offers a TukeyHSD function, but it only works with the aov command, which is quite limiting and annoying, as most people simply use lm or may want to perform pairwise comparisons based on Tukey’s method with other types of models.
m1 <- aov(y ~ fac, data = dat)
summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
fac 2 70.17 35.08 24.77 0.000219 ***
Residuals 9 12.75 1.42
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.lm(m1)
Call:
aov(formula = y ~ fac, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 0.1250 0.7500 1.5000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5000 0.5951 4.201 0.002304 **
facA 5.0000 0.8416 5.941 0.000218 ***
facB -0.2500 0.8416 -0.297 0.773174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 9 degrees of freedom
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8121
F-statistic: 24.76 on 2 and 9 DF, p-value: 0.0002192
TukeyHSD(m1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ fac, data = dat)
$fac
diff lwr upr p adj
A-Control 5.00 2.650177 7.349823 0.0005705
B-Control -0.25 -2.599823 2.099823 0.9527766
B-A -5.25 -7.599823 -2.900177 0.0003989
boxplot(y ~ fac, data = dat, ylim = c(0, 12), main = "Boxplot", cex.main = 1.1)
text(x = 1:3, y = c(5, 10, 5), labels = c("a", "b", "a"))
m2 <- lm(y ~ fac, data = dat)
summary(m2)
Call:
lm(formula = y ~ fac, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 0.1250 0.7500 1.5000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5000 0.5951 4.201 0.002304 **
facA 5.0000 0.8416 5.941 0.000218 ***
facB -0.2500 0.8416 -0.297 0.773174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 9 degrees of freedom
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8121
F-statistic: 24.76 on 2 and 9 DF, p-value: 0.0002192
summary.aov(m2)
Df Sum Sq Mean Sq F value Pr(>F)
fac 2 70.17 35.08 24.77 0.000219 ***
Residuals 9 12.75 1.42
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(m2) # TukeyHSD function doesn't work with lm :-(
Error in UseMethod("TukeyHSD"): no applicable method for 'TukeyHSD' applied to an object of class "lm"
Since the TukeyHSD function only works with aov, I strongly suggest using the more flexible lsmeans function (short for least-squares means) from the emmeans package.
library(emmeans)
## Compute least-squares means
lsms <- lsmeans(m2, specs = "fac") # also works with the aov command, try it out! :-)
lsms
## fac lsmean SE df lower.CL upper.CL
## Control 2.50 0.595119 9 1.1537472 3.846253
## A 7.50 0.595119 9 6.1537472 8.846253
## B 2.25 0.595119 9 0.9037472 3.596253
##
## Confidence level used: 0.95
## Use the least-squares means for pairwise comparisons
contrast(object = lsms, method = "pairwise")
## contrast estimate SE df t.ratio p.value
## Control - A -5.00 0.8416254 9 -5.941 0.0006
## Control - B 0.25 0.8416254 9 0.297 0.9528
## A - B 5.25 0.8416254 9 6.238 0.0004
##
## P value adjustment: tukey method for comparing a family of 3 estimates
## The following is a shorthand notation of the previous line of code
pairs(lsms)
## contrast estimate SE df t.ratio p.value
## Control - A -5.00 0.8416254 9 -5.941 0.0006
## Control - B 0.25 0.8416254 9 0.297 0.9528
## A - B 5.25 0.8416254 9 6.238 0.0004
##
## P value adjustment: tukey method for comparing a family of 3 estimates
You can see that group A clearly differs from the control and from B, but the latter two are not significantly different from each other. The P-values have been corrected for multiple comparisons using the Tukey method. In a corresponding plot, groups that do not differ significantly are often labelled with the same letter. In our example, the first and third groups (Control and B) share the letter ‘a’, and the second group (A) is marked with a ‘b’, as in the boxplot drawn above.
As an exercise, load the data set ‘iris’. Using the above explanations, run an analysis of variance that makes sense (several would be possible). Your steps should include a brief explanation of the subset of data you want to analyse (use ?iris to learn more about the data set), the actual analysis, a Tukey’s HSD test, a plot and 4 diagnostic plots, including their interpretation. A possible skeleton is sketched below. NB: In an exam, I would give you more precise instructions on what exactly you should do. Here, you have some freedom as to what exactly you want to do. In any case, you should have a nice flow of explanations, R code, plots, and interpretation of your results.
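For orientation only, here is one possible skeleton (one of several sensible analyses; it assumes you model Sepal.Length as a function of Species, and the object name m.iris is arbitrary):
data(iris)
m.iris <- aov(Sepal.Length ~ Species, data = iris)  # one-way ANOVA
summary(m.iris)  # is there a significant species effect on sepal length?
TukeyHSD(m.iris)  # pairwise comparisons between the three species
boxplot(Sepal.Length ~ Species, data = iris)  # exploratory plot
par(mfrow = c(2, 2))  # 4 diagnostic plots
plot(m.iris)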
Another way to test for specific differences between groups is setting contrasts. In R, every object of class ‘factor’, such as our ‘fac’ defined above, contains predefined information on contrasts: by default, R uses so-called ‘treatment contrasts’, which means that all factor levels are compared to the first level. For factors created from character data, this is simply the level that comes first in the alphabet; for factors created with gl(), as here, it is the first level you supplied, i.e. ‘Control’. At this stage, it is important to be able to interpret the extended output of the ANOVA table:
summary.lm(m1)
Call:
aov(formula = y ~ fac, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 0.1250 0.7500 1.5000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5000 0.5951 4.201 0.002304 **
facA 5.0000 0.8416 5.941 0.000218 ***
facB -0.2500 0.8416 -0.297 0.773174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 9 degrees of freedom
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8121
F-statistic: 24.76 on 2 and 9 DF, p-value: 0.0002192
The first level of our factor ‘fac’ is ‘Control’, so this becomes the reference level (the intercept) that the other factor levels are tested against. The estimate for the intercept is thus the estimate for the mean value of the data in ‘y’ at level ‘Control’, with its associated standard error. The P-value for the intercept compares it to zero, so it is rarely sensible to interpret it. The second line, ‘facA’, compares the data in ‘y’ at factor level ‘A’ to those at ‘Control’ (the intercept): the estimate is the difference between ‘A’ and ‘Control’, together with the standard error of that difference, and the P-value in this line thus refers to the difference between groups ‘Control’ and ‘A’. The third line, ‘facB’, does the same for ‘B’ versus ‘Control’. Importantly, with treatment contrasts, we do not get any direct information on the difference between ‘A’ and ‘B’.
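As a quick check (nothing new here, we simply recompute the group means), you can verify this interpretation against the raw data:
tapply(y, fac, mean)  # group means: Control 2.50, A 7.50, B 2.25
## The intercept (2.5) is the mean of the reference level 'Control',
## facA (5.0) is the difference A - Control, and facB (-0.25) is B - Control.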
contrasts(dat$fac) # show the contrast matrix
A B
Control 0 0
A 1 0
B 0 1
## Define user-defined contrasts: column 1 contrasts Control with A,
## column 2 contrasts A with B
contrasts(fac) <- cbind(c(-1, 1, 0), c(0, -1, 1))
contrasts(fac)
[,1] [,2]
Control -1 0
A 1 -1
B 0 1
summary.lm(aov(y ~ fac))
Call:
aov(formula = y ~ fac)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 0.1250 0.7500 1.5000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0833 0.3436 11.884 8.36e-07 ***
fac1 1.5833 0.4859 3.258 0.00986 **
fac2 -1.8333 0.4859 -3.773 0.00440 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 9 degrees of freedom
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8121
F-statistic: 24.76 on 2 and 9 DF, p-value: 0.0002192
## Aside: two equivalent ways of turning a t-value into a two-sided P-value
## (illustrated here with t = 2.079 and 11 degrees of freedom)
2 * pt(q = 2.079, df = 11, lower.tail = F)
[1] 0.06180482
pt(q = 2.079, df = 11, lower.tail = F) + pt(q = -2.079, df = 11)
[1] 0.06180482
You can see that with these user-defined contrasts, both estimated contrasts are significant. Note, however, that because these contrasts are not orthogonal, the coefficients do not correspond exactly to the simple pairwise differences ‘Control vs. A’ and ‘A vs. B’; for straightforward pairwise comparisons, the Tukey/emmeans approach shown above is the safer route. There is a lot more to setting contrasts, and it is clear that this explanation is by far not exhaustive. The key thing is to understand that contrasts can be set, and that in R, the default is to compare all factor levels to the first level of the factor (which, for factors built from character data, is the alphabetically first level).
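Finally, as a small sanity check (a sketch; the object name m3 is arbitrary), note that changing the contrasts changes only the meaning of the coefficients, not the fitted model itself:
m3 <- aov(y ~ fac)  # fac now carries the user-defined contrasts set above
tapply(y, fac, mean)  # raw group means
tapply(fitted(m3), fac, mean)  # identical fitted group means: the fit is unchanged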