ANCOVA (Analysis of covariance)

ANCOVA is a more specialized test than many of the others in these tutorials, but it is super useful. It tests how the relationship (slope) differs between a continuous predictor variable and a continuous response variable differ among groups. For instance, we could test if the relationship between an iris’s sepal width and petal length differs among species of iris. Often, a covariate (the continuous predictor variable above) is an initial height or weight of an organism, but by no means has to be.

It is very similar to dummy variable regression and indicator variable regression. ANCOVA is really just a special case of ANOVA.

We will use the iris dataset for this analysis, a freely available dataset that you can view using the code below.

View(iris)
str(iris)

ANCOVA assumptions

Assumptions for this test are the same as ANOVA (our old friend) with a couple of extra ones.

Assumptions of ANOVA

  1. Samples are independent from one another (within and among groups).
  2. Samples were randomly chosen.
  3. Response variable is normally distributed (often people check residual normality too, but I won’t be showing that here). To do this, google qqplots in R.
  4. Variances are equal among groups.

Extra assumptions

  1. The relationship between the dependent variable (y) and the covariate (x) is linear.
  2. There is no interaction between your two predictor variables (i.e. the trendlines of each group are all parallel (homogeneity of regression slopes)).
  3. The covariate is independent of the treatment effects (i.e. the covariant and independent variables are independent).

You have to determine if your sampling was independent and random. The other assumptions can be tested with the code below.

Testing Normality

First we will test normality using two methods. The first is by looking at the histogram using the hist() function. This is a good way to ‘eyeball’ the data. The second way is to use a statistical test for normality called the Shapiro-Wilk test. There is some disagreement on if this test does a good job, but it is widely used, so I do too (at least for the moment).

attach(iris) #Attachs data to make it easier to call the data later
hist(Sepal.Width)

shapiro.test(Sepal.Width)
## 
##  Shapiro-Wilk normality test
## 
## data:  Sepal.Width
## W = 0.98492, p-value = 0.1012

The histogram is rather normal, but maybe has a bit of right skew. The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. In summary, if the Shapiro-Wilk p-value is greater than 0.05, it suggests the data is normal; if it’s less than 0.05, then it’s nonnormal.

Testing Variances

Next we will test if the variances are the same among the groups using the Levene’s test.

We will call the package car to run the function leveneTest. If you have not installed the package car before, run the line #install.packages… below once and never again. One installation is all you need per computer. To run it, delete the pound sign and hit run. Then put the hashtag back to make it a ‘comment’ and not code.

#install.packages("car") ##Only run this once ever per computer
library(car) #run this every time you open r and need the package
## Warning: package 'car' was built under R version 3.4.4
## Warning: package 'carData' was built under R version 3.4.4
lm.iris1<-lm(Sepal.Width~Species,data=iris) #makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.5902 0.5555
##       147

The variances are equal, as the p-value is 0.556 (greater than 0.05).

Linearity between response variable and covariate

To test linearity use the code below.

plot(iris$Petal.Length,iris$Sepal.Width, main="", xlab="Petal Length (mm)",ylab="Sepal Width (mm)") 

            #Note: list the predictor variable first, then the response variable.
            #2nd Note: the structure of iris$Sepal.Width is dataframe$variable

The relationship is not linear. For the sake of this example though, let’s continue. If your own data is nonlinear, consider transforming your data.

Testing for an interaction

If you are unfamiliar with what a statistical interaction is, please see this link.

First we will look for an interaction, using the functions lm and anova. In the first line of code below, we create an object called lm.1 using the function lm. Before the ~ is the response variable (sepal width) and after the ~ are the two predictor variables - Species and Petal Length (our covariate).

Note that between our two predictor variables below, we use a ’*’ - this tells R that we want to look for an interaction.

We then run a anova on our object lm.1. This returns an ANOVA table. P-values are in the column Pr(>F). In the species row, we see that p is very small (that’s 2 x 10^-16). In Petal Length, once again our p-value is very low.

But in this test, what we are actually interested in is the interaction, the Species:Petal.Length row. We see that the p-value is 0.1895 - higher than 0.05 so not significant. There is no interaction present, we can continue.

lm.1<-lm(Sepal.Width~Species*Petal.Length,data=iris)
anova(lm.1)  #no interaction
## Analysis of Variance Table
## 
## Response: Sepal.Width
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species                2 11.3449  5.6725 56.7086 < 2.2e-16 ***
## Petal.Length           1  2.4225  2.4225 24.2184 2.322e-06 ***
## Species:Petal.Length   2  0.1354  0.0677  0.6768    0.5098    
## Residuals            144 14.4041  0.1000                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANCOVA

Now that we have addressed our assumptions, including that there is not an interaction, let’s actually do the ANCOVA.

Look at the code below. You will notice that it looks very similar to the code above, but there is a + between Species and Petal Length. This tells R that we aren’t interested in the interaction, so don’t test for it.

Looking at the p-values below, both species and petal length are significant! We know that the species of iris has a significant effect on the sepal width. We also know that the relationship (slope) of sepal length given petal length differs among the species.

#Second Test
lm.2<-lm(Sepal.Width~Species+Petal.Length,data=iris)
anova(lm.2) #Species and covariate petal length are both significant
## Analysis of Variance Table
## 
## Response: Sepal.Width
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species        2 11.3449  5.6725  56.961 < 2.2e-16 ***
## Petal.Length   1  2.4225  2.4225  24.326 2.186e-06 ***
## Residuals    146 14.5395  0.0996                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post hoc comparisons

But which group/s are different?

We will use the package multcomp to help us find out. Like usual, use install.packages to install this package if you haven’t before.

We make an object called model below using the function aov. This function is very similar to the function lm that we used before, but aov() is compatible with the later lines of code.

Next we run the post-hoc comparisons using the function glht, a function that can be tricky to use, so let’s walk through it step by step. We make a new object called posth. Within the function glht, we specify our aov model, called model in this case. Then we specify that our factor variable is Species and we want to do a Tukey test within the mcp and linfct functions. Other post hoc comparisons than Tukey are available. Finally, we use summary to view the results.

library(multcomp)
model = aov(Sepal.Width ~ Species + Petal.Length, data=iris)
posth = glht(model, linfct=mcp(Species='Tukey'))
summary(posth)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = Sepal.Width ~ Species + Petal.Length, data = iris)
## 
## Linear Hypotheses:
##                             Estimate Std. Error t value Pr(>|t|)    
## versicolor - setosa == 0     -1.4927     0.1806  -8.264   <0.001 ***
## virginica - setosa == 0      -1.6741     0.2553  -6.557   <0.001 ***
## virginica - versicolor == 0  -0.1814     0.1004  -1.806    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

We see that the first two comparisons (versicolor and setosa; virginica and setosa), the p-value is less than 0.05, so these groups differ significantly. The final comparison (virginica and versicolor) is not significant.

We can conclude that I. setosa has a significantly different relationship between sepal width and petal length than the other species.

Plotting your results

If you would like to plot your ANCOVa results, please see Classed-up Scatterplot using ggplot2 in the plotting tutorial.

Now with your data

You have to determine if your sampling was independent and random. The other assumptions can be tested with the code below.

Testing Normality

attach(iris) #Attachs data to make it easier to call the data later
## The following objects are masked from iris (pos = 10):
## 
##     Petal.Length, Petal.Width, Sepal.Length, Sepal.Width, Species
hist(Sepal.Width)

shapiro.test(Sepal.Width)
## 
##  Shapiro-Wilk normality test
## 
## data:  Sepal.Width
## W = 0.98492, p-value = 0.1012

The Shapiro-Wilk test, if the Shapiro-Wilk p-value is greater than 0.05, it suggests the data is normal; if it’s less than 0.05, then it’s nonnormal.

Testing Variances

If you have not installed the package car before, run the line #install.packages… below once and never again. One installation is all you need per computer. To run it, delete the pound sign and hit run. Then put the hashtag back to make it a ‘comment’ and not code.

Variances are equal if levene test p-value is greater than 0.05.

#install.packages("car") ##Only run this once ever per computer
library(car) #run this every time you open r and need the package
lm.iris1<-lm(Sepal.Width~Species,data=iris) #makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.5902 0.5555
##       147

Linearity between response variable and covariate

To test linearity use the code below.

plot(iris$Petal.Length,iris$Sepal.Width, main="", xlab="Petal Length (mm)",ylab="Sepal Width (mm)") 

            #Note: list the predictor variable first, then the response variable.
            #2nd Note: the structure of iris$Sepal.Width is dataframe$variable

Testing for an interaction

If you are unfamiliar with what a statistical interaction is, please see this link.

lm.1<-lm(Sepal.Width~Species*Petal.Length,data=iris)
anova(lm.1)  #no interaction
## Analysis of Variance Table
## 
## Response: Sepal.Width
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species                2 11.3449  5.6725 56.7086 < 2.2e-16 ***
## Petal.Length           1  2.4225  2.4225 24.2184 2.322e-06 ***
## Species:Petal.Length   2  0.1354  0.0677  0.6768    0.5098    
## Residuals            144 14.4041  0.1000                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANCOVA - second test with no interaction

#Second Test
lm.2<-lm(Sepal.Width~Species+Petal.Length,data=iris)
anova(lm.2) #Species and covariate petal length are both significant
## Analysis of Variance Table
## 
## Response: Sepal.Width
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species        2 11.3449  5.6725  56.961 < 2.2e-16 ***
## Petal.Length   1  2.4225  2.4225  24.326 2.186e-06 ***
## Residuals    146 14.5395  0.0996                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post hoc comparisons

library(multcomp)
model = aov(Sepal.Width ~ Species + Petal.Length, data=iris)
posth = glht(model, linfct=mcp(Species='Tukey'))
summary(posth)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = Sepal.Width ~ Species + Petal.Length, data = iris)
## 
## Linear Hypotheses:
##                             Estimate Std. Error t value Pr(>|t|)    
## versicolor - setosa == 0     -1.4927     0.1806  -8.264   <0.001 ***
## virginica - setosa == 0      -1.6741     0.2553  -6.557   <0.001 ***
## virginica - versicolor == 0  -0.1814     0.1004  -1.806    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Plotting your results

If you would like to plot your ANCOVa results, please see Classed-up Scatterplot using ggplot2 in the plotting tutorial.