ANCOVA is a more specialized test than many of the others in these tutorials, but it is super useful. It tests how the relationship (slope) differs between a continuous predictor variable and a continuous response variable differ among groups. For instance, we could test if the relationship between an iris’s sepal width and petal length differs among species of iris. Often, a covariate (the continuous predictor variable above) is an initial height or weight of an organism, but by no means has to be.
It is very similar to dummy variable regression and indicator variable regression. ANCOVA is really just a special case of ANOVA.
We will use the iris dataset for this analysis, a freely available dataset that you can view using the code below.
View(iris)
str(iris)
Assumptions for this test are the same as ANOVA (our old friend) with a couple of extra ones.
You have to determine if your sampling was independent and random. The other assumptions can be tested with the code below.
First we will test normality using two methods. The first is by looking at the histogram using the hist() function. This is a good way to ‘eyeball’ the data. The second way is to use a statistical test for normality called the Shapiro-Wilk test. There is some disagreement on if this test does a good job, but it is widely used, so I do too (at least for the moment).
attach(iris) #Attachs data to make it easier to call the data later
hist(Sepal.Width)
shapiro.test(Sepal.Width)
##
## Shapiro-Wilk normality test
##
## data: Sepal.Width
## W = 0.98492, p-value = 0.1012
The histogram is rather normal, but maybe has a bit of right skew. The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. In summary, if the Shapiro-Wilk p-value is greater than 0.05, it suggests the data is normal; if it’s less than 0.05, then it’s nonnormal.
Next we will test if the variances are the same among the groups using the Levene’s test.
We will call the package car to run the function leveneTest. If you have not installed the package car before, run the line #install.packages… below once and never again. One installation is all you need per computer. To run it, delete the pound sign and hit run. Then put the hashtag back to make it a ‘comment’ and not code.
#install.packages("car") ##Only run this once ever per computer
library(car) #run this every time you open r and need the package
## Warning: package 'car' was built under R version 3.4.4
## Warning: package 'carData' was built under R version 3.4.4
lm.iris1<-lm(Sepal.Width~Species,data=iris) #makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5902 0.5555
## 147
The variances are equal, as the p-value is 0.556 (greater than 0.05).
To test linearity use the code below.
plot(iris$Petal.Length,iris$Sepal.Width, main="", xlab="Petal Length (mm)",ylab="Sepal Width (mm)")
#Note: list the predictor variable first, then the response variable.
#2nd Note: the structure of iris$Sepal.Width is dataframe$variable
The relationship is not linear. For the sake of this example though, let’s continue. If your own data is nonlinear, consider transforming your data.
If you are unfamiliar with what a statistical interaction is, please see this link.
First we will look for an interaction, using the functions lm and anova. In the first line of code below, we create an object called lm.1 using the function lm. Before the ~ is the response variable (sepal width) and after the ~ are the two predictor variables - Species and Petal Length (our covariate).
Note that between our two predictor variables below, we use a ’*’ - this tells R that we want to look for an interaction.
We then run a anova on our object lm.1. This returns an ANOVA table. P-values are in the column Pr(>F). In the species row, we see that p is very small (that’s 2 x 10^-16). In Petal Length, once again our p-value is very low.
But in this test, what we are actually interested in is the interaction, the Species:Petal.Length row. We see that the p-value is 0.1895 - higher than 0.05 so not significant. There is no interaction present, we can continue.
lm.1<-lm(Sepal.Width~Species*Petal.Length,data=iris)
anova(lm.1) #no interaction
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.3449 5.6725 56.7086 < 2.2e-16 ***
## Petal.Length 1 2.4225 2.4225 24.2184 2.322e-06 ***
## Species:Petal.Length 2 0.1354 0.0677 0.6768 0.5098
## Residuals 144 14.4041 0.1000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now that we have addressed our assumptions, including that there is not an interaction, let’s actually do the ANCOVA.
Look at the code below. You will notice that it looks very similar to the code above, but there is a + between Species and Petal Length. This tells R that we aren’t interested in the interaction, so don’t test for it.
Looking at the p-values below, both species and petal length are significant! We know that the species of iris has a significant effect on the sepal width. We also know that the relationship (slope) of sepal length given petal length differs among the species.
#Second Test
lm.2<-lm(Sepal.Width~Species+Petal.Length,data=iris)
anova(lm.2) #Species and covariate petal length are both significant
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.3449 5.6725 56.961 < 2.2e-16 ***
## Petal.Length 1 2.4225 2.4225 24.326 2.186e-06 ***
## Residuals 146 14.5395 0.0996
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But which group/s are different?
We will use the package multcomp to help us find out. Like usual, use install.packages to install this package if you haven’t before.
We make an object called model below using the function aov. This function is very similar to the function lm that we used before, but aov() is compatible with the later lines of code.
Next we run the post-hoc comparisons using the function glht, a function that can be tricky to use, so let’s walk through it step by step. We make a new object called posth. Within the function glht, we specify our aov model, called model in this case. Then we specify that our factor variable is Species and we want to do a Tukey test within the mcp and linfct functions. Other post hoc comparisons than Tukey are available. Finally, we use summary to view the results.
library(multcomp)
model = aov(Sepal.Width ~ Species + Petal.Length, data=iris)
posth = glht(model, linfct=mcp(Species='Tukey'))
summary(posth)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = Sepal.Width ~ Species + Petal.Length, data = iris)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## versicolor - setosa == 0 -1.4927 0.1806 -8.264 <0.001 ***
## virginica - setosa == 0 -1.6741 0.2553 -6.557 <0.001 ***
## virginica - versicolor == 0 -0.1814 0.1004 -1.806 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
We see that the first two comparisons (versicolor and setosa; virginica and setosa), the p-value is less than 0.05, so these groups differ significantly. The final comparison (virginica and versicolor) is not significant.
We can conclude that I. setosa has a significantly different relationship between sepal width and petal length than the other species.
If you would like to plot your ANCOVa results, please see Classed-up Scatterplot using ggplot2 in the plotting tutorial.
You have to determine if your sampling was independent and random. The other assumptions can be tested with the code below.
attach(iris) #Attachs data to make it easier to call the data later
## The following objects are masked from iris (pos = 10):
##
## Petal.Length, Petal.Width, Sepal.Length, Sepal.Width, Species
hist(Sepal.Width)
shapiro.test(Sepal.Width)
##
## Shapiro-Wilk normality test
##
## data: Sepal.Width
## W = 0.98492, p-value = 0.1012
The Shapiro-Wilk test, if the Shapiro-Wilk p-value is greater than 0.05, it suggests the data is normal; if it’s less than 0.05, then it’s nonnormal.
If you have not installed the package car before, run the line #install.packages… below once and never again. One installation is all you need per computer. To run it, delete the pound sign and hit run. Then put the hashtag back to make it a ‘comment’ and not code.
Variances are equal if levene test p-value is greater than 0.05.
#install.packages("car") ##Only run this once ever per computer
library(car) #run this every time you open r and need the package
lm.iris1<-lm(Sepal.Width~Species,data=iris) #makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5902 0.5555
## 147
To test linearity use the code below.
plot(iris$Petal.Length,iris$Sepal.Width, main="", xlab="Petal Length (mm)",ylab="Sepal Width (mm)")
#Note: list the predictor variable first, then the response variable.
#2nd Note: the structure of iris$Sepal.Width is dataframe$variable
If you are unfamiliar with what a statistical interaction is, please see this link.
lm.1<-lm(Sepal.Width~Species*Petal.Length,data=iris)
anova(lm.1) #no interaction
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.3449 5.6725 56.7086 < 2.2e-16 ***
## Petal.Length 1 2.4225 2.4225 24.2184 2.322e-06 ***
## Species:Petal.Length 2 0.1354 0.0677 0.6768 0.5098
## Residuals 144 14.4041 0.1000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Second Test
lm.2<-lm(Sepal.Width~Species+Petal.Length,data=iris)
anova(lm.2) #Species and covariate petal length are both significant
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.3449 5.6725 56.961 < 2.2e-16 ***
## Petal.Length 1 2.4225 2.4225 24.326 2.186e-06 ***
## Residuals 146 14.5395 0.0996
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp)
model = aov(Sepal.Width ~ Species + Petal.Length, data=iris)
posth = glht(model, linfct=mcp(Species='Tukey'))
summary(posth)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = Sepal.Width ~ Species + Petal.Length, data = iris)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## versicolor - setosa == 0 -1.4927 0.1806 -8.264 <0.001 ***
## virginica - setosa == 0 -1.6741 0.2553 -6.557 <0.001 ***
## virginica - versicolor == 0 -0.1814 0.1004 -1.806 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
If you would like to plot your ANCOVa results, please see Classed-up Scatterplot using ggplot2 in the plotting tutorial.