KickstartR: ANCOVA

Analysis of Covariance, or ANCOVA, is an extension of linear least-squares regression that allows us to compare categorical groups based on the linear relationship between a continuous response variable (Y) and a continuous predictor variable, or covariate (X). As such, it is a special case of a general linear model, implemented using the lm() function.

In biology, we often use ANCOVA to compare between natural biological categories like the two sexes or between different species. Here we will use it to compare morphological relationships among three different species of iris: Iris setosa, Iris versicolor, and Iris virginica. The main elaborate structure of the iris flower is called a sepal, and we can gauge the shape of the flowers by looking at a regression of sepal width against sepal length. Basically, we can ask whether some species tend to have flowers that have long-skinny sepals vs. short-wide sepals.

require(mosaic) # be sure to install and load the usual packages if you have not already
require(lattice)
require(datasets)

data(iris)
View(iris) # take a look at the data

In this case, we will examine sepal width as our response, using species as a categorical predictor (like in ANOVA) and sepal length as our coviate (as in linear regression).

It is always helpful to look at a plot first. Note that type=c("p","r") puts both points (p) and regression lines (r) on the plot

xyplot(Sepal.Width~Sepal.Length, data=iris, groups=Species, type=c("p","r"), auto.key=TRUE, xlab="Sepal length (cm)", ylab="Sepal width (cm)")

plot of chunk unnamed-chunk-3

From this plot, it appears that the the two dimensions of the sepals are positively related in all of the species, but Iris setosa appears to have both a steeper relationship and shorter, wider sepals. The other two species appear to be quite similar.

Now we can make the ANCOVA model in order to assess whether the differences in the relationships are greater than might be expected by chance - i.e., if all the individuals flowers were drawn from a single underlying population. Note that in the formula statement, we use * to indicate that we want to consider Sepal.Length and Species and their interaction Sepal.Length:Species.

Sepals.lm = lm(Sepal.Width~Sepal.Length*Species, data=iris)

In evaluating an ANCOVA model, we want to sequentially ask first whether there is a difference in slope, then if there is not, look for differences in intercept. Let’s take it a step at a time. We start with a peek at an anova table to evaluate the different terms in the model.

anova(Sepals.lm) # this command gives us an anova table to evalute different terms

## Analysis of Variance Table
## 
## Response: Sepal.Width
##                       Df Sum Sq Mean Sq F value  Pr(>F)    
## Sepal.Length           1   0.39    0.39    5.28   0.023 *  
## Species                2  15.72    7.86  105.99 < 2e-16 ***
## Sepal.Length:Species   2   1.51    0.76   10.20 7.2e-05 ***
## Residuals            144  10.68    0.07                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since we first want to know whether there are differences in slope, we need to look at the interaction term, which tells us whether the effect of the covariate (Sepal.Length) depends on our categorical predictor Species. Based on the Pr(>F) value for Sepal.Length:Species, we can see that the observed variation would be very unlikely to arise “by chance.” So we can conclude that the regression slopes do vary across the three species. The very low “p-value” for the Species term allows us to be confident that sepal widths vary systematically across the three species, while that of the Sepal.Length term tells us that sepal width does in fact appear to increase with sepal length, across all the species.

So what if we want to know the equations for the different species? Well, we could subset the data and run a separate linear regression for each of the species. But we can also draw the same information out of our single ANCOVA model, using the summary command.

summary(Sepals.lm) # this command gives us an anova table to evalute different terms

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length * Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7239 -0.1633 -0.0029  0.1646  0.6095 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -0.569      0.554   -1.03  0.30562    
## Sepal.Length                      0.799      0.110    7.23  2.5e-11 ***
## Speciesversicolor                 1.442      0.713    2.02  0.04506 *  
## Speciesvirginica                  2.016      0.686    2.94  0.00385 ** 
## Sepal.Length:Speciesversicolor   -0.479      0.134   -3.58  0.00046 ***
## Sepal.Length:Speciesvirginica    -0.567      0.126   -4.49  1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.272 on 144 degrees of freedom
## Multiple R-squared:  0.623,  Adjusted R-squared:  0.61 
## F-statistic: 47.5 on 5 and 144 DF,  p-value: <2e-16

Here is where reading thr R output gets a bit tricky. The important thing to remember is that R treats the levels of the categorical variable in alphanumeric order, and it uses the first one as a sort of reference point. So for Iris setosa we can just use the parameter estimates for the intercept and the slope (Sepal.Length): \[ W_{sepal} = -0.57 + 0.799 * L_{sepal} \] Now for each of the other species, we add the listed parameters to those reference parameters. So for Iris versicolor, we see add the Speciesversicolor estimate to the intercept (-0.57 + 1.44 = 0.87) and Sepal.Length:Speciesversicolor to the slope (0.799 - 0.479 = 0.320), so our equation for Iris versicolor is: \[ W_{sepal} = 0.87 + 0.320 * L_{sepal} \] Doing the same thing for Iris virginica gives us: \[ W_{sepal} = 1.45 + 0.232 * L_{sepal} \]

Looking back at the plot, these equations make sense, Iris setosa has the steepest relationship, with a lower intercept, while those of Iris versicolor and Iris virginica appear to be very similar.

To see what happens in a case where the differences are not so pronounced, let’s repeat the analysis, but leave out Iris setosa since it is so different. Note that we are doing this for illustration, and in a real analysis this subjects us to the “multiple comparison problem.”

First, I will subset the data and make a new data frame called iris2

iris2 = subset(iris, Species != "setosa", drop=TRUE)

Now look we can make a new model and analyze it:

sepal2.lm = lm(Sepal.Width~Sepal.Length*Species, data=iris2)
anova(sepal2.lm)

## Analysis of Variance Table
## 
## Response: Sepal.Width
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## Sepal.Length          1   3.36    3.36   42.92 2.8e-09 ***
## Species               1   0.02    0.02    0.22    0.64    
## Sepal.Length:Species  1   0.06    0.06    0.77    0.38    
## Residuals            96   7.52    0.08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, you can see from the “p-value” for the interaction term Sepal.Length:Species that the rather subtle difference in slope could have very easily be observed by chance. When this occurs in ANCOVA, you generally want to reconstruct the model, leaving out the interaction term. This means changing the * to a + in our formula statement. This allows us to ask, given that there is not a difference in slope, is there a difference in intercept?

sepal3.lm = lm(Sepal.Width~Sepal.Length+Species, data=iris2)
anova(sepal3.lm)

## Analysis of Variance Table
## 
## Response: Sepal.Width
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## Sepal.Length  1   3.36    3.36   43.02 2.6e-09 ***
## Species       1   0.02    0.02    0.22    0.64    
## Residuals    97   7.58    0.08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on these results, we can see that there is little evidence for a difference in sepal width between species, which is not too suprising if we look back at the original plot. So just based on these two species, we would conclude that a single relationship between sepal width and sepal length is the best model for the data at hand. We can make that model by simply dropping the Species term from the model altogether.

sepal4.lm = lm(Sepal.Width~Sepal.Length, data=iris2)
summary(sepal4.lm) # note that I switched to summary so we can get the coefficients

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6718 -0.1477  0.0094  0.1744  0.6008 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.1309     0.2659    4.25  4.8e-05 ***
## Sepal.Length   0.2780     0.0422    6.59  2.3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.278 on 98 degrees of freedom
## Multiple R-squared:  0.307,  Adjusted R-squared:   0.3 
## F-statistic: 43.4 on 1 and 98 DF,  p-value: 2.27e-09

So from this, the relationship between width and length of sepals for these two species is \[ W_{sepal} = 1.13 + 0.278 * L_{sepal} \] Note that the slope and intercept for this equation are both intermediate between the two separate equations for the individual species described above. We can visualize the relationship by leaving the species grouping off of the plot.

xyplot(Sepal.Width~Sepal.Length, data=iris2, col=iris2$Species, type=c("p","r"), xlab="Sepal length (cm)", ylab="Sepal width (cm)", col.line="black")

plot of chunk unnamed-chunk-11

# I kept the species separate using the col command but the line is fit to all the points for both species

KickstartR: ANCOVA

Drew Kerkhoff