I recently had a student ask about ANCOVA for ecological data. As this isn’t the first time, I decided to take the opportunity to show an example of how to generate synthetic data in R and then analyze it using Analysis of Covariance (ANCOVA). The hypothesis being tested in each case is whether or not fork length impacts the distance moved by fish in two different streams. I’ll call the streams the Shribble and the Rush. I’ll leave it to you to Google where I got the names from.
set.seed(650)
#Create dummy variable by repeating Shribble and Rush 325 times each
Stream <- rep(c("Shribble", "Rush"), each = 325)
#Create fork length covariate by taking 325 samples from a normal distribution with a mean of 80 and a standard deviation of 15, then repeat for a distribution with a mean of 100 and a standard deviation of 20
fork <- c(rnorm(325, 80, 15), rnorm(325, 100, 20))
#Create the response variable with two intercepts (15 and 55), a single slope of 0.5, and random normal noise
distance <- rep(c(15, 55), each = 325) + 0.5*fork + rnorm(length(fork), 0, 20)
#Pull all three variables together in a single data frame
movement <- data.frame(Stream = as.factor(Stream), fork = fork, distance = distance)
summary(movement)
## Stream fork distance
## Rush :325 Min. : 41.39 Min. : -5.137
## Shribble:325 1st Qu.: 75.80 1st Qu.: 55.190
## Median : 88.10 Median : 80.115
## Mean : 89.78 Mean : 80.551
## 3rd Qu.:103.07 3rd Qu.:104.192
## Max. :153.77 Max. :167.602
It looks like the data is normally distributed. After all, the mean and median are nearly the same. But let’s check out the data for each stream.
Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
summary(Shribble)
## Stream fork distance
## Rush : 0 Min. : 41.75 Min. : -5.137
## Shribble:325 1st Qu.: 69.29 1st Qu.: 41.816
## Median : 80.37 Median : 55.354
## Mean : 79.94 Mean : 55.814
## 3rd Qu.: 89.06 3rd Qu.: 69.905
## Max. :122.01 Max. :113.096
summary(Rush)
## Stream fork distance
## Rush :325 Min. : 41.39 Min. : 45.05
## Shribble: 0 1st Qu.: 85.53 1st Qu.: 90.67
## Median :100.34 Median :104.07
## Mean : 99.62 Mean :105.29
## 3rd Qu.:112.57 3rd Qu.:119.88
## Max. :153.77 Max. :167.60
boxplot(distance~Stream, data = movement, xlab = "Stream", ylab = "Distance moved")
We see the difference between the streams. Distance moved in the Shribble is far less than in the Rush (it better be—otherwise, we generated the data wrong).
shapiro.test(Shribble$distance)
##
## Shapiro-Wilk normality test
##
## data: Shribble$distance
## W = 0.99732, p-value = 0.8749
shapiro.test(Shribble$fork)
##
## Shapiro-Wilk normality test
##
## data: Shribble$fork
## W = 0.99632, p-value = 0.6576
shapiro.test(Rush$distance)
##
## Shapiro-Wilk normality test
##
## data: Rush$distance
## W = 0.99704, p-value = 0.8205
shapiro.test(Rush$fork)
##
## Shapiro-Wilk normality test
##
## data: Rush$fork
## W = 0.99734, p-value = 0.8793
The data is normal within each stream. Again, it better be. The data was generated by sampling from normal distributions.
I fitted a model with distance as the response, stream as the regressor, and fork length as the covariate, setting the model to calculate any interaction between stream and fork length.
model1 <- lm(distance~Stream*fork, data = movement)
summary(model1)
##
## Call:
## lm(formula = distance ~ Stream * fork, data = movement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.09 -12.55 -0.61 13.53 54.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39601 5.46092 10.144 < 2e-16 ***
## StreamShribble -38.50664 8.00365 -4.811 1.87e-06 ***
## fork 0.50080 0.05376 9.315 < 2e-16 ***
## StreamShribble:fork -0.01390 0.08983 -0.155 0.877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.2 on 646 degrees of freedom
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6505
## F-statistic: 403.7 on 3 and 646 DF, p-value: < 2.2e-16
The good news? The interaction term is not significant. That means that the slopes of the regression lines in each stream are statistically the same and that they vary only by the intercept. If the interaction term was significant, you’d have to analyze each stream separately.
The other good news? The intercepts for the models for each stream are statistically significant (hey, just like I rigged the simulated data to show!). Taken all together, what it means is that regardless of fork length, the average distance moved in the Shribble is -38.506641 units less than the distance moved in the Rush.
library(ggplot2)
ggplot(movement, aes(x = fork, y = distance, color = Stream, shape = Stream)) +
theme_bw() +
labs(main = "Fish movement in response to fork length",
y = "Distance moved",
x = "Fork length") +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, fullrange = TRUE)
And there you have it. You have two final models showing that even though the average movement was greater in the Rush, the impact of fork length on movement was the same for both rivers.
For the Rush, the regression model is
Distance moved = 55.3960084 + 0.5007981*fork length.
For the Shribble, the final model is found by adding the StreamShribble coefficient to the intercept term to get the final intercept:
Distance moved = 16.8893674 + 0.5007981*fork length.
GGPlot automatically calculates and plots the 95% confidence interval around each regression line but if you ever needed to calculate them by hand, use the predict function:
conf_int <- predict(model1, newdata = data.frame(movement$fork), interval = "confidence")
head(conf_int) #"head" prints out the first few lines. If you wanted the last few lines, use the "tail" function
## fit lwr upr
## 1 56.29877 54.20228 58.39526
## 2 58.36319 56.14440 60.58198
## 3 66.05031 62.41669 69.68394
## 4 66.60754 62.84052 70.37456
## 5 46.81955 43.47436 50.16474
## 6 51.57365 49.14672 54.00058
Note: You can also get the prediction interval with the same code. Just change interval = “prediction”. As for the confidence interval for each regression term (if needed), use the confint function.
confint(model1, "fork", level = 0.95)
## 2.5 % 97.5 %
## fork 0.395228 0.6063683
If you need the confidence interval for the mean of each stream, use the following code:
Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
Shrib_mean <- mean(Shribble$distance)
Rush_mean <- mean(Rush$distance)
Shrib_sd <- sd(Shribble$distance)
Rush_sd <- sd(Rush$distance)
Shrib_n <- length(Shribble$distance)
Rush_n <- length(Rush$distance)
Shrib_se <- Shrib_sd/sqrt(Shrib_n)
Rush_se <- Rush_sd/sqrt(Rush_n)
Shrib_confint <- Shrib_mean + c(-1,1)*qt(0.975, df = Shrib_n - 1)*Shrib_se
Rush_confint <- Rush_mean + c(-1,1)*qt(0.975, df = Rush_n - 1)*Rush_se
Shrib_confint
## [1] 53.57819 58.04919
Rush_confint
## [1] 102.9304 107.6445
Now on to the second possibility: Different slopes.
This time, I’ll simulate data with different slopes and rerun the ANCOVA.
set.seed(650)
#Create dummy variable by repeating Shribble and Rush 325 times each
Stream <- rep(c("Shribble", "Rush"), each = 325)
#Create fork length covariate by taking 325 samples from a normal distribution with a mean of 80 and a standard deviation of 15, then repeat for a distribution with a mean of 100 and a standard deviation of 20
fork <- c(rnorm(325, 80, 15), rnorm(325, 100, 20))
#Create the response variable with two intercepts (15 and 55), two slopes of 0.5 and 1.5, and random normal noise
distance <- rep(c(15, 55), each = 325) + rep(c(0.5, 1.5), each = 325)*fork + rnorm(length(fork), 0, 20)
#Pull all three variables together in a single data frame
movement <- data.frame(Stream = as.factor(Stream), fork = fork, distance = distance)
summary(movement)
## Stream fork distance
## Rush :325 Min. : 41.39 Min. : -5.137
## Shribble:325 1st Qu.: 75.80 1st Qu.: 55.355
## Median : 88.10 Median :109.506
## Mean : 89.78 Mean :130.362
## 3rd Qu.:103.07 3rd Qu.:202.445
## Max. :153.77 Max. :297.764
As with the other analysis, it looks like the data is normally distributed. After all, the mean and median are nearly the same. But let’s check out the data for each stream.
Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
summary(Shribble)
## Stream fork distance
## Rush : 0 Min. : 41.75 Min. : -5.137
## Shribble:325 1st Qu.: 69.29 1st Qu.: 41.816
## Median : 80.37 Median : 55.354
## Mean : 79.94 Mean : 55.814
## 3rd Qu.: 89.06 3rd Qu.: 69.905
## Max. :122.01 Max. :113.096
summary(Rush)
## Stream fork distance
## Rush :325 Min. : 41.39 Min. : 99.04
## Shribble: 0 1st Qu.: 85.53 1st Qu.:183.55
## Median :100.34 Median :202.60
## Mean : 99.62 Mean :204.91
## 3rd Qu.:112.57 3rd Qu.:227.72
## Max. :153.77 Max. :297.76
So far, so good. The means and medians for fork length and distance appear to be normal for each stream but the Shapiro-Wilks test will tell us for certain.
boxplot(distance~Stream, data = movement, xlab = "Stream", ylab = "Distance moved")
The box plot clearly shows the difference between the Shribble and the Rush, with even the outliers barely overlapping. The median for each stream looks like it is in the center of the distribution, further supporting the hypothesis that the data are normally distributed.
shapiro.test(Shribble$distance)
##
## Shapiro-Wilk normality test
##
## data: Shribble$distance
## W = 0.99732, p-value = 0.8749
shapiro.test(Shribble$fork)
##
## Shapiro-Wilk normality test
##
## data: Shribble$fork
## W = 0.99632, p-value = 0.6576
shapiro.test(Rush$distance)
##
## Shapiro-Wilk normality test
##
## data: Rush$distance
## W = 0.99581, p-value = 0.5391
shapiro.test(Rush$fork)
##
## Shapiro-Wilk normality test
##
## data: Rush$fork
## W = 0.99734, p-value = 0.8793
Finally, the Shapiro-Wilks test tells us what we already suspected: That the data for fork length and distance are normal for each stream.
I fitted a model with distance as the response, stream as the regressor, and fork length as the covariate, setting the model to calculate any interaction between stream and fork length.
model1 <- lm(distance~Stream*fork, data = movement)
summary(model1)
##
## Call:
## lm(formula = distance ~ Stream * fork, data = movement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.09 -12.55 -0.61 13.53 54.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39601 5.46092 10.144 < 2e-16 ***
## StreamShribble -38.50664 8.00365 -4.811 1.87e-06 ***
## fork 1.50080 0.05376 27.915 < 2e-16 ***
## StreamShribble:fork -1.01390 0.08983 -11.287 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.2 on 646 degrees of freedom
## Multiple R-squared: 0.9427, Adjusted R-squared: 0.9424
## F-statistic: 3540 on 3 and 646 DF, p-value: < 2.2e-16
This time, the model shows that everything is significant. The StreamShribble variable shows us that the intercepts differ between streams. The interaction term (StreamShribble:fork) tells us that the slopes are different, as a plot of the data clearly shows.
library(ggplot2)
ggplot(movement, aes(x = fork, y = distance, color = Stream, shape = Stream)) +
theme_bw() +
labs(main = "Fish movement in response to fork length",
y = "Distance moved",
x = "Fork length") +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, fullrange = TRUE)
From the model, we can tell that the regression model for the Rush is
Distance = 55.3960084 + 1.5007981*fork length, using only the intercept and the fork coefficient from the model.
The corresponding regression model for the Shribble is found by adding the StreamShribble coefficient to the intercept to get the final intercept and the StreamShribble:fork interaction term to the fork coefficient to get the slope.
Distance = 16.8893674 + 0.486896*fork length
We can verify each of these by splitting the data set, then analyzing each stream separately.
Shribble <- lm(distance~fork, data = Shribble)
summary(Shribble)
##
## Call:
## lm(formula = distance ~ fork, data = Shribble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.473 -12.540 -0.691 14.458 48.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.88937 5.85039 2.887 0.00415 **
## fork 0.48690 0.07196 6.766 6.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.2 on 323 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: 0.1214
## F-statistic: 45.78 on 1 and 323 DF, p-value: 6.218e-11
Rush_model <- lm(distance~fork, data = Rush)
summary(Rush_model)
##
## Call:
## lm(formula = distance ~ fork, data = Rush)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.087 -12.551 -0.334 12.849 54.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39601 5.46169 10.14 <2e-16 ***
## fork 1.50080 0.05377 27.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.21 on 323 degrees of freedom
## Multiple R-squared: 0.7069, Adjusted R-squared: 0.706
## F-statistic: 779 on 1 and 323 DF, p-value: < 2.2e-16
The impact of fork length on fish movement is significantly different between the rivers, with greater movement in the Rush as the fork length increased. Additionally, the overall average movement was greater in the Rush.
As above, you can always use predict to manually calculate the 95% confidence interval for the regression line and confint to get the confidence intervals for each regression term.
And there you have it: A breakdown of how to use ANCOVA to simultaneously analyze and compare two sets of data. The only requirements are that the data sets have the same independent and dependent variables and meet the other assumptions required for regression.