Fish movement

Overview

I recently had a student ask about ANCOVA for ecological data. As this isn’t the first time, I decided to take the opportunity to show an example of how to generate synthetic data in R and then analyze it using Analysis of Covariance (ANCOVA). The hypothesis being tested in each case is whether or not fork length impacts the distance moved by fish in two different streams. I’ll call the streams the Shribble and the Rush. I’ll leave it to you to Google where I got the names from.

ANCOVA with different intercepts but the same slope

The data:

set.seed(650)

#Create dummy variable by repeating Shribble and Rush 325 times each
Stream <- rep(c("Shribble", "Rush"), each = 325)

#Create fork length covariate by taking 325 samples from a normal distribution with a mean of 80 and a standard deviation of 15, then repeat for a distribution with a mean of 100 and a standard deviation of 20
fork <- c(rnorm(325, 80, 15), rnorm(325, 100, 20))

#Create the response variable with two intercepts (15 and 55), a single slope of 0.5, and random normal noise
distance <- rep(c(15, 55), each = 325) + 0.5*fork + rnorm(length(fork), 0, 20)

#Pull all three variables together in a single data frame
movement <- data.frame(Stream = as.factor(Stream), fork = fork, distance = distance)

Exploratory Analysis

summary(movement)

##       Stream         fork           distance      
##  Rush    :325   Min.   : 41.39   Min.   : -5.137  
##  Shribble:325   1st Qu.: 75.80   1st Qu.: 55.190  
##                 Median : 88.10   Median : 80.115  
##                 Mean   : 89.78   Mean   : 80.551  
##                 3rd Qu.:103.07   3rd Qu.:104.192  
##                 Max.   :153.77   Max.   :167.602

It looks like the data is normally distributed. After all, the mean and median are nearly the same. But let’s check out the data for each stream.

Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
summary(Shribble)

##       Stream         fork           distance      
##  Rush    :  0   Min.   : 41.75   Min.   : -5.137  
##  Shribble:325   1st Qu.: 69.29   1st Qu.: 41.816  
##                 Median : 80.37   Median : 55.354  
##                 Mean   : 79.94   Mean   : 55.814  
##                 3rd Qu.: 89.06   3rd Qu.: 69.905  
##                 Max.   :122.01   Max.   :113.096

summary(Rush)

##       Stream         fork           distance     
##  Rush    :325   Min.   : 41.39   Min.   : 45.05  
##  Shribble:  0   1st Qu.: 85.53   1st Qu.: 90.67  
##                 Median :100.34   Median :104.07  
##                 Mean   : 99.62   Mean   :105.29  
##                 3rd Qu.:112.57   3rd Qu.:119.88  
##                 Max.   :153.77   Max.   :167.60

boxplot(distance~Stream, data = movement, xlab = "Stream", ylab = "Distance moved")

We see the difference between the streams. Distance moved in the Shribble is far less than in the Rush (it better be—otherwise, we generated the data wrong).

shapiro.test(Shribble$distance)

## 
##  Shapiro-Wilk normality test
## 
## data:  Shribble$distance
## W = 0.99732, p-value = 0.8749

shapiro.test(Shribble$fork)

## 
##  Shapiro-Wilk normality test
## 
## data:  Shribble$fork
## W = 0.99632, p-value = 0.6576

shapiro.test(Rush$distance)

## 
##  Shapiro-Wilk normality test
## 
## data:  Rush$distance
## W = 0.99704, p-value = 0.8205

shapiro.test(Rush$fork)

## 
##  Shapiro-Wilk normality test
## 
## data:  Rush$fork
## W = 0.99734, p-value = 0.8793

The data is normal within each stream. Again, it better be. The data was generated by sampling from normal distributions.

Analysis of Covariance

I fitted a model with distance as the response, stream as the regressor, and fork length as the covariate, setting the model to calculate any interaction between stream and fork length.

model1 <- lm(distance~Stream*fork, data = movement)
summary(model1)

## 
## Call:
## lm(formula = distance ~ Stream * fork, data = movement)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62.09 -12.55  -0.61  13.53  54.57 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          55.39601    5.46092  10.144  < 2e-16 ***
## StreamShribble      -38.50664    8.00365  -4.811 1.87e-06 ***
## fork                  0.50080    0.05376   9.315  < 2e-16 ***
## StreamShribble:fork  -0.01390    0.08983  -0.155    0.877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.2 on 646 degrees of freedom
## Multiple R-squared:  0.6521, Adjusted R-squared:  0.6505 
## F-statistic: 403.7 on 3 and 646 DF,  p-value: < 2.2e-16

The good news? The interaction term is not significant. That means that the slopes of the regression lines in each stream are statistically the same and that they vary only by the intercept. If the interaction term was significant, you’d have to analyze each stream separately.

The other good news? The intercepts for the models for each stream are statistically significant (hey, just like I rigged the simulated data to show!). Taken all together, what it means is that regardless of fork length, the average distance moved in the Shribble is -38.506641 units less than the distance moved in the Rush.

library(ggplot2)
ggplot(movement, aes(x = fork, y = distance, color = Stream, shape = Stream)) +
        theme_bw() +
        labs(main = "Fish movement in response to fork length",
             y = "Distance moved",
             x = "Fork length") +
        geom_point() +
        geom_smooth(method = lm, formula = y ~ x, fullrange = TRUE)

And there you have it. You have two final models showing that even though the average movement was greater in the Rush, the impact of fork length on movement was the same for both rivers.

For the Rush, the regression model is

Distance moved = 55.3960084 + 0.5007981*fork length.

For the Shribble, the final model is found by adding the StreamShribble coefficient to the intercept term to get the final intercept:

Distance moved = 16.8893674 + 0.5007981*fork length.

GGPlot automatically calculates and plots the 95% confidence interval around each regression line but if you ever needed to calculate them by hand, use the predict function:

conf_int <- predict(model1, newdata = data.frame(movement$fork), interval = "confidence")
head(conf_int) #"head" prints out the first few lines. If you wanted the last few lines, use the "tail" function

##        fit      lwr      upr
## 1 56.29877 54.20228 58.39526
## 2 58.36319 56.14440 60.58198
## 3 66.05031 62.41669 69.68394
## 4 66.60754 62.84052 70.37456
## 5 46.81955 43.47436 50.16474
## 6 51.57365 49.14672 54.00058

Note: You can also get the prediction interval with the same code. Just change interval = “prediction”. As for the confidence interval for each regression term (if needed), use the confint function.

confint(model1, "fork", level = 0.95)

##         2.5 %    97.5 %
## fork 0.395228 0.6063683

If you need the confidence interval for the mean of each stream, use the following code:

Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
Shrib_mean <- mean(Shribble$distance)
Rush_mean <- mean(Rush$distance)
Shrib_sd <- sd(Shribble$distance)
Rush_sd <- sd(Rush$distance)
Shrib_n <- length(Shribble$distance)
Rush_n <- length(Rush$distance)
Shrib_se <- Shrib_sd/sqrt(Shrib_n)
Rush_se <- Rush_sd/sqrt(Rush_n)
Shrib_confint <- Shrib_mean + c(-1,1)*qt(0.975, df = Shrib_n - 1)*Shrib_se
Rush_confint <- Rush_mean + c(-1,1)*qt(0.975, df = Rush_n - 1)*Rush_se
Shrib_confint

## [1] 53.57819 58.04919

Rush_confint

## [1] 102.9304 107.6445

Now on to the second possibility: Different slopes.

ANCOVA with different slopes

This time, I’ll simulate data with different slopes and rerun the ANCOVA.

Generating data

set.seed(650)

#Create dummy variable by repeating Shribble and Rush 325 times each
Stream <- rep(c("Shribble", "Rush"), each = 325)

#Create fork length covariate by taking 325 samples from a normal distribution with a mean of 80 and a standard deviation of 15, then repeat for a distribution with a mean of 100 and a standard deviation of 20
fork <- c(rnorm(325, 80, 15), rnorm(325, 100, 20))

#Create the response variable with two intercepts (15 and 55), two slopes of 0.5 and 1.5, and random normal noise
distance <- rep(c(15, 55), each = 325) + rep(c(0.5, 1.5), each = 325)*fork + rnorm(length(fork), 0, 20)

#Pull all three variables together in a single data frame
movement <- data.frame(Stream = as.factor(Stream), fork = fork, distance = distance)

Exploratory Analysis

summary(movement)

##       Stream         fork           distance      
##  Rush    :325   Min.   : 41.39   Min.   : -5.137  
##  Shribble:325   1st Qu.: 75.80   1st Qu.: 55.355  
##                 Median : 88.10   Median :109.506  
##                 Mean   : 89.78   Mean   :130.362  
##                 3rd Qu.:103.07   3rd Qu.:202.445  
##                 Max.   :153.77   Max.   :297.764

As with the other analysis, it looks like the data is normally distributed. After all, the mean and median are nearly the same. But let’s check out the data for each stream.

Shribble <- subset(movement, Stream == "Shribble")
Rush <- subset(movement, Stream == "Rush")
summary(Shribble)

##       Stream         fork           distance      
##  Rush    :  0   Min.   : 41.75   Min.   : -5.137  
##  Shribble:325   1st Qu.: 69.29   1st Qu.: 41.816  
##                 Median : 80.37   Median : 55.354  
##                 Mean   : 79.94   Mean   : 55.814  
##                 3rd Qu.: 89.06   3rd Qu.: 69.905  
##                 Max.   :122.01   Max.   :113.096

summary(Rush)

##       Stream         fork           distance     
##  Rush    :325   Min.   : 41.39   Min.   : 99.04  
##  Shribble:  0   1st Qu.: 85.53   1st Qu.:183.55  
##                 Median :100.34   Median :202.60  
##                 Mean   : 99.62   Mean   :204.91  
##                 3rd Qu.:112.57   3rd Qu.:227.72  
##                 Max.   :153.77   Max.   :297.76

So far, so good. The means and medians for fork length and distance appear to be normal for each stream but the Shapiro-Wilks test will tell us for certain.

boxplot(distance~Stream, data = movement, xlab = "Stream", ylab = "Distance moved")

The box plot clearly shows the difference between the Shribble and the Rush, with even the outliers barely overlapping. The median for each stream looks like it is in the center of the distribution, further supporting the hypothesis that the data are normally distributed.

shapiro.test(Shribble$distance)

## 
##  Shapiro-Wilk normality test
## 
## data:  Shribble$distance
## W = 0.99732, p-value = 0.8749

shapiro.test(Shribble$fork)

## 
##  Shapiro-Wilk normality test
## 
## data:  Shribble$fork
## W = 0.99632, p-value = 0.6576

shapiro.test(Rush$distance)

## 
##  Shapiro-Wilk normality test
## 
## data:  Rush$distance
## W = 0.99581, p-value = 0.5391

shapiro.test(Rush$fork)

## 
##  Shapiro-Wilk normality test
## 
## data:  Rush$fork
## W = 0.99734, p-value = 0.8793

Finally, the Shapiro-Wilks test tells us what we already suspected: That the data for fork length and distance are normal for each stream.

Analysis of Covariance

I fitted a model with distance as the response, stream as the regressor, and fork length as the covariate, setting the model to calculate any interaction between stream and fork length.

model1 <- lm(distance~Stream*fork, data = movement)
summary(model1)

## 
## Call:
## lm(formula = distance ~ Stream * fork, data = movement)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62.09 -12.55  -0.61  13.53  54.57 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          55.39601    5.46092  10.144  < 2e-16 ***
## StreamShribble      -38.50664    8.00365  -4.811 1.87e-06 ***
## fork                  1.50080    0.05376  27.915  < 2e-16 ***
## StreamShribble:fork  -1.01390    0.08983 -11.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.2 on 646 degrees of freedom
## Multiple R-squared:  0.9427, Adjusted R-squared:  0.9424 
## F-statistic:  3540 on 3 and 646 DF,  p-value: < 2.2e-16

This time, the model shows that everything is significant. The StreamShribble variable shows us that the intercepts differ between streams. The interaction term (StreamShribble:fork) tells us that the slopes are different, as a plot of the data clearly shows.

library(ggplot2)
ggplot(movement, aes(x = fork, y = distance, color = Stream, shape = Stream)) +
        theme_bw() +
        labs(main = "Fish movement in response to fork length",
             y = "Distance moved",
             x = "Fork length") +
        geom_point() +
        geom_smooth(method = lm, formula = y ~ x, fullrange = TRUE)

From the model, we can tell that the regression model for the Rush is

Distance = 55.3960084 + 1.5007981*fork length, using only the intercept and the fork coefficient from the model.

The corresponding regression model for the Shribble is found by adding the StreamShribble coefficient to the intercept to get the final intercept and the StreamShribble:fork interaction term to the fork coefficient to get the slope.

Distance = 16.8893674 + 0.486896*fork length

We can verify each of these by splitting the data set, then analyzing each stream separately.

Shribble <- lm(distance~fork, data = Shribble)
summary(Shribble)

## 
## Call:
## lm(formula = distance ~ fork, data = Shribble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.473 -12.540  -0.691  14.458  48.068 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.88937    5.85039   2.887  0.00415 ** 
## fork         0.48690    0.07196   6.766 6.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.2 on 323 degrees of freedom
## Multiple R-squared:  0.1241, Adjusted R-squared:  0.1214 
## F-statistic: 45.78 on 1 and 323 DF,  p-value: 6.218e-11

Rush_model <- lm(distance~fork, data = Rush)
summary(Rush_model)

## 
## Call:
## lm(formula = distance ~ fork, data = Rush)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.087 -12.551  -0.334  12.849  54.574 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.39601    5.46169   10.14   <2e-16 ***
## fork         1.50080    0.05377   27.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.21 on 323 degrees of freedom
## Multiple R-squared:  0.7069, Adjusted R-squared:  0.706 
## F-statistic:   779 on 1 and 323 DF,  p-value: < 2.2e-16

The impact of fork length on fish movement is significantly different between the rivers, with greater movement in the Rush as the fork length increased. Additionally, the overall average movement was greater in the Rush.

As above, you can always use predict to manually calculate the 95% confidence interval for the regression line and confint to get the confidence intervals for each regression term.

Conclusion

And there you have it: A breakdown of how to use ANCOVA to simultaneously analyze and compare two sets of data. The only requirements are that the data sets have the same independent and dependent variables and meet the other assumptions required for regression.

Fish movement

Jim Milks

3/11/2021

Overview

ANCOVA with different intercepts but the same slope

The data:

Exploratory Analysis

Analysis of Covariance

ANCOVA with different slopes

Generating data

Exploratory Analysis

Analysis of Covariance

Conclusion