Imagine that we are interested in the habitat characteristics that favor Ctenochaetus strigosus, known as kole in Hawaiian, and with the English common name spotted surgeonfish in this dataset. We will focus on the role of different types of benthic substrate, and the most common substrate types in this dataset are turf algae (column ‘ta’), hard coral (column ‘hard_coral’), sand (column ‘sand’), and crustose coralline algae (column ‘cca’). I have included a dataset that only includes this species, and only the sampling period during which turf algae percent cover was measured.

Load libraries and data

library(here)
library(tidyverse)
library(GGally)
library(gridExtra)
library(arm)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)

kolesurvey <- read_csv(here("Week_04", "Data", "CRCP_Reef_Fish_Surveys_Hawaii_kole.csv"))

Explore the distributions!

(1) Start by exploring the distributions of the four predictor variables, and how they are correlated with one another. The function ggpairs() in the package GGally is one nice option for this.

ggpairs(kolesurvey, columns = c("ta", "hard_coral","sand", "cca"))

Single-predictor models

(2) For each of the four predictors, make single-predictor models where the response variable is kole counts. Consider which of the predictors could be transformed to reduce skew along the x-axis. For educational purposes, start by using a poisson distribution for the counts. Quantify the degree of overdispersion in these models, and describe what it means. Based on likelihood ratio tests and effects plots, what are the relationships between kole and the four substrate types?

substrates <- c("ta", "hard_coral","sand", "cca") #predictors

for (i in substrates) {
  substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
  glm <- glm(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
             family = poisson, #poisson distribution for counts
             data = kolesurvey)
  cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
  print(summary(glm)$coefficients) #output only coefficients
  
  cat("\nDispersion for:", i, "\n")
  print(sum(residuals(glm, type = "pearson")^2)/(length(glm$y) - length(glm$coefficients)))
  
  cat("\nAnova for:", i, "\n") #likelihood ratio tests
  print(Anova(glm))
  
  cat("\nEffects plot for:", i, "\n")
  print(ggeffect(glm) %>% #effects plot
          plot() +
          labs(title = paste("Effects plot for:", i)))
  
  cat("\nResiduals plot for:", i, "\n") 
  print(resid_xpanel(glm, type = "deviance")) #residuals plot, the deviance quantifies the lack of fit of the model, relative to a saturated (perfectly fit) model
}
## 
## ---------------- Model for: ta ------------------------
##                Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept)  3.57842025 0.0352930632 101.39160  0.000000e+00
## ta          -0.01770686 0.0006581199 -26.90522 1.908052e-159
## 
## Dispersion for: ta 
## [1] 30.10215
## 
## Anova for: ta 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##    LR Chisq Df Pr(>Chisq)    
## ta   710.68  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: ta

## 
## Residuals plot for: ta

## 
## ---------------- Model for: hard_coral ------------------------
##               Estimate   Std. Error  z value      Pr(>|z|)
## (Intercept) 2.03793235 0.0257020728 79.29058  0.000000e+00
## hard_coral  0.02437143 0.0006952852 35.05242 3.582141e-269
## 
## Dispersion for: hard_coral 
## [1] 28.13073
## 
## Anova for: hard_coral 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##            LR Chisq Df Pr(>Chisq)    
## hard_coral   1091.4  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: hard_coral

## 
## Residuals plot for: hard_coral

## 
## ---------------- Model for: sand ------------------------
##                Estimate  Std. Error   z value      Pr(>|z|)
## (Intercept)  2.94672701 0.018963622 155.38841  0.000000e+00
## sand        -0.03577014 0.001685489 -21.22242 5.929532e-100
## 
## Dispersion for: sand 
## [1] 30.97697
## 
## Anova for: sand 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##      LR Chisq Df Pr(>Chisq)    
## sand   574.42  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: sand

## 
## Residuals plot for: sand

## 
## ---------------- Model for: cca ------------------------
##               Estimate  Std. Error   z value      Pr(>|z|)
## (Intercept) 2.27051551 0.021449453 105.85424  0.000000e+00
## cca         0.03443023 0.001153372  29.85179 8.321896e-196
## 
## Dispersion for: cca 
## [1] 30.8293
## 
## Anova for: cca 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##     LR Chisq Df Pr(>Chisq)    
## cca   726.46  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: cca

## 
## Residuals plot for: cca

The intercept is the mean count of kole when turf algae = 0. The negative slope shows that kole decreases in the presence of turf algae. The likelihood ratio tests all also yield p-values of 2.2e-16, meaning they are significant predictors of kole count. Following this pattern based on the slopes and LRT, we observe that kole count increases as hard coral and crustose coralline algae increase, and that kole count declines as sand increases. Sand and cca appear to both have left skewed residuals.

Single-predictor models & overdispersion

(3) Now make new single-predictor models that account for overdispersion (we discussed two different ways to do this in lecture, you can do one or both). How have the results changed compared to #2, and why?

Negative Binomial

for (i in substrates) {
  substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
  glmod <- glm.nb(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
                  data = kolesurvey)
  cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
  print(summary(glmod)$coefficients)
  
  cat("\nDispersion for:", i, "\n")
  print(sum(residuals(glmod, type = "pearson")^2)/(length(glmod$y) - length(glmod$coefficients)))
  
  cat("\nAnova for:", i, "\n") #likelihood ratio tests
  print(Anova(glmod, test = 'F'))
  
  cat("\nEffects plot for:", i, "\n")
  print(ggeffect(glmod) %>%
          plot() +
          labs(title = paste("Effects plot for:", i)))
}
## 
## ---------------- Model for: ta ------------------------
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  3.64697114 0.262055821 13.916772 5.010256e-44
## ta          -0.01900419 0.004270196 -4.450425 8.570061e-06
## 
## Dispersion for: ta 
## [1] 0.9872545
## 
## Anova for: ta 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq  Df F value    Pr(>F)    
## ta         19.342   1  19.592 1.333e-05 ***
## Residuals 304.074 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: ta

## 
## ---------------- Model for: hard_coral ------------------------
##               Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) 1.85803249 0.140368812 13.236790 5.379955e-40
## hard_coral  0.03172025 0.005175731  6.128651 8.862709e-10
## 
## Dispersion for: hard_coral 
## [1] 1.075455
## 
## Anova for: hard_coral 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq  Df F value    Pr(>F)    
## hard_coral  33.44   1  31.091 5.388e-08 ***
## Residuals  331.24 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: hard_coral

## 
## ---------------- Model for: sand ------------------------
##                Estimate  Std. Error   z value      Pr(>|z|)
## (Intercept)  2.94178341 0.124215660 23.682871 5.414358e-124
## sand        -0.03500807 0.007544063 -4.640479  3.476027e-06
## 
## Dispersion for: sand 
## [1] 0.9493442
## 
## Anova for: sand 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq  Df F value    Pr(>F)    
## sand       17.888   1  18.842 1.928e-05 ***
## Residuals 292.398 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: sand

## 
## ---------------- Model for: cca ------------------------
##               Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) 1.99950691 0.126583611 15.795938 3.318190e-56
## cca         0.06009838 0.009659176  6.221895 4.911852e-10
## 
## Dispersion for: cca 
## [1] 1.03896
## 
## Anova for: cca 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##           Sum Sq  Df F value    Pr(>F)    
## cca        26.53   1  25.534 7.466e-07 ***
## Residuals 320.00 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: cca

Quasipoisson

for (i in substrates) {
  substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
  glmqp <- glm(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
               family = quasipoisson,
                  data = kolesurvey)
  
  cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
  print(summary(glmqp)$coefficients)
  
  cat("\nDispersion for:", i, "\n")
  print(sum(residuals(glmqp, type = "pearson")^2)/(length(glmqp$y) - length(glmqp$coefficients)))
  
  cat("\nAnova for:", i, "\n") #likelihood ratio tests
  print(Anova(glmqp, test = 'F'))
  
  cat("\nEffects plot for:", i, "\n")
  print(ggeffect(glmqp) %>%
          plot() +
          labs(title = paste("Effects plot for:", i)))
}
## 
## ---------------- Model for: ta ------------------------
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  3.57842025 0.193637065 18.480038 7.835281e-52
## ta          -0.01770686 0.003610806 -4.903853 1.523712e-06
## 
## Dispersion for: ta 
## [1] 30.10215
## 
## Anova for: ta 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##           Sum Sq  Df F value    Pr(>F)    
## ta         710.7   1  23.609 1.883e-06 ***
## Residuals 9271.5 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: ta

## 
## ---------------- Model for: hard_coral ------------------------
##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 2.03793235 0.136319765 14.949647 2.258606e-38
## hard_coral  0.02437143 0.003687684  6.608871 1.704069e-10
## 
## Dispersion for: hard_coral 
## [1] 28.13073
## 
## Anova for: hard_coral 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq  Df F value    Pr(>F)    
## hard_coral 1091.4   1  38.797 1.541e-09 ***
## Residuals  8664.3 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: hard_coral

## 
## ---------------- Model for: sand ------------------------
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  2.94672701 0.105546388 27.918786 2.282019e-86
## sand        -0.03577014 0.009380974 -3.813052 1.658030e-04
## 
## Dispersion for: sand 
## [1] 30.97697
## 
## Anova for: sand 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##           Sum Sq  Df F value    Pr(>F)    
## sand       574.4   1  18.543 2.235e-05 ***
## Residuals 9540.9 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: sand

## 
## ---------------- Model for: cca ------------------------
##               Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept) 2.27051551 0.119096296 19.06454 4.628841e-54
## cca         0.03443023 0.006404003  5.37636 1.504400e-07
## 
## Dispersion for: cca 
## [1] 30.8293
## 
## Anova for: cca 
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##           Sum Sq  Df F value    Pr(>F)    
## cca        726.5   1  23.564 1.924e-06 ***
## Residuals 9495.4 308                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Effects plot for: cca

If there is no overdispersion then the dispersion parameter should be ~1, and a value greater than 1 indicates overdispersion. Our dispersion parameter for the four models ranges between 28.13073 and 30.97697 for the negative binomial model. This indicates there is extra variance in the models which may result in a type 1 error if not corrected for overdispersion.

Multiple-predictor models

(4) Although the single-predictor models are useful, we may get a more accurate picture of the role of each substrate type if we include all the substrate types in one model. Create such a model. Based on likelihood ratio tests and effects plots, how do the results change when all predictors are included simultaneously? Why do you think the results have changed? Don’t forget to make some residual diagnostic plots as well. Finally, what are your overall conclusions about the substrate associations of kole, from this look at the data? Are there any additional analyses you would want to do that we have not done in this assignment?

Poisson

substratesmodel = glm(count ~ ta + hard_coral + sand + cca,
                      family = poisson,
                      data = kolesurvey)
summary(substratesmodel)
## 
## Call:
## glm(formula = count ~ ta + hard_coral + sand + cca, family = poisson, 
##     data = kolesurvey)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.853031   0.360774  -2.364   0.0181 *  
## ta           0.029757   0.003686   8.074  6.8e-16 ***
## hard_coral   0.048936   0.003659  13.373  < 2e-16 ***
## sand         0.010145   0.003997   2.538   0.0111 *  
## cca          0.055212   0.003973  13.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7573.1  on 309  degrees of freedom
## Residual deviance: 5817.6  on 305  degrees of freedom
## AIC: 6744.5
## 
## Number of Fisher Scoring iterations: 6
Anova(substratesmodel)
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##            LR Chisq Df Pr(>Chisq)    
## ta           75.529  1  < 2.2e-16 ***
## hard_coral  230.581  1  < 2.2e-16 ***
## sand          6.645  1   0.009942 ** 
## cca         231.100  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dispersion <- sum(residuals(substratesmodel, type = "pearson")^2)/(length(substratesmodel$y) - length(substratesmodel$coefficients))
print(dispersion)
## [1] 26.93165
resid_xpanel(substratesmodel, type = "deviance")

Quassipoisson

mpqp = glm(count ~ ta + hard_coral + sand + cca, 
           family= quasipoisson, 
           data = kolesurvey)
Anova(mpqp, test= 'F')
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq  Df F value   Pr(>F)   
## ta           75.5   1  2.8045 0.095026 . 
## hard_coral  230.6   1  8.5617 0.003691 **
## sand          6.6   1  0.2467 0.619733   
## cca         231.1   1  8.5810 0.003653 **
## Residuals  8214.2 305                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resid_xpanel(mpqp, type = "pearson") #use pearson for quaisipoisson

resid_panel(mpqp, plots = c('resid', 'qq', 'lev', 'hist'))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Effects plot

taqp = plot(ggeffect(mpqp, terms = "ta"))
hcqp = plot(ggeffect(mpqp, terms = "hard_coral"))
sqp = plot(ggeffect(mpqp, terms = "sand"))
ccaqp = plot(ggeffect(mpqp, terms = "cca"))

(taqp|hcqp)/(sqp|ccaqp)

All predictors in the poisson model are significant with extremely low p-values, versus the quasipoisson model where only hard coral and crustose coralline algae are significant and have larger p-values (although below 0.05) than the poisson model.

There is a significant positive interaction between kole and hard coral and crustose coralline algae. As hard coral and crustose coralline algae increase, kole also increases. Additional analysis that could be interesting to conduct could be including habitat type and depth as predictors.