Imagine that we are interested in the habitat characteristics that favor Ctenochaetus strigosus, known as kole in Hawaiian, and with the English common name spotted surgeonfish in this dataset. We will focus on the role of different types of benthic substrate, and the most common substrate types in this dataset are turf algae (column ‘ta’), hard coral (column ‘hard_coral’), sand (column ‘sand’), and crustose coralline algae (column ‘cca’). I have included a dataset that only includes this species, and only the sampling period during which turf algae percent cover was measured.
library(here)
library(tidyverse)
library(GGally)
library(gridExtra)
library(arm)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)
kolesurvey <- read_csv(here("Week_04", "Data", "CRCP_Reef_Fish_Surveys_Hawaii_kole.csv"))
(1) Start by exploring the distributions of the four predictor variables, and how they are correlated with one another. The function ggpairs() in the package GGally is one nice option for this.
ggpairs(kolesurvey, columns = c("ta", "hard_coral","sand", "cca"))
(2) For each of the four predictors, make single-predictor models where the response variable is kole counts. Consider which of the predictors could be transformed to reduce skew along the x-axis. For educational purposes, start by using a poisson distribution for the counts. Quantify the degree of overdispersion in these models, and describe what it means. Based on likelihood ratio tests and effects plots, what are the relationships between kole and the four substrate types?
substrates <- c("ta", "hard_coral","sand", "cca") #predictors
for (i in substrates) {
substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
glm <- glm(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
family = poisson, #poisson distribution for counts
data = kolesurvey)
cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
print(summary(glm)$coefficients) #output only coefficients
cat("\nDispersion for:", i, "\n")
print(sum(residuals(glm, type = "pearson")^2)/(length(glm$y) - length(glm$coefficients)))
cat("\nAnova for:", i, "\n") #likelihood ratio tests
print(Anova(glm))
cat("\nEffects plot for:", i, "\n")
print(ggeffect(glm) %>% #effects plot
plot() +
labs(title = paste("Effects plot for:", i)))
cat("\nResiduals plot for:", i, "\n")
print(resid_xpanel(glm, type = "deviance")) #residuals plot, the deviance quantifies the lack of fit of the model, relative to a saturated (perfectly fit) model
}
##
## ---------------- Model for: ta ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.57842025 0.0352930632 101.39160 0.000000e+00
## ta -0.01770686 0.0006581199 -26.90522 1.908052e-159
##
## Dispersion for: ta
## [1] 30.10215
##
## Anova for: ta
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## ta 710.68 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: ta
##
## Residuals plot for: ta
##
## ---------------- Model for: hard_coral ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.03793235 0.0257020728 79.29058 0.000000e+00
## hard_coral 0.02437143 0.0006952852 35.05242 3.582141e-269
##
## Dispersion for: hard_coral
## [1] 28.13073
##
## Anova for: hard_coral
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## hard_coral 1091.4 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: hard_coral
##
## Residuals plot for: hard_coral
##
## ---------------- Model for: sand ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.94672701 0.018963622 155.38841 0.000000e+00
## sand -0.03577014 0.001685489 -21.22242 5.929532e-100
##
## Dispersion for: sand
## [1] 30.97697
##
## Anova for: sand
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## sand 574.42 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: sand
##
## Residuals plot for: sand
##
## ---------------- Model for: cca ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.27051551 0.021449453 105.85424 0.000000e+00
## cca 0.03443023 0.001153372 29.85179 8.321896e-196
##
## Dispersion for: cca
## [1] 30.8293
##
## Anova for: cca
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## cca 726.46 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: cca
##
## Residuals plot for: cca
The intercept is the mean count of kole when turf algae = 0. The negative slope shows that kole decreases in the presence of turf algae. The likelihood ratio tests all also yield p-values of 2.2e-16, meaning they are significant predictors of kole count. Following this pattern based on the slopes and LRT, we observe that kole count increases as hard coral and crustose coralline algae increase, and that kole count declines as sand increases. Sand and cca appear to both have left skewed residuals.
(3) Now make new single-predictor models that account for overdispersion (we discussed two different ways to do this in lecture, you can do one or both). How have the results changed compared to #2, and why?
for (i in substrates) {
substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
glmod <- glm.nb(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
data = kolesurvey)
cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
print(summary(glmod)$coefficients)
cat("\nDispersion for:", i, "\n")
print(sum(residuals(glmod, type = "pearson")^2)/(length(glmod$y) - length(glmod$coefficients)))
cat("\nAnova for:", i, "\n") #likelihood ratio tests
print(Anova(glmod, test = 'F'))
cat("\nEffects plot for:", i, "\n")
print(ggeffect(glmod) %>%
plot() +
labs(title = paste("Effects plot for:", i)))
}
##
## ---------------- Model for: ta ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.64697114 0.262055821 13.916772 5.010256e-44
## ta -0.01900419 0.004270196 -4.450425 8.570061e-06
##
## Dispersion for: ta
## [1] 0.9872545
##
## Anova for: ta
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## ta 19.342 1 19.592 1.333e-05 ***
## Residuals 304.074 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: ta
##
## ---------------- Model for: hard_coral ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.85803249 0.140368812 13.236790 5.379955e-40
## hard_coral 0.03172025 0.005175731 6.128651 8.862709e-10
##
## Dispersion for: hard_coral
## [1] 1.075455
##
## Anova for: hard_coral
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## hard_coral 33.44 1 31.091 5.388e-08 ***
## Residuals 331.24 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: hard_coral
##
## ---------------- Model for: sand ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.94178341 0.124215660 23.682871 5.414358e-124
## sand -0.03500807 0.007544063 -4.640479 3.476027e-06
##
## Dispersion for: sand
## [1] 0.9493442
##
## Anova for: sand
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## sand 17.888 1 18.842 1.928e-05 ***
## Residuals 292.398 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: sand
##
## ---------------- Model for: cca ------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.99950691 0.126583611 15.795938 3.318190e-56
## cca 0.06009838 0.009659176 6.221895 4.911852e-10
##
## Dispersion for: cca
## [1] 1.03896
##
## Anova for: cca
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## cca 26.53 1 25.534 7.466e-07 ***
## Residuals 320.00 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: cca
for (i in substrates) {
substrateformula <- as.formula(paste("count ~", i)) #response variable is kole counts, predictors = i
glmqp <- glm(formula = substrateformula, #count ~ ta, count ~ hard_coral, etc
family = quasipoisson,
data = kolesurvey)
cat("\n---------------- Model for:", i, "------------------------\n") #separate glms based on substrate
print(summary(glmqp)$coefficients)
cat("\nDispersion for:", i, "\n")
print(sum(residuals(glmqp, type = "pearson")^2)/(length(glmqp$y) - length(glmqp$coefficients)))
cat("\nAnova for:", i, "\n") #likelihood ratio tests
print(Anova(glmqp, test = 'F'))
cat("\nEffects plot for:", i, "\n")
print(ggeffect(glmqp) %>%
plot() +
labs(title = paste("Effects plot for:", i)))
}
##
## ---------------- Model for: ta ------------------------
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.57842025 0.193637065 18.480038 7.835281e-52
## ta -0.01770686 0.003610806 -4.903853 1.523712e-06
##
## Dispersion for: ta
## [1] 30.10215
##
## Anova for: ta
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## ta 710.7 1 23.609 1.883e-06 ***
## Residuals 9271.5 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: ta
##
## ---------------- Model for: hard_coral ------------------------
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.03793235 0.136319765 14.949647 2.258606e-38
## hard_coral 0.02437143 0.003687684 6.608871 1.704069e-10
##
## Dispersion for: hard_coral
## [1] 28.13073
##
## Anova for: hard_coral
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## hard_coral 1091.4 1 38.797 1.541e-09 ***
## Residuals 8664.3 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: hard_coral
##
## ---------------- Model for: sand ------------------------
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94672701 0.105546388 27.918786 2.282019e-86
## sand -0.03577014 0.009380974 -3.813052 1.658030e-04
##
## Dispersion for: sand
## [1] 30.97697
##
## Anova for: sand
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## sand 574.4 1 18.543 2.235e-05 ***
## Residuals 9540.9 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: sand
##
## ---------------- Model for: cca ------------------------
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.27051551 0.119096296 19.06454 4.628841e-54
## cca 0.03443023 0.006404003 5.37636 1.504400e-07
##
## Dispersion for: cca
## [1] 30.8293
##
## Anova for: cca
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## cca 726.5 1 23.564 1.924e-06 ***
## Residuals 9495.4 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effects plot for: cca
If there is no overdispersion then the dispersion parameter should be ~1, and a value greater than 1 indicates overdispersion. Our dispersion parameter for the four models ranges between 28.13073 and 30.97697 for the negative binomial model. This indicates there is extra variance in the models which may result in a type 1 error if not corrected for overdispersion.
(4) Although the single-predictor models are useful, we may get a more accurate picture of the role of each substrate type if we include all the substrate types in one model. Create such a model. Based on likelihood ratio tests and effects plots, how do the results change when all predictors are included simultaneously? Why do you think the results have changed? Don’t forget to make some residual diagnostic plots as well. Finally, what are your overall conclusions about the substrate associations of kole, from this look at the data? Are there any additional analyses you would want to do that we have not done in this assignment?
substratesmodel = glm(count ~ ta + hard_coral + sand + cca,
family = poisson,
data = kolesurvey)
summary(substratesmodel)
##
## Call:
## glm(formula = count ~ ta + hard_coral + sand + cca, family = poisson,
## data = kolesurvey)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.853031 0.360774 -2.364 0.0181 *
## ta 0.029757 0.003686 8.074 6.8e-16 ***
## hard_coral 0.048936 0.003659 13.373 < 2e-16 ***
## sand 0.010145 0.003997 2.538 0.0111 *
## cca 0.055212 0.003973 13.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7573.1 on 309 degrees of freedom
## Residual deviance: 5817.6 on 305 degrees of freedom
## AIC: 6744.5
##
## Number of Fisher Scoring iterations: 6
Anova(substratesmodel)
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## ta 75.529 1 < 2.2e-16 ***
## hard_coral 230.581 1 < 2.2e-16 ***
## sand 6.645 1 0.009942 **
## cca 231.100 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dispersion <- sum(residuals(substratesmodel, type = "pearson")^2)/(length(substratesmodel$y) - length(substratesmodel$coefficients))
print(dispersion)
## [1] 26.93165
resid_xpanel(substratesmodel, type = "deviance")
mpqp = glm(count ~ ta + hard_coral + sand + cca,
family= quasipoisson,
data = kolesurvey)
Anova(mpqp, test= 'F')
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## Error estimate based on Pearson residuals
##
## Sum Sq Df F value Pr(>F)
## ta 75.5 1 2.8045 0.095026 .
## hard_coral 230.6 1 8.5617 0.003691 **
## sand 6.6 1 0.2467 0.619733
## cca 231.1 1 8.5810 0.003653 **
## Residuals 8214.2 305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resid_xpanel(mpqp, type = "pearson") #use pearson for quaisipoisson
resid_panel(mpqp, plots = c('resid', 'qq', 'lev', 'hist'))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
taqp = plot(ggeffect(mpqp, terms = "ta"))
hcqp = plot(ggeffect(mpqp, terms = "hard_coral"))
sqp = plot(ggeffect(mpqp, terms = "sand"))
ccaqp = plot(ggeffect(mpqp, terms = "cca"))
(taqp|hcqp)/(sqp|ccaqp)
All predictors in the poisson model are significant with extremely low p-values, versus the quasipoisson model where only hard coral and crustose coralline algae are significant and have larger p-values (although below 0.05) than the poisson model.
There is a significant positive interaction between kole and hard coral and crustose coralline algae. As hard coral and crustose coralline algae increase, kole also increases. Additional analysis that could be interesting to conduct could be including habitat type and depth as predictors.