A generalized linear model (GLM) is essentially a linear model that does not assume a normally distributed response variable. GLMs are therefore non-parametric tests that use non-Gaussian distributions such as Poisson or negative binomial to ask: do the values of one or more independent variables predict the values of one dependent variable? GLMs are useful when data do not follow normal distributions. Count data in particular are typically characterized by zero-truncation leading to one-inflated data or data skewed to the right. In this case, and when response values are integers (as with count data), a Poisson error distribution may be appropriate to use. When data are overdispersed, a negative binomial error distribution may be more appropriate. To determine the best model for the data, the models can be evaluated with Akaike’s Information Criterion (AIC). Additionally, GLMs can incorporate multiple predictor variables as additive or interactive terms, which can help one better understand how multiple factors impact the response variable.
Load packages
library(ggplot2)
library(tidyverse)
library(knitr)
library(MASS)
library(car)
library(MuMIn)
library(multcomp)
Load data
data <- read.csv("PAINTS_data_v3.csv")
data$site <- as.factor(data$site)
Plot data
ggplot(data, aes(site, flowering_species_richness)) +
geom_boxplot() +
theme_classic() +
xlab("Site") +
ylab("Flowering Species Richness")
Figure 1. Species richness at three different sites near the Chicago Botanic Garden. Data is from the Plant-Animal Interactions course project. Data collected by Sam Rosa, Andrew Davies, and myself.
What does the response variable look like?
hist(data$flowering_species_richness, data=data)
## Warning in plot.window(xlim, ylim, "", ...): "data" is not a graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, ...): "data" is not a graphical parameter
Figure 2. These data do not look like they follow a normal distribution. There appears to be a lot of data clustered at lower values, which is a result of this being count data.
Create a GLM with a Poisson error distribution and view the model
summary
[code adapted from PAINTS salty nectar R activity by Paul]
sr.poisson <- glm(flowering_species_richness ~ site, family = poisson, data = data)
car::Anova(sr.poisson, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: flowering_species_richness
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## site 6.4781 2 4.6738 0.01807 *
## Residuals 18.7116 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Create a GLM with a negative binomial error distribution and view the
model summary
[code adapted from PAINTS salty nectar R activity by Paul]
sr.nb <- MASS::glm.nb(flowering_species_richness ~ site, data = data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
car::Anova(sr.nb, type = 3, test.statistic = "F") #view summary
## Analysis of Deviance Table (Type III tests)
##
## Response: flowering_species_richness
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## site 6.4779 2 4.6738 0.01807 *
## Residuals 18.7108 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which model is best? Compare AIC values to find out.
AIC <- AICc(sr.poisson, sr.nb)
knitr::kable(AIC,
align = "ccc",
caption = "Table 1. AIC values of two GLMs with Poisson or negative binomial error distributions. The model with the Poisson distribution is approximately 3 values lower than the other, suggesting that the GLM with the Poisson distribution is a better model for the data.")
| df | AICc | |
|---|---|---|
| sr.poisson | 3 | 87.84062 |
| sr.nb | 4 | 90.51802 |
Conduct a post-hoc test for the GLM with the Poisson distribution to detect which groups are different from one another.
tukey.sr.poisson <- glht(sr.poisson, mcp(site = "Tukey"))
summary(tukey.sr.poisson)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: glm(formula = flowering_species_richness ~ site, family = poisson,
## data = data)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## McDonald Woods - Barbara Brown == 0 0.1112 0.3338 0.333 0.9400
## Skokie Lagoons - Barbara Brown == 0 -0.8873 0.4491 -1.976 0.1162
## Skokie Lagoons - McDonald Woods == 0 -0.9985 0.4421 -2.258 0.0607 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
It appears that none of the groups are technically signficantly different from one another, but there is likely a biologically significant difference in species richness between McDonald Woods and Skokie Lagoons.