This material corresponds to Week 6 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.
It documents Poisson regression activities using R. The code loads necessary libraries, simulates data from Poisson distributions, fits Poisson distributions to simulated and real data, and performs Poisson regression analyses on example datasets (Soay sheep fitness and warpbreaks) from Beckerman et al. 2017 Chapter 7.
Load required libraries
library(MASS)
library(tidyr)
library(ggplot2) # workshop addition
library(dplyr) # workshop addition
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggfortify) # workshop addition
library(ggh4x) # Required for stat_theodensity()
rm(list=ls())
Generate random deviates and estimate λ
# create random deviates from a Poisson distribution with known lambda
set.seed(10)
x.pois <- rpois(80,5)
xvals <- seq(from=min(x.pois), to=max(x.pois))
table(x.pois) # view counts
## x.pois
## 1 2 3 4 5 6 7 8 9
## 4 8 12 17 19 8 8 1 3
# estimate lambda
lambda <- fitdistr(x.pois, "Poisson")$estimate
lambda
## lambda
## 4.475
# use right=FALSE to ensure 0 and 1 values are not pooled in the histogram
# the x values were adjusted to ensure histogram and points are aligned properly
hist(x.pois, freq=FALSE, right=FALSE, breaks=xvals, main="rpois(5000,5) with fitted line")
lines(xvals+0.5, dpois(xvals, lambda), lwd=2, col = 'blue')
points(xvals+0.5, dpois(xvals, lambda), pch=16, col = 'blue')
Alternative visualization using ggplot2
poisson.dat <- data.frame(x = x.pois)
poisson.fit <- data.frame(xvals = xvals, y = dpois(xvals,lambda))
ggplot(data=poisson.dat) +
geom_histogram(aes(x = x.pois,
y = after_stat(density)),
color = 'black', fill="magenta",
binwidth = 1) +
geom_point(data=poisson.fit, aes(x=xvals, y=y))+
geom_line(data=poisson.fit, aes(x=xvals, y=y))+
theme_bw() +
labs(x = 'Number of successes per period',
y = 'Proportion') +
ggtitle('5,000 samples of Pois(lambda = 5)')
Using stat_theodensity
ggplot(data=poisson.dat, aes(x.pois)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "lightblue", color = "black") +
stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
labs(
title = "Poisson Distribution Fit to Data",
x = "Count",
y = "Proportion"
) +
theme_minimal()
To explore how the shape of the Poisson distribution changes as λ changes, we can generate several samples, each from a Poisson distribution with a different rate parameter, and plot them side by side. The ggh4x::stat_theodensity() function is useful here: when a grouping variable is present, it automatically estimates the theoretical distribution for each group and overlays it on the histogram. This is not part of our codding learning, but may help to undestarnd the theory behind… so in this case not learning codding is needed at ggh4x.
# Visualising how the Poisson distribution changes with λ
set.seed(2025)
# Define several λ values and the number of observations per group
lambdas <- c(1, 3, 5, 8)
n_per_lambda <- 100
# Simulate count data for each λ
poisson_samples <- data.frame(
count = unlist(lapply(lambdas, function(l) rpois(n_per_lambda, l))),
lambda = factor(rep(lambdas, each = n_per_lambda))
)
# Plot histograms with overlaid theoretical densities for each λ
ggplot(poisson_samples, aes(x = count, group = lambda)) +
geom_histogram(aes(y = after_stat(density)),
binwidth = 1, color = "black", fill = "lightblue") +
stat_theodensity(aes(distri = "pois"), color = "red", linetype = "dashed") +
facet_wrap(~ lambda, scales = "free") +
theme_bw() +
labs(
title = "Poisson distributions with varying λ",
x = "Count",
y = "Proportion"
)
## Warning in stat_theodensity(aes(distri = "pois"), color = "red", linetype =
## "dashed"): Ignoring unknown aesthetics: distri
As λ increases, the distribution spreads out and becomes more symmetric. The dashed red lines show the fitted Poisson distributions estimated separately for each group by stat_theodensity()
The following example fits a Poisson distribution to limpet counts on 20 × 20 cm quadrats.
# Data are 50 counts of limpets
x <- c(rep(6,4), rep(7,6), rep(8,11), rep(9,16), rep(10,9), rep(11,4))
length(x)
## [1] 50
xvals <- seq(from=0, to=max(x)+3)
lambda <- fitdistr(x, "Poisson")$estimate
hist(x, freq=FALSE, right=FALSE, breaks=xvals, main="Limpet counts with fitted lines")
lines(xvals+0.5, dpois(xvals, lambda), lwd=2, col = 'blue')
points(xvals+0.5, dpois(xvals, lambda), pch=16, col = 'blue')
Visualizing limpet data with ggplot2
poisson.dat <- data.frame(x = x)
poisson.fit <- data.frame(xvals = xvals, y = dpois(xvals,lambda))
ggplot(data=poisson.dat,aes(x=x)) +
geom_histogram(aes(y = after_stat(density)),
color = 'black', fill="lightblue",
binwidth = 1) +
stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
theme_bw() +
labs(x = 'Number of limpets',
y = 'Proportion') +
ggtitle('Limpet counts with fitted lines')
Create a dataset of horse kicks per army corp using tidyr::uncount.
horse1 <- data.frame(x=0:6, corp.year=c(109,65,22,3,1,0,0))
horse2 <- horse1 %>%
uncount(corp.year)
ggplot(data=horse2, aes(x=x)) +
geom_histogram(aes(y = after_stat(density)),
color = 'black', fill="lightblue",
binwidth = 1) +
stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
theme_bw() +
labs(x = 'Number of horse kicks per army corp',
y = 'Proportion') +
ggtitle('Horse kicks')
Fit a simple linear regression and a Poisson regression to the Soay sheep fitness data, assess model diagnostics, and plot fitted values.
# select the data file interactively; replace with a specific path if needed
soay <- read.csv("data/SoaySheepFitness.csv", header=TRUE)
head(soay)
## fitness body.size
## 1 4 6.373546
## 2 3 7.183643
## 3 2 6.164371
## 4 14 8.595281
## 5 5 7.329508
## 6 2 6.179532
# Fit simple linear regression
mod1 <- lm(fitness ~ body.size, data=soay)
autoplot(mod1)
# Fit Poisson regression
mod1 <- glm(fitness ~ body.size, data=soay, family=poisson(link=log))
# Show model diagnostics
autoplot(mod1, smooth.colour=NA)
# Produce Analysis of Deviance table
anova(mod1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: fitness
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 49 85.081
## body.size 1 37.041 48 48.040 1.157e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Produce coefficients & check for overdispersion
summary(mod1)
##
## Call:
## glm(formula = fitness ~ body.size, family = poisson(link = log),
## data = soay)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42203 0.69432 -3.488 0.000486 ***
## body.size 0.54087 0.09316 5.806 6.41e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 85.081 on 49 degrees of freedom
## Residual deviance: 48.040 on 48 degrees of freedom
## AIC: 210.85
##
## Number of Fisher Scoring iterations: 4
dispersion.index <- 48.040 / 48
# Fitness for a 6, 7 and 8 kg lamb
exp(-2.42203 + 0.54087*6)
## [1] 2.277754
exp(-2.42203 + 0.54087*7)
## [1] 3.912044
exp(-2.42203 + 0.54087*8)
## [1] 6.718937
min.size <- min(soay$body.size)
max.size <- max(soay$body.size)
new.x <- expand.grid(body.size=seq(min.size,max.size,length=1000))
new.y <- predict(mod1, newdata=new.x, se.fit=TRUE)
addThese <- data.frame(new.x, new.y)
addThese <- addThese %>% mutate(
fitness=exp(fit),
lwr = exp(fit-1.96*se.fit),
upr = exp(fit+1.96*se.fit))
ggplot(soay, aes(x=body.size, y=fitness))+
geom_point(size=3, alpha=0.5)+
geom_smooth(data=addThese,
aes(ymin=lwr, ymax=upr), stat='identity') + # identity means that y values are provided
theme_bw()
Analyze the warpbreaks dataset with Poisson GLMs, compare models and produce fitted lines.
help(warpbreaks)
## starting httpd help server ... done
data(warpbreaks)
# Fit data to a Poisson distribution
ggplot(data=warpbreaks, aes(x=breaks)) +
geom_histogram(aes(y = after_stat(density)),
color = 'black', fill="lightblue",
binwidth = 1) +
stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
theme_bw() +
labs(x = 'Number of warp breaks',
y = 'Proportion') +
ggtitle('Warp breaks')
head(warpbreaks)
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
Fit glm with wool only
#
mod1 <- glm(data=warpbreaks, breaks ~ wool, family=poisson(link=log))
anova(mod1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: breaks
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 53 297.37
## wool 1 16.039 52 281.33 6.206e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
##
## Call:
## glm(formula = breaks ~ wool, family = poisson(link = log), data = warpbreaks)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.43518 0.03454 99.443 < 2e-16 ***
## woolB -0.20599 0.05157 -3.994 6.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 281.33 on 52 degrees of freedom
## AIC: 560
##
## Number of Fisher Scoring iterations: 4
autoplot(mod1)
Fit glm with wool and tension interaction
# Fit glm with wool and tension interaction
mod1 <- glm(data=warpbreaks, breaks ~ wool * tension, family=poisson(link=log))
anova(mod1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: breaks
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 53 297.37
## wool 1 16.039 52 281.33 6.206e-05 ***
## tension 2 70.942 50 210.39 3.938e-16 ***
## wool:tension 2 28.087 48 182.31 7.962e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
##
## Call:
## glm(formula = breaks ~ wool * tension, family = poisson(link = log),
## data = warpbreaks)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.79674 0.04994 76.030 < 2e-16 ***
## woolB -0.45663 0.08019 -5.694 1.24e-08 ***
## tensionM -0.61868 0.08440 -7.330 2.30e-13 ***
## tensionH -0.59580 0.08378 -7.112 1.15e-12 ***
## woolB:tensionM 0.63818 0.12215 5.224 1.75e-07 ***
## woolB:tensionH 0.18836 0.12990 1.450 0.147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 182.31 on 48 degrees of freedom
## AIC: 468.97
##
## Number of Fisher Scoring iterations: 4
autoplot(mod1)
car::vif(mod1)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## wool 2.417953 1 1.554977
## tension 3.494690 2 1.367263
## wool:tension 6.616585 2 1.603831
new.x <- expand.grid(wool=c("A","B"), tension=c("L","M","H"))
new.y <- predict(mod1, newdata=new.x, se.fit=TRUE)
addThese <- data.frame(new.x, new.y)
addThese <- addThese %>% mutate(
fitness=exp(fit),
lwr = exp(fit-1.96*se.fit),
upr = exp(fit+1.96*se.fit))
# see page 144 in Beckerman et al.
ggplot(addThese, aes(x=tension, y=fitness, group=wool, col=wool))+
geom_point()+
geom_line() +
geom_errorbar(aes(ymin=lwr, ymax=upr), width=0.1)+
scale_color_manual(values=c("black","magenta"))+
xlab("Tension")+
ylab("Breaks")+
theme_bw()
Even after fitting a Poisson GLM, count data may exhibit overdispersion, where the variance exceeds the mean. Overdispersion can lead to underestimated standard errors and inflated Type I error rates.
Checking for overdispersion: After fitting a Poisson model to the warpbreaks data, compute the dispersion index as the residual deviance divided by the residual degrees of freedom. In a well‐fitting Poisson model this ratio should be close to 1 because, under the null hypothesis that the model is correct, the deviance is approximately chi‑square distributed with mean equal to its degrees of freedom. Guidelines vary slightly across sources, but values below about 1.2 are generally considered acceptable. Ratios between 1 and 2 suggest mild overdispersion, while values above 2–4 indicate more substantial departures from the Poisson assumption. When you run the code below, you will see that the dispersion index for the wool × tension model is around 3.8. This means that the residual deviance is almost four times larger than we would expect, so the Poisson model is underestimating the variability in the data. In such cases, statistical tests based on the Poisson model may be anti‑conservative (p‑values too small) and you should consider fitting models that allow for extra‑Poisson variation. Two common remedies are the quasi‑Poisson model, which inflates the standard errors without changing the mean structure, and the negative binomial model, which introduces an extra dispersion parameter and often provides a better fit. Overdispersion can also arise from missing covariates or interaction terms, excess zeros, or a few influential outliers, so it is worth inspecting your data and model specification as well.
# Using the Poisson model with wool and tension
mod_poisson <- glm(breaks ~ wool * tension, family = poisson(link=log), data=warpbreaks)
dispersion_index <- mod_poisson$deviance / mod_poisson$df.residual
dispersion_index
## [1] 3.798024
car::vif(mod_poisson)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## wool 2.417953 1 1.554977
## tension 3.494690 2 1.367263
## wool:tension 6.616585 2 1.603831
The negative binomial model introduces an extra parameter to accommodate overdispersion. Use glm.nb() from the MASS package.
mod_nb <- glm.nb(breaks ~ wool * tension, data=warpbreaks)
summary(mod_nb)
##
## Call:
## glm.nb(formula = breaks ~ wool * tension, data = warpbreaks,
## init.theta = 12.08216462, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7967 0.1081 35.116 < 2e-16 ***
## woolB -0.4566 0.1576 -2.898 0.003753 **
## tensionM -0.6187 0.1597 -3.873 0.000107 ***
## tensionH -0.5958 0.1594 -3.738 0.000186 ***
## woolB:tensionM 0.6382 0.2274 2.807 0.005008 **
## woolB:tensionH 0.1884 0.2316 0.813 0.416123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(12.0822) family taken to be 1)
##
## Null deviance: 86.759 on 53 degrees of freedom
## Residual deviance: 53.506 on 48 degrees of freedom
## AIC: 405.12
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 12.08
## Std. Err.: 3.30
##
## 2 x log-likelihood: -391.125
anova(mod_nb, test="Chisq")
## Warning in anova.negbin(mod_nb, test = "Chisq"): tests made without
## re-estimating 'theta'
## Analysis of Deviance Table
##
## Model: Negative Binomial(12.0822), link: log
##
## Response: breaks
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 53 86.759
## wool 1 4.8269 52 81.932 0.02802 *
## tension 2 20.2194 50 61.712 4.068e-05 ***
## wool:tension 2 8.2060 48 53.506 0.01652 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The quasi-Poisson model keeps the fitted mean the same as the Poisson model but adjusts the standard errors to account for overdispersion. Use the quasipoisson family in glm().
mod_qp <- glm(breaks ~ wool * tension, family = quasipoisson(link=log), data=warpbreaks)
summary(mod_qp)
##
## Call:
## glm(formula = breaks ~ wool * tension, family = quasipoisson(link = log),
## data = warpbreaks)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.79674 0.09688 39.189 < 2e-16 ***
## woolB -0.45663 0.15558 -2.935 0.005105 **
## tensionM -0.61868 0.16374 -3.778 0.000436 ***
## tensionH -0.59580 0.16253 -3.666 0.000616 ***
## woolB:tensionM 0.63818 0.23699 2.693 0.009727 **
## woolB:tensionH 0.18836 0.25201 0.747 0.458436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.76389)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 182.31 on 48 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Compare the dispersion indices and the significance of predictors across models. The model with a dispersion index closer to 1 and a good fit is often preferred for inference.