This material corresponds to Week 6 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.

It documents Poisson regression activities using R. The code loads necessary libraries, simulates data from Poisson distributions, fits Poisson distributions to simulated and real data, and performs Poisson regression analyses on example datasets (Soay sheep fitness and warpbreaks) from Beckerman et al. 2017 Chapter 7.

Load required libraries

library(MASS)
library(tidyr)
library(ggplot2) # workshop addition
library(dplyr) # workshop addition
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggfortify) # workshop addition
library(ggh4x) # Required for stat_theodensity()

rm(list=ls())

Fitting a Poisson distribution

Generate random deviates and estimate λ

# create random deviates from a Poisson distribution with known lambda
set.seed(10)
x.pois <- rpois(80,5)  
xvals <- seq(from=min(x.pois), to=max(x.pois))
table(x.pois)  # view counts
## x.pois
##  1  2  3  4  5  6  7  8  9 
##  4  8 12 17 19  8  8  1  3
# estimate lambda
lambda <- fitdistr(x.pois, "Poisson")$estimate
lambda
## lambda 
##  4.475

Generate plots

# use right=FALSE to ensure 0 and 1 values are not pooled in the histogram
# the x values were adjusted to ensure histogram and points are aligned properly
hist(x.pois, freq=FALSE, right=FALSE, breaks=xvals, main="rpois(5000,5) with fitted line")
lines(xvals+0.5, dpois(xvals, lambda), lwd=2, col = 'blue')
points(xvals+0.5, dpois(xvals, lambda), pch=16, col = 'blue')

Alternative visualization using ggplot2

poisson.dat <- data.frame(x = x.pois)
poisson.fit <- data.frame(xvals = xvals, y = dpois(xvals,lambda))

ggplot(data=poisson.dat) +
  geom_histogram(aes(x = x.pois, 
                     y = after_stat(density)), 
                 color = 'black', fill="magenta",
                 binwidth = 1) +
  geom_point(data=poisson.fit, aes(x=xvals, y=y))+
  geom_line(data=poisson.fit, aes(x=xvals, y=y))+
  theme_bw() +
  labs(x = 'Number of successes per period',
       y = 'Proportion') +
  ggtitle('5,000 samples of Pois(lambda = 5)')

Using stat_theodensity

ggplot(data=poisson.dat, aes(x.pois)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "lightblue", color = "black") +
  stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
  labs(
    title = "Poisson Distribution Fit to Data",
    x = "Count",
    y = "Proportion"
  ) +
  theme_minimal()

Visualizing Poisson distributions with different λ values

To explore how the shape of the Poisson distribution changes as λ changes, we can generate several samples, each from a Poisson distribution with a different rate parameter, and plot them side by side. The ggh4x::stat_theodensity() function is useful here: when a grouping variable is present, it automatically estimates the theoretical distribution for each group and overlays it on the histogram. This is not part of our codding learning, but may help to undestarnd the theory behind… so in this case not learning codding is needed at ggh4x.

# Visualising how the Poisson distribution changes with λ
set.seed(2025)

# Define several λ values and the number of observations per group
lambdas <- c(1, 3, 5, 8)
n_per_lambda <- 100

# Simulate count data for each λ
poisson_samples <- data.frame(
  count = unlist(lapply(lambdas, function(l) rpois(n_per_lambda, l))),
  lambda = factor(rep(lambdas, each = n_per_lambda))
)

# Plot histograms with overlaid theoretical densities for each λ
ggplot(poisson_samples, aes(x = count, group = lambda)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 1, color = "black", fill = "lightblue") +
  stat_theodensity(aes(distri = "pois"), color = "red", linetype = "dashed") +
  facet_wrap(~ lambda, scales = "free") +
  theme_bw() +
  labs(
    title = "Poisson distributions with varying λ",
    x = "Count",
    y = "Proportion"
  )
## Warning in stat_theodensity(aes(distri = "pois"), color = "red", linetype =
## "dashed"): Ignoring unknown aesthetics: distri

As λ increases, the distribution spreads out and becomes more symmetric. The dashed red lines show the fitted Poisson distributions estimated separately for each group by stat_theodensity()

Fit a Poisson distribution to real data

The following example fits a Poisson distribution to limpet counts on 20 × 20 cm quadrats.

# Data are 50 counts of limpets
x <- c(rep(6,4), rep(7,6), rep(8,11), rep(9,16), rep(10,9), rep(11,4))
length(x)
## [1] 50
xvals <- seq(from=0, to=max(x)+3)
lambda <- fitdistr(x, "Poisson")$estimate

hist(x, freq=FALSE, right=FALSE, breaks=xvals, main="Limpet counts with fitted lines")
lines(xvals+0.5, dpois(xvals, lambda), lwd=2, col = 'blue')
points(xvals+0.5, dpois(xvals, lambda), pch=16, col = 'blue')

Visualizing limpet data with ggplot2

poisson.dat <- data.frame(x = x)
poisson.fit <- data.frame(xvals = xvals, y = dpois(xvals,lambda))

ggplot(data=poisson.dat,aes(x=x)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 color = 'black', fill="lightblue",
                 binwidth = 1) +
  stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
  theme_bw() +
  labs(x = 'Number of limpets',
       y = 'Proportion') +
  ggtitle('Limpet counts with fitted lines')

Horse death counts dataset

Create a dataset of horse kicks per army corp using tidyr::uncount.

horse1 <- data.frame(x=0:6, corp.year=c(109,65,22,3,1,0,0))
horse2 <- horse1 %>%
  uncount(corp.year)

ggplot(data=horse2, aes(x=x)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 color = 'black', fill="lightblue",
                 binwidth = 1) +
  stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
  theme_bw() +
  labs(x = 'Number of horse kicks per army corp',
       y = 'Proportion') +
  ggtitle('Horse kicks')

Analysis of Soay fitness

Fit a simple linear regression and a Poisson regression to the Soay sheep fitness data, assess model diagnostics, and plot fitted values.

# select the data file interactively; replace with a specific path if needed

soay <- read.csv("data/SoaySheepFitness.csv", header=TRUE)
head(soay)
##   fitness body.size
## 1       4  6.373546
## 2       3  7.183643
## 3       2  6.164371
## 4      14  8.595281
## 5       5  7.329508
## 6       2  6.179532
# Fit simple linear regression
mod1 <- lm(fitness ~ body.size, data=soay)
autoplot(mod1)

# Fit Poisson regression
mod1 <- glm(fitness ~ body.size, data=soay, family=poisson(link=log))

# Show model diagnostics
autoplot(mod1, smooth.colour=NA)

# Produce Analysis of Deviance table
anova(mod1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: fitness
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                         49     85.081              
## body.size  1   37.041        48     48.040 1.157e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Produce coefficients & check for overdispersion
summary(mod1)
## 
## Call:
## glm(formula = fitness ~ body.size, family = poisson(link = log), 
##     data = soay)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.42203    0.69432  -3.488 0.000486 ***
## body.size    0.54087    0.09316   5.806 6.41e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 85.081  on 49  degrees of freedom
## Residual deviance: 48.040  on 48  degrees of freedom
## AIC: 210.85
## 
## Number of Fisher Scoring iterations: 4
dispersion.index <- 48.040 / 48

# Fitness for a 6, 7 and 8 kg lamb
exp(-2.42203 + 0.54087*6)
## [1] 2.277754
exp(-2.42203 + 0.54087*7)
## [1] 3.912044
exp(-2.42203 + 0.54087*8)
## [1] 6.718937

Plotting the fitted curve for Soay fitness

min.size <- min(soay$body.size)
max.size <- max(soay$body.size)

new.x <- expand.grid(body.size=seq(min.size,max.size,length=1000))
new.y <- predict(mod1, newdata=new.x, se.fit=TRUE)

addThese <- data.frame(new.x, new.y)
addThese <- addThese %>% mutate(
  fitness=exp(fit),
  lwr = exp(fit-1.96*se.fit),
  upr = exp(fit+1.96*se.fit))

ggplot(soay, aes(x=body.size, y=fitness))+
  geom_point(size=3, alpha=0.5)+
  geom_smooth(data=addThese,
              aes(ymin=lwr, ymax=upr), stat='identity') +  # identity means that y values are provided
  theme_bw()

Analysis of warpbreaks

Analyze the warpbreaks dataset with Poisson GLMs, compare models and produce fitted lines.

help(warpbreaks)
## starting httpd help server ... done
data(warpbreaks)
# Fit data to a Poisson distribution
ggplot(data=warpbreaks, aes(x=breaks)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 color = 'black', fill="lightblue",
                 binwidth = 1) +
  stat_theodensity(distri = "pois", color = "red", linetype = "dashed") +
  theme_bw() +
  labs(x = 'Number of warp breaks',
       y = 'Proportion') +
  ggtitle('Warp breaks')

head(warpbreaks)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L

Fit glm with wool only

# 
mod1 <- glm(data=warpbreaks, breaks ~ wool, family=poisson(link=log))
anova(mod1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: breaks
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    53     297.37              
## wool  1   16.039        52     281.33 6.206e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
## 
## Call:
## glm(formula = breaks ~ wool, family = poisson(link = log), data = warpbreaks)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.43518    0.03454  99.443  < 2e-16 ***
## woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 281.33  on 52  degrees of freedom
## AIC: 560
## 
## Number of Fisher Scoring iterations: 4
autoplot(mod1)

Fit glm with wool and tension interaction

# Fit glm with wool and tension interaction
mod1 <- glm(data=warpbreaks, breaks ~ wool * tension, family=poisson(link=log))
anova(mod1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: breaks
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            53     297.37              
## wool          1   16.039        52     281.33 6.206e-05 ***
## tension       2   70.942        50     210.39 3.938e-16 ***
## wool:tension  2   28.087        48     182.31 7.962e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
## 
## Call:
## glm(formula = breaks ~ wool * tension, family = poisson(link = log), 
##     data = warpbreaks)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.79674    0.04994  76.030  < 2e-16 ***
## woolB          -0.45663    0.08019  -5.694 1.24e-08 ***
## tensionM       -0.61868    0.08440  -7.330 2.30e-13 ***
## tensionH       -0.59580    0.08378  -7.112 1.15e-12 ***
## woolB:tensionM  0.63818    0.12215   5.224 1.75e-07 ***
## woolB:tensionH  0.18836    0.12990   1.450    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 182.31  on 48  degrees of freedom
## AIC: 468.97
## 
## Number of Fisher Scoring iterations: 4
autoplot(mod1)

car::vif(mod1)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                  GVIF Df GVIF^(1/(2*Df))
## wool         2.417953  1        1.554977
## tension      3.494690  2        1.367263
## wool:tension 6.616585  2        1.603831

Plot fitted values for warpbreaks

new.x <- expand.grid(wool=c("A","B"), tension=c("L","M","H"))
new.y <- predict(mod1, newdata=new.x, se.fit=TRUE)

addThese <- data.frame(new.x, new.y)
addThese <- addThese %>% mutate(
  fitness=exp(fit),
  lwr = exp(fit-1.96*se.fit),
  upr = exp(fit+1.96*se.fit))

# see page 144 in Beckerman et al.

ggplot(addThese, aes(x=tension, y=fitness, group=wool, col=wool))+
  geom_point()+
  geom_line() +
  geom_errorbar(aes(ymin=lwr, ymax=upr), width=0.1)+
  scale_color_manual(values=c("black","magenta"))+ 
  xlab("Tension")+
  ylab("Breaks")+
  theme_bw()

Diagnosing overdispersion and alternative models

Even after fitting a Poisson GLM, count data may exhibit overdispersion, where the variance exceeds the mean. Overdispersion can lead to underestimated standard errors and inflated Type I error rates.

Checking for overdispersion: After fitting a Poisson model to the warpbreaks data, compute the dispersion index as the residual deviance divided by the residual degrees of freedom. In a well‐fitting Poisson model this ratio should be close to 1 because, under the null hypothesis that the model is correct, the deviance is approximately chi‑square distributed with mean equal to its degrees of freedom. Guidelines vary slightly across sources, but values below about 1.2 are generally considered acceptable. Ratios between 1 and 2 suggest mild overdispersion, while values above 2–4 indicate more substantial departures from the Poisson assumption. When you run the code below, you will see that the dispersion index for the wool × tension model is around 3.8. This means that the residual deviance is almost four times larger than we would expect, so the Poisson model is underestimating the variability in the data. In such cases, statistical tests based on the Poisson model may be anti‑conservative (p‑values too small) and you should consider fitting models that allow for extra‑Poisson variation. Two common remedies are the quasi‑Poisson model, which inflates the standard errors without changing the mean structure, and the negative binomial model, which introduces an extra dispersion parameter and often provides a better fit. Overdispersion can also arise from missing covariates or interaction terms, excess zeros, or a few influential outliers, so it is worth inspecting your data and model specification as well.

# Using the Poisson model with wool and tension
mod_poisson <- glm(breaks ~ wool * tension, family = poisson(link=log), data=warpbreaks)
dispersion_index <- mod_poisson$deviance / mod_poisson$df.residual
dispersion_index
## [1] 3.798024
car::vif(mod_poisson)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                  GVIF Df GVIF^(1/(2*Df))
## wool         2.417953  1        1.554977
## tension      3.494690  2        1.367263
## wool:tension 6.616585  2        1.603831

Fitting a negative binomial model

The negative binomial model introduces an extra parameter to accommodate overdispersion. Use glm.nb() from the MASS package.

mod_nb <- glm.nb(breaks ~ wool * tension, data=warpbreaks)
summary(mod_nb)
## 
## Call:
## glm.nb(formula = breaks ~ wool * tension, data = warpbreaks, 
##     init.theta = 12.08216462, link = log)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.7967     0.1081  35.116  < 2e-16 ***
## woolB           -0.4566     0.1576  -2.898 0.003753 ** 
## tensionM        -0.6187     0.1597  -3.873 0.000107 ***
## tensionH        -0.5958     0.1594  -3.738 0.000186 ***
## woolB:tensionM   0.6382     0.2274   2.807 0.005008 ** 
## woolB:tensionH   0.1884     0.2316   0.813 0.416123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(12.0822) family taken to be 1)
## 
##     Null deviance: 86.759  on 53  degrees of freedom
## Residual deviance: 53.506  on 48  degrees of freedom
## AIC: 405.12
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  12.08 
##           Std. Err.:  3.30 
## 
##  2 x log-likelihood:  -391.125
anova(mod_nb, test="Chisq")
## Warning in anova.negbin(mod_nb, test = "Chisq"): tests made without
## re-estimating 'theta'
## Analysis of Deviance Table
## 
## Model: Negative Binomial(12.0822), link: log
## 
## Response: breaks
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            53     86.759              
## wool          1   4.8269        52     81.932   0.02802 *  
## tension       2  20.2194        50     61.712 4.068e-05 ***
## wool:tension  2   8.2060        48     53.506   0.01652 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fitting a quasi-Poisson model

The quasi-Poisson model keeps the fitted mean the same as the Poisson model but adjusts the standard errors to account for overdispersion. Use the quasipoisson family in glm().

mod_qp <- glm(breaks ~ wool * tension, family = quasipoisson(link=log), data=warpbreaks)
summary(mod_qp)
## 
## Call:
## glm(formula = breaks ~ wool * tension, family = quasipoisson(link = log), 
##     data = warpbreaks)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.79674    0.09688  39.189  < 2e-16 ***
## woolB          -0.45663    0.15558  -2.935 0.005105 ** 
## tensionM       -0.61868    0.16374  -3.778 0.000436 ***
## tensionH       -0.59580    0.16253  -3.666 0.000616 ***
## woolB:tensionM  0.63818    0.23699   2.693 0.009727 ** 
## woolB:tensionH  0.18836    0.25201   0.747 0.458436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.76389)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 182.31  on 48  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Compare the dispersion indices and the significance of predictors across models. The model with a dispersion index closer to 1 and a good fit is often preferred for inference.