Introduction

Throughout this course we have modeled continuous outcomes (mentally unhealthy days, BMI, blood pressure) and binary outcomes (frequent mental distress, yes/no). But many epidemiologic outcomes are counts: the number of hospital visits in a year, new infections per county per week, falls per nursing home resident, or mentally unhealthy days in the past 30 days. Counts are non-negative integers (\(0, 1, 2, \ldots\)) and are often right-skewed with many zeros and a long tail. Linear regression is inappropriate for counts because it can predict negative values and assumes constant variance.

Poisson regression is the standard generalized linear model for count outcomes. Like logistic regression, it is a member of the GLM family, but it uses a log link instead of a logit link, and it models the mean count rather than the probability of an event.

In this lecture we will:

Understand why linear regression fails for count outcomes
Learn the Poisson distribution and the Poisson regression model
Interpret coefficients as rate ratios
Use offsets to model rates when exposure time varies
Diagnose overdispersion and apply corrections (quasi-Poisson and negative binomial)
Fit and interpret models using the BRFSS 2020 data

Textbook reference: Kleinbaum & Klein, Logistic Regression: A Self-Learning Text (3rd ed.), Poisson regression chapter

Why/When/Watch-out

Why use Poisson regression?

The outcome variable is a count (non-negative integer)
We want to estimate how predictors affect the rate of events
We need adjusted rate ratios that control for confounders
We want to identify risk factors for count-based health outcomes

When to use it?

Cohort studies: number of events per person-time at risk
Ecological studies: disease counts per geographic area
Survey data: number of days with a specific condition
Any setting where the response is a discrete count

Watch out for:

Overdispersion: the variance exceeding the mean (very common in practice)
Excess zeros: more zeros than the Poisson distribution predicts
Non-independence: clustered or correlated counts require special methods
Large counts: when counts are very large and approximately normal, linear regression may be acceptable

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(MASS)
library(plotly)
library(performance)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the BRFSS 2020 dataset. In the logistic regression lectures, we modeled frequent mental distress (a binary outcome). Today we return to the original count: number of mentally unhealthy days in the past 30 days (menthlth_days). This is a bounded count (0 to 30), right-skewed, and has many zeros, making it a good candidate for Poisson regression.

Research question:

What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days reported per month among U.S. adults?

brfss <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)

glimpse(brfss)

## Rows: 5,000
## Columns: 10
## $ fmd           <fct> No, Yes, No, Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ menthlth_days <dbl> 2, 30, 0, 28, 0, 20, 0, 1, 0, 0, 0, 0, 0, 2, 10, 0, 30, …
## $ physhlth_days <dbl> 0, 0, 14, 21, 0, 4, 0, 0, 30, 0, 5, 0, 0, 0, 0, 0, 0, 0,…
## $ sleep_hrs     <dbl> 8, 6, 6, 7, 8, 8, 8, 8, 3, 8, 6, 7, 6, 6, 8, 7, 9, 7, 6,…
## $ age           <dbl> 63, 67, 63, 76, 38, 50, 75, 33, 58, 33, 61, 29, 34, 77, …
## $ sex           <fct> Male, Male, Female, Male, Male, Female, Female, Male, Ma…
## $ bmi           <dbl> 25.83, 26.63, 30.23, 25.85, 23.06, 31.00, 30.99, 33.89, …
## $ exercise      <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes,…
## $ income_cat    <dbl> 5, 4, 5, 8, 6, 3, 5, 7, 4, 7, 2, 4, 1, 8, 7, 2, 5, 5, 7,…
## $ smoker        <fct> Current, Current, Former/Never, Former/Never, Former/Nev…

Descriptive Overview

brfss |>
  tbl_summary(
    include = c(menthlth_days, exercise, smoker, sex, age, income_cat),
    type = list(
      c(menthlth_days, age, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}); Median: {median}"
    ),
    label = list(
      menthlth_days ~ "Mentally unhealthy days (0-30)",
      exercise      ~ "Exercise in past 30 days",
      smoker        ~ "Smoking status",
      sex           ~ "Sex",
      age           ~ "Age (years)",
      income_cat    ~ "Income category (1-8)"
    )
  ) |>
  bold_labels()

Characteristic	N = 5,000¹
Mentally unhealthy days (0-30)	5 (9); Median: 0
Exercise in past 30 days	3,673 (73%)
Smoking status
Former/Never	3,280 (66%)
Current	1,720 (34%)
Sex
Male	2,701 (54%)
Female	2,299 (46%)
Age (years)	56 (16); Median: 59
Income category (1-8)	5.85 (2.11); Median: 6.00
¹ Mean (SD); Median: Median; n (%)

Distribution of the Outcome

p <- ggplot(brfss, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Mentally Unhealthy Days (Past 30)",
    subtitle = "BRFSS 2020 — Analytic Sample (n = 5,000)",
    x = "Number of Mentally Unhealthy Days",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p)

Interpretation: Notice the extreme right skew and the large spike at zero. Many respondents report zero mentally unhealthy days, while a smaller group reports many days. This distribution is far from normal, confirming that linear regression would be a poor fit. The mean and variance of this variable will help us assess whether Poisson regression is appropriate or whether we need to account for overdispersion.

tibble(
  Statistic = c("Mean", "Variance", "Variance/Mean Ratio"),
  Value     = c(
    round(mean(brfss$menthlth_days, na.rm = TRUE), 2),
    round(var(brfss$menthlth_days, na.rm = TRUE), 2),
    round(var(brfss$menthlth_days, na.rm = TRUE) /
            mean(brfss$menthlth_days, na.rm = TRUE), 2)
  )
) |>
  kable(caption = "Mean-Variance Relationship for Mentally Unhealthy Days") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Mean-Variance Relationship for Mentally Unhealthy Days
Statistic	Value
Mean	4.69
Variance	79.79
Variance/Mean Ratio	17.02

If the variance-to-mean ratio is close to 1, the Poisson assumption is reasonable. If it is much greater than 1, we have overdispersion and will need to consider alternatives. Keep this ratio in mind as we proceed.

Part 1: Why Linear Regression Fails for Count Outcomes

The Problem

What happens if we model a count outcome with ordinary linear regression?

lm_count <- lm(menthlth_days ~ age + sex + exercise + smoker, data = brfss)

augmented <- augment(lm_count, data = brfss)

p <- ggplot(augmented, aes(x = .fitted)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 0, color = "red", linewidth = 1, linetype = "dashed") +
  labs(
    title = "Fitted Values from a Linear Regression on Count Data",
    subtitle = "Red line marks zero — negative predictions are impossible for counts",
    x = "Predicted Mentally Unhealthy Days",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p)

Three problems with using linear regression for count outcomes:

Negative predictions: The model can predict negative counts, which are impossible.
Non-constant variance: For count data, the variance typically increases with the mean. Linear regression assumes constant variance.
Non-normal residuals: Count data, especially with many zeros, produce skewed residuals that violate the normality assumption.

par(mfrow = c(2, 2))
plot(lm_count, which = 1:4)

par(mfrow = c(1, 1))

The residual plots confirm these problems. The residuals-vs-fitted plot shows a clear fan shape (non-constant variance), and the Q-Q plot shows substantial departure from normality. Poisson regression addresses all three issues.

Part 2: The Poisson Distribution

Definition

If \(Y\) is a count, we assume \(Y \sim \text{Poisson}(\mu)\) where:

\[ \Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \]

The parameter \(\mu\) is both the mean and the variance of the distribution: \(E(Y) = \text{Var}(Y) = \mu\). This mean-variance equality is a defining property that we will revisit when we discuss overdispersion.

Visualizing the Poisson Distribution

poisson_df <- tibble(
  y = rep(0:30, 4),
  mu = rep(c(1, 3, 7, 15), each = 31)
) |>
  mutate(
    prob = dpois(y, lambda = mu),
    mu_label = paste0("μ = ", mu)
  )

p <- ggplot(poisson_df, aes(x = y, y = prob, fill = mu_label)) +
  geom_col(alpha = 0.7, position = "identity", width = 0.8) +
  facet_wrap(~ mu_label, scales = "free_y") +
  labs(
    title = "Poisson Distributions with Different Means",
    subtitle = "Notice: as μ increases, the distribution becomes more symmetric and the variance grows",
    x = "Count (y)",
    y = "Probability"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

ggplotly(p)

Key observations:

When \(\mu\) is small, the distribution is highly right-skewed with a spike at zero
As \(\mu\) increases, the distribution becomes more symmetric and approaches a normal shape
The spread (variance) increases with \(\mu\) because \(\text{Var}(Y) = \mu\)

Part 3: The Poisson Regression Model

Model Specification

Poisson regression is a generalized linear model with a log link:

\[ \log(\mu_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \]

Equivalently:

\[ \mu_i = \exp(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi}) \]

The log link guarantees that the predicted mean \(\mu_i\) is always positive, solving the problem of negative predictions.

How It Compares to Other GLMs

Feature	Linear Regression	Logistic Regression	Poisson Regression
Outcome	Continuous	Binary (0/1)	Count (0, 1, 2, …)
Distribution	Normal	Binomial	Poisson
Link function	Identity	Logit	Log
Model	\(\mu = X\beta\)	\(\log\frac{p}{1-p} = X\beta\)	\(\log(\mu) = X\beta\)
Effect measure	Mean difference	Odds ratio (\(e^\beta\))	Rate ratio (\(e^\beta\))

Interpretation of Coefficients: Rate Ratios

Exponentiating a Poisson regression coefficient gives a rate ratio (RR):

\[ \text{RR} = e^{\beta_1} = \frac{E(Y \mid X_1 = x + 1)}{E(Y \mid X_1 = x)} \]

For a binary predictor (e.g., smoker vs. non-smoker): the RR is the ratio of expected counts between the two groups, holding other variables constant
For a continuous predictor (e.g., age): the RR is the multiplicative change in the expected count per one-unit increase

Example interpretation: If \(e^{\beta_{\text{smoker}}} = 1.45\), then current smokers are expected to report 1.45 times as many mentally unhealthy days as former/never smokers, holding other variables constant. That is a 45% higher rate.

Part 4: Fitting a Poisson Regression in R

Simple (Unadjusted) Model

Let us start with a single predictor to build intuition.

mod_pois_simple <- glm(
  menthlth_days ~ smoker,
  data    = brfss,
  family  = poisson(link = "log")
)

mod_pois_simple |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Unadjusted Poisson Regression: Smoking and Mentally Unhealthy Days**")

**Unadjusted Poisson Regression: Smoking and Mentally Unhealthy Days**
Characteristic	IRR	95% CI	p-value
smoker
Former/Never	—	—
Current	1.68	1.63, 1.72	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

tidy_simple <- tidy(mod_pois_simple, exponentiate = TRUE, conf.int = TRUE)

rr_smoker <- tidy_simple |> filter(term == "smokerCurrent")

Interpretation: Without adjusting for other variables, current smokers report approximately 1.68 times as many mentally unhealthy days as former/never smokers (RR = 1.68; 95% CI: 1.63, 1.72).

Multiple (Adjusted) Poisson Regression

Now we fit the full model with all predictors.

mod_pois <- glm(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data    = brfss,
  family  = poisson(link = "log")
)

mod_pois |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Adjusted Poisson Regression: Rate Ratios for Mentally Unhealthy Days**")

**Adjusted Poisson Regression: Rate Ratios for Mentally Unhealthy Days**
Characteristic	IRR	95% CI	p-value
exercise
No	—	—
Yes	0.73	0.71, 0.75	<0.001
smoker
Former/Never	—	—
Current	1.21	1.18, 1.24	<0.001
sex
Male	—	—
Female	1.52	1.48, 1.56	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.98	0.98, 0.98	<0.001
income_cat	0.90	0.89, 0.90	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Model Summary Statistics

glance_pois <- glance(mod_pois)

tibble(
  Metric = c("Null Deviance", "Residual Deviance", "AIC", "Observations"),
  Value  = c(
    round(glance_pois$null.deviance, 1),
    round(glance_pois$deviance, 1),
    round(glance_pois$AIC, 1),
    glance_pois$nobs
  )
) |>
  kable(caption = "Poisson Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Poisson Model Fit Statistics
Metric	Value
Null Deviance	63426.3
Residual Deviance	56536.1
AIC	63891.1
Observations	5000.0

Interpretation

Exercise: Holding other variables constant, individuals who exercised in the past 30 days are expected to report fewer mentally unhealthy days compared to those who did not exercise.

Smoking: After adjustment, current smokers report a higher rate of mentally unhealthy days than former/never smokers. The rate ratio and 95% confidence interval quantify the strength and precision of this association.

Income: Higher income categories are associated with fewer mentally unhealthy days, consistent with the well-established socioeconomic gradient in mental health.

Visualizing Predicted Rates

pred_exercise <- ggpredict(mod_pois, terms = "exercise")
pred_smoker   <- ggpredict(mod_pois, terms = "smoker")
pred_income   <- ggpredict(mod_pois, terms = "income_cat [1:8]")

p1 <- plot(pred_exercise) +
  labs(title = "Predicted Mentally Unhealthy Days by Exercise Status",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p2 <- plot(pred_smoker) +
  labs(title = "Predicted Mentally Unhealthy Days by Smoking Status",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p3 <- plot(pred_income) +
  labs(title = "Predicted Mentally Unhealthy Days by Income",
       x = "Income Category (1 = Lowest, 8 = Highest)",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p1

p2

p3

Part 5: Offsets — Modeling Rates When Exposure Time Varies

The Concept

In many epidemiologic studies, subjects are observed for different amounts of time. For example, in a cohort study, some subjects are followed for 5 years and others for only 2 years. We do not want to compare raw counts; we want to compare rates (events per unit of person-time).

The Poisson model for rates is:

\[ \log\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots \]

Rearranging:

\[ \log(\mu_i) = \underbrace{\log(t_i)}_{\text{offset}} + \beta_0 + \beta_1 X_{1i} + \cdots \]

The term \(\log(t_i)\) is added to the linear predictor with a fixed coefficient of 1. This is called an offset. It adjusts for differences in exposure time so that the model estimates rates rather than raw counts.

Example with Simulated Data

set.seed(1220)

mortality <- tibble(
  region       = rep(c("Urban", "Rural"), each = 50),
  person_years = round(runif(100, 500, 5000)),
  age_group    = sample(c("Young", "Middle", "Old"), 100, replace = TRUE,
                        prob = c(0.3, 0.4, 0.3)),
  baseline_rate = case_when(
    age_group == "Young"  ~ 0.002,
    age_group == "Middle" ~ 0.005,
    age_group == "Old"    ~ 0.015
  )
) |>
  mutate(
    region_effect = ifelse(region == "Urban", 1.3, 1),
    expected = person_years * baseline_rate * region_effect,
    deaths   = rpois(n(), lambda = expected)
  )

mod_offset <- glm(
  deaths ~ region + age_group + offset(log(person_years)),
  data   = mortality,
  family = poisson(link = "log")
)

mod_offset |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Poisson Model with Offset: Mortality Rates by Region and Age**")

**Poisson Model with Offset: Mortality Rates by Region and Age**
Characteristic	IRR	95% CI	p-value
region
Rural	—	—
Urban	1.27	1.17, 1.39	<0.001
age_group
Middle	—	—
Old	3.19	2.89, 3.53	<0.001
Young	0.36	0.30, 0.43	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Without the offset, the model would compare raw death counts, unfairly penalizing regions or groups with more person-time of observation. With the offset, the model compares death rates, which is the scientifically meaningful comparison.

In R Syntax

# With offset: models rates (events per person-time)
glm(deaths ~ exposure + age + offset(log(person_years)),
    family = poisson, data = mortality)

# Without offset: models raw counts (misleading if person-time differs)
glm(deaths ~ exposure + age,
    family = poisson, data = mortality)

Part 6: Overdispersion

What Is Overdispersion?

The Poisson distribution assumes that \(\text{Var}(Y) = E(Y) = \mu\). In practice, count data almost always have more variance than the mean, a phenomenon called overdispersion. Common causes include:

Unmeasured heterogeneity: important predictors are missing from the model
Excess zeros: more zero counts than the Poisson predicts
Clustering: non-independent observations within groups

Why It Matters

If overdispersion is ignored, the model underestimates standard errors, producing p-values that are too small and confidence intervals that are too narrow. You may declare effects “significant” when they are not. This is a serious problem.

Diagnosing Overdispersion

disp_ratio <- mod_pois$deviance / mod_pois$df.residual

tibble(
  Metric = c("Residual Deviance", "Residual df", "Dispersion Ratio (Deviance/df)"),
  Value  = c(
    round(mod_pois$deviance, 1),
    mod_pois$df.residual,
    round(disp_ratio, 2)
  )
) |>
  kable(caption = "Overdispersion Check") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Overdispersion Check
Metric	Value
Residual Deviance	56536.10
Residual df	4994.00
Dispersion Ratio (Deviance/df)	11.32

Rule of thumb: If the dispersion ratio (residual deviance / residual df) is much greater than 1, the data are overdispersed. Values between 1 and 1.5 are often tolerable; values above 2 strongly suggest the need for correction.

check_overdispersion(mod_pois)

## # Overdispersion test
## 
##        dispersion ratio =    16.007
##   Pearson's Chi-Squared = 79936.737
##                 p-value =   < 0.001

Fix 1: Quasi-Poisson

The quasi-Poisson model relaxes the mean-variance equality by estimating a dispersion parameter \(\phi\):

\[ \text{Var}(Y) = \phi \cdot \mu \]

If \(\phi > 1\), the data are overdispersed. The quasi-Poisson model inflates the standard errors by \(\sqrt{\phi}\), correcting the confidence intervals and p-values without changing the point estimates.

mod_quasi <- glm(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data    = brfss,
  family  = quasipoisson(link = "log")
)

mod_quasi |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Quasi-Poisson Regression: Corrected Standard Errors**")

**Quasi-Poisson Regression: Corrected Standard Errors**
Characteristic	IRR	95% CI	p-value
exercise
No	—	—
Yes	0.73	0.65, 0.82	<0.001
smoker
Former/Never	—	—
Current	1.21	1.08, 1.35	<0.001
sex
Male	—	—
Female	1.52	1.37, 1.68	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.98	0.98, 0.98	<0.001
income_cat	0.90	0.88, 0.92	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

se_pois  <- tidy(mod_pois)  |> dplyr::select(term, estimate, std.error) |> rename(SE_Poisson = std.error)
se_quasi <- tidy(mod_quasi) |> dplyr::select(term, std.error) |> rename(SE_QuasiPoisson = std.error)

se_pois |>
  left_join(se_quasi, by = "term") |>
  mutate(
    Inflation_Factor = round(SE_QuasiPoisson / SE_Poisson, 2),
    estimate = round(estimate, 4),
    SE_Poisson = round(SE_Poisson, 4),
    SE_QuasiPoisson = round(SE_QuasiPoisson, 4)
  ) |>
  kable(caption = "Standard Error Comparison: Poisson vs. Quasi-Poisson") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Standard Error Comparison: Poisson vs. Quasi-Poisson
term	estimate	SE_Poisson	SE_QuasiPoisson	Inflation_Factor
(Intercept)	3.1920	0.0326	0.1303	4
exerciseYes	-0.3144	0.0142	0.0568	4
smokerCurrent	0.1896	0.0140	0.0559	4
sexFemale	0.4163	0.0133	0.0533	4
age	-0.0208	0.0004	0.0016	4
income_cat	-0.1093	0.0030	0.0120	4

Notice that the point estimates are identical between Poisson and quasi-Poisson. Only the standard errors change. The inflation factor is the same for every coefficient and equals \(\sqrt{\hat\phi}\).

Fix 2: Negative Binomial Regression

The negative binomial model is a more flexible alternative that adds a parameter \(\theta\) to model the extra variation:

\[ \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} \]

As \(\theta \to \infty\), the negative binomial converges to the Poisson. The smaller \(\theta\) is, the more overdispersion is present.

mod_nb <- glm.nb(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data = brfss
)

mod_nb |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Negative Binomial Regression: Rate Ratios for Mentally Unhealthy Days**")

**Negative Binomial Regression: Rate Ratios for Mentally Unhealthy Days**
Characteristic	IRR	95% CI	p-value
exercise
No	—	—
Yes	0.69	0.59, 0.82	<0.001
smoker
Former/Never	—	—
Current	1.24	1.06, 1.46	0.008
sex
Male	—	—
Female	1.56	1.35, 1.81	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.98	0.97, 0.98	<0.001
income_cat	0.88	0.85, 0.91	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

tibble(
  Metric = c("Theta (θ)", "2 × Log-Likelihood (NB)", "2 × Log-Likelihood (Poisson)", "AIC (NB)", "AIC (Poisson)"),
  Value  = c(
    round(mod_nb$theta, 3),
    round(-2 * logLik(mod_nb), 1),
    round(-2 * logLik(mod_pois), 1),
    round(AIC(mod_nb), 1),
    round(AIC(mod_pois), 1)
  )
) |>
  kable(caption = "Negative Binomial vs. Poisson: Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Negative Binomial vs. Poisson: Model Comparison
Metric	Value
Theta (θ)	0.155
2 × Log-Likelihood (NB)	19864.300
2 × Log-Likelihood (Poisson)	63879.100
AIC (NB)	19878.300
AIC (Poisson)	63891.100

A substantially lower AIC for the negative binomial model confirms that it provides a better fit when overdispersion is present.

Comparing All Three Models Side by Side

tbl_merge(
  list(
    tbl_regression(mod_pois,  exponentiate = TRUE) |> modify_header(label ~ "**Variable**"),
    tbl_regression(mod_quasi, exponentiate = TRUE),
    tbl_regression(mod_nb,    exponentiate = TRUE)
  ),
  tab_spanner = c("**Poisson**", "**Quasi-Poisson**", "**Negative Binomial**")
)

Variable	Poisson			Quasi-Poisson			Negative Binomial
Variable	IRR	95% CI	p-value	IRR	95% CI	p-value	IRR	95% CI	p-value
exercise
No	—	—		—	—		—	—
Yes	0.73	0.71, 0.75	<0.001	0.73	0.65, 0.82	<0.001	0.69	0.59, 0.82	<0.001
smoker
Former/Never	—	—		—	—		—	—
Current	1.21	1.18, 1.24	<0.001	1.21	1.08, 1.35	<0.001	1.24	1.06, 1.46	0.008
sex
Male	—	—		—	—		—	—
Female	1.52	1.48, 1.56	<0.001	1.52	1.37, 1.68	<0.001	1.56	1.35, 1.81	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.98	0.98, 0.98	<0.001	0.98	0.98, 0.98	<0.001	0.98	0.97, 0.98	<0.001
income_cat	0.90	0.89, 0.90	<0.001	0.90	0.88, 0.92	<0.001	0.88	0.85, 0.91	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Key takeaway: The rate ratio point estimates are very similar across models. The main differences are in the standard errors and confidence intervals. The Poisson model is overconfident (SEs too small). The quasi-Poisson and negative binomial both correct for this, with the negative binomial providing a more formal likelihood-based approach.

Part 7: Choosing Your Count Model

Decision Flowchart

Is the outcome a count? If yes, start with Poisson regression.

Check the dispersion ratio (residual deviance / df). Is it close to 1?

Yes (\(\approx\) 1): The Poisson model is appropriate. Report it.

No (much > 1): Overdispersion is present. Go to step 3.

Choose a correction:

Quasi-Poisson if you want a simple fix that inflates standard errors

Negative binomial if you want a full probability model (preferred for formal inference and AIC comparison)

Are there excessive zeros? Consider zero-inflated Poisson or zero-inflated negative binomial (beyond the scope of this course, but good to know about).

When to Use an Offset

Scenario	Offset?	Example
All subjects observed for the same time	No	Days of distress in past 30 (everyone has 30 days)
Subjects observed for different times	Yes	Deaths over variable follow-up in a cohort
Comparing areas with different populations	Yes	Cancer cases per county (offset = log(population))

Summary

Key Concepts

Concept	Key Point
Outcome	Counts (non-negative integers)
Distribution	Poisson: \(E(Y) = \text{Var}(Y) = \mu\)
Link function	Log: \(\log(\mu) = X^T \beta\)
Effect measure	Rate ratio = \(e^{\beta}\)
Offsets	Use `offset(log(person_time))` when exposure time varies
Overdispersion	Variance > Mean; check deviance/df ratio
Quasi-Poisson	Inflates SEs; same point estimates
Negative binomial	Full model for overdispersion; compare via AIC
In R	`glm(y ~ x, family = poisson)`, `glm.nb()`

Connecting to What We Know

From Logistic Regression	Poisson Analogue
Binary outcome (0/1)	Count outcome (0, 1, 2, …)
Logit link	Log link
Odds ratio (\(e^\beta\))	Rate ratio (\(e^\beta\))
Deviance and AIC	Deviance and AIC
`glm(..., family = binomial)`	`glm(..., family = poisson)`

References

Kleinbaum, D. G., & Klein, M. (2010). Logistic Regression: A Self-Learning Text (3rd ed.). Springer. (Poisson regression chapter)
Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley. (Chapter 9: Loglinear models for contingency tables)
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.

No lab activity for this lecture.

Poisson Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

April 21, 2026