Introduction

Throughout this course we have modeled continuous outcomes (mentally unhealthy days, BMI, blood pressure) and binary outcomes (frequent mental distress, yes/no). But many epidemiologic outcomes are counts: the number of hospital visits in a year, new infections per county per week, falls per nursing home resident, or mentally unhealthy days in the past 30 days. Counts are non-negative integers (\(0, 1, 2, \ldots\)) and are often right-skewed with many zeros and a long tail. Linear regression is inappropriate for counts because it can predict negative values and assumes constant variance.

Poisson regression is the standard generalized linear model for count outcomes. Like logistic regression, it is a member of the GLM family, but it uses a log link instead of a logit link, and it models the mean count rather than the probability of an event.

In this lecture we will:

  1. Understand why linear regression fails for count outcomes
  2. Learn the Poisson distribution and the Poisson regression model
  3. Interpret coefficients as rate ratios
  4. Use offsets to model rates when exposure time varies
  5. Diagnose overdispersion and apply corrections (quasi-Poisson and negative binomial)
  6. Fit and interpret models using the BRFSS 2020 data

Textbook reference: Kleinbaum & Klein, Logistic Regression: A Self-Learning Text (3rd ed.), Poisson regression chapter


Why/When/Watch-out

Why use Poisson regression?

  • The outcome variable is a count (non-negative integer)
  • We want to estimate how predictors affect the rate of events
  • We need adjusted rate ratios that control for confounders
  • We want to identify risk factors for count-based health outcomes

When to use it?

  • Cohort studies: number of events per person-time at risk
  • Ecological studies: disease counts per geographic area
  • Survey data: number of days with a specific condition
  • Any setting where the response is a discrete count

Watch out for:

  • Overdispersion: the variance exceeding the mean (very common in practice)
  • Excess zeros: more zeros than the Poisson distribution predicts
  • Non-independence: clustered or correlated counts require special methods
  • Large counts: when counts are very large and approximately normal, linear regression may be acceptable

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(MASS)
library(plotly)
library(performance)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the BRFSS 2020 dataset. In the logistic regression lectures, we modeled frequent mental distress (a binary outcome). Today we return to the original count: number of mentally unhealthy days in the past 30 days (menthlth_days). This is a bounded count (0 to 30), right-skewed, and has many zeros, making it a good candidate for Poisson regression.

Research question:

What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days reported per month among U.S. adults?

brfss <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)

glimpse(brfss)
## Rows: 5,000
## Columns: 10
## $ fmd           <fct> No, Yes, No, Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ menthlth_days <dbl> 2, 30, 0, 28, 0, 20, 0, 1, 0, 0, 0, 0, 0, 2, 10, 0, 30, …
## $ physhlth_days <dbl> 0, 0, 14, 21, 0, 4, 0, 0, 30, 0, 5, 0, 0, 0, 0, 0, 0, 0,…
## $ sleep_hrs     <dbl> 8, 6, 6, 7, 8, 8, 8, 8, 3, 8, 6, 7, 6, 6, 8, 7, 9, 7, 6,…
## $ age           <dbl> 63, 67, 63, 76, 38, 50, 75, 33, 58, 33, 61, 29, 34, 77, …
## $ sex           <fct> Male, Male, Female, Male, Male, Female, Female, Male, Ma…
## $ bmi           <dbl> 25.83, 26.63, 30.23, 25.85, 23.06, 31.00, 30.99, 33.89, …
## $ exercise      <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes,…
## $ income_cat    <dbl> 5, 4, 5, 8, 6, 3, 5, 7, 4, 7, 2, 4, 1, 8, 7, 2, 5, 5, 7,…
## $ smoker        <fct> Current, Current, Former/Never, Former/Never, Former/Nev…

Descriptive Overview

brfss |>
  tbl_summary(
    include = c(menthlth_days, exercise, smoker, sex, age, income_cat),
    type = list(
      c(menthlth_days, age, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}); Median: {median}"
    ),
    label = list(
      menthlth_days ~ "Mentally unhealthy days (0-30)",
      exercise      ~ "Exercise in past 30 days",
      smoker        ~ "Smoking status",
      sex           ~ "Sex",
      age           ~ "Age (years)",
      income_cat    ~ "Income category (1-8)"
    )
  ) |>
  bold_labels()
Characteristic N = 5,0001
Mentally unhealthy days (0-30) 5 (9); Median: 0
Exercise in past 30 days 3,673 (73%)
Smoking status
    Former/Never 3,280 (66%)
    Current 1,720 (34%)
Sex
    Male 2,701 (54%)
    Female 2,299 (46%)
Age (years) 56 (16); Median: 59
Income category (1-8) 5.85 (2.11); Median: 6.00
1 Mean (SD); Median: Median; n (%)

Distribution of the Outcome

p <- ggplot(brfss, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Mentally Unhealthy Days (Past 30)",
    subtitle = "BRFSS 2020 — Analytic Sample (n = 5,000)",
    x = "Number of Mentally Unhealthy Days",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p)

Interpretation: Notice the extreme right skew and the large spike at zero. Many respondents report zero mentally unhealthy days, while a smaller group reports many days. This distribution is far from normal, confirming that linear regression would be a poor fit. The mean and variance of this variable will help us assess whether Poisson regression is appropriate or whether we need to account for overdispersion.

tibble(
  Statistic = c("Mean", "Variance", "Variance/Mean Ratio"),
  Value     = c(
    round(mean(brfss$menthlth_days, na.rm = TRUE), 2),
    round(var(brfss$menthlth_days, na.rm = TRUE), 2),
    round(var(brfss$menthlth_days, na.rm = TRUE) /
            mean(brfss$menthlth_days, na.rm = TRUE), 2)
  )
) |>
  kable(caption = "Mean-Variance Relationship for Mentally Unhealthy Days") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Mean-Variance Relationship for Mentally Unhealthy Days
Statistic Value
Mean 4.69
Variance 79.79
Variance/Mean Ratio 17.02

If the variance-to-mean ratio is close to 1, the Poisson assumption is reasonable. If it is much greater than 1, we have overdispersion and will need to consider alternatives. Keep this ratio in mind as we proceed.


Part 1: Why Linear Regression Fails for Count Outcomes

The Problem

What happens if we model a count outcome with ordinary linear regression?

lm_count <- lm(menthlth_days ~ age + sex + exercise + smoker, data = brfss)

augmented <- augment(lm_count, data = brfss)

p <- ggplot(augmented, aes(x = .fitted)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 0, color = "red", linewidth = 1, linetype = "dashed") +
  labs(
    title = "Fitted Values from a Linear Regression on Count Data",
    subtitle = "Red line marks zero — negative predictions are impossible for counts",
    x = "Predicted Mentally Unhealthy Days",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p)

Three problems with using linear regression for count outcomes:

  1. Negative predictions: The model can predict negative counts, which are impossible.

  2. Non-constant variance: For count data, the variance typically increases with the mean. Linear regression assumes constant variance.

  3. Non-normal residuals: Count data, especially with many zeros, produce skewed residuals that violate the normality assumption.

par(mfrow = c(2, 2))
plot(lm_count, which = 1:4)

par(mfrow = c(1, 1))

The residual plots confirm these problems. The residuals-vs-fitted plot shows a clear fan shape (non-constant variance), and the Q-Q plot shows substantial departure from normality. Poisson regression addresses all three issues.


Part 2: The Poisson Distribution

Definition

If \(Y\) is a count, we assume \(Y \sim \text{Poisson}(\mu)\) where:

\[ \Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \]

The parameter \(\mu\) is both the mean and the variance of the distribution: \(E(Y) = \text{Var}(Y) = \mu\). This mean-variance equality is a defining property that we will revisit when we discuss overdispersion.

Visualizing the Poisson Distribution

poisson_df <- tibble(
  y = rep(0:30, 4),
  mu = rep(c(1, 3, 7, 15), each = 31)
) |>
  mutate(
    prob = dpois(y, lambda = mu),
    mu_label = paste0("μ = ", mu)
  )

p <- ggplot(poisson_df, aes(x = y, y = prob, fill = mu_label)) +
  geom_col(alpha = 0.7, position = "identity", width = 0.8) +
  facet_wrap(~ mu_label, scales = "free_y") +
  labs(
    title = "Poisson Distributions with Different Means",
    subtitle = "Notice: as μ increases, the distribution becomes more symmetric and the variance grows",
    x = "Count (y)",
    y = "Probability"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

ggplotly(p)

Key observations:

  • When \(\mu\) is small, the distribution is highly right-skewed with a spike at zero
  • As \(\mu\) increases, the distribution becomes more symmetric and approaches a normal shape
  • The spread (variance) increases with \(\mu\) because \(\text{Var}(Y) = \mu\)

Part 3: The Poisson Regression Model

Model Specification

Poisson regression is a generalized linear model with a log link:

\[ \log(\mu_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \]

Equivalently:

\[ \mu_i = \exp(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi}) \]

The log link guarantees that the predicted mean \(\mu_i\) is always positive, solving the problem of negative predictions.

How It Compares to Other GLMs

Feature Linear Regression Logistic Regression Poisson Regression
Outcome Continuous Binary (0/1) Count (0, 1, 2, …)
Distribution Normal Binomial Poisson
Link function Identity Logit Log
Model \(\mu = X\beta\) \(\log\frac{p}{1-p} = X\beta\) \(\log(\mu) = X\beta\)
Effect measure Mean difference Odds ratio (\(e^\beta\)) Rate ratio (\(e^\beta\))

Interpretation of Coefficients: Rate Ratios

Exponentiating a Poisson regression coefficient gives a rate ratio (RR):

\[ \text{RR} = e^{\beta_1} = \frac{E(Y \mid X_1 = x + 1)}{E(Y \mid X_1 = x)} \]

  • For a binary predictor (e.g., smoker vs. non-smoker): the RR is the ratio of expected counts between the two groups, holding other variables constant
  • For a continuous predictor (e.g., age): the RR is the multiplicative change in the expected count per one-unit increase

Example interpretation: If \(e^{\beta_{\text{smoker}}} = 1.45\), then current smokers are expected to report 1.45 times as many mentally unhealthy days as former/never smokers, holding other variables constant. That is a 45% higher rate.


Part 4: Fitting a Poisson Regression in R

Simple (Unadjusted) Model

Let us start with a single predictor to build intuition.

mod_pois_simple <- glm(
  menthlth_days ~ smoker,
  data    = brfss,
  family  = poisson(link = "log")
)

mod_pois_simple |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Unadjusted Poisson Regression: Smoking and Mentally Unhealthy Days**")
Unadjusted Poisson Regression: Smoking and Mentally Unhealthy Days
Characteristic IRR 95% CI p-value
smoker


    Former/Never
    Current 1.68 1.63, 1.72 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
tidy_simple <- tidy(mod_pois_simple, exponentiate = TRUE, conf.int = TRUE)

rr_smoker <- tidy_simple |> filter(term == "smokerCurrent")

Interpretation: Without adjusting for other variables, current smokers report approximately 1.68 times as many mentally unhealthy days as former/never smokers (RR = 1.68; 95% CI: 1.63, 1.72).

Multiple (Adjusted) Poisson Regression

Now we fit the full model with all predictors.

mod_pois <- glm(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data    = brfss,
  family  = poisson(link = "log")
)

mod_pois |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Adjusted Poisson Regression: Rate Ratios for Mentally Unhealthy Days**")
Adjusted Poisson Regression: Rate Ratios for Mentally Unhealthy Days
Characteristic IRR 95% CI p-value
exercise


    No
    Yes 0.73 0.71, 0.75 <0.001
smoker


    Former/Never
    Current 1.21 1.18, 1.24 <0.001
sex


    Male
    Female 1.52 1.48, 1.56 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 0.98 0.98, 0.98 <0.001
income_cat 0.90 0.89, 0.90 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Model Summary Statistics

glance_pois <- glance(mod_pois)

tibble(
  Metric = c("Null Deviance", "Residual Deviance", "AIC", "Observations"),
  Value  = c(
    round(glance_pois$null.deviance, 1),
    round(glance_pois$deviance, 1),
    round(glance_pois$AIC, 1),
    glance_pois$nobs
  )
) |>
  kable(caption = "Poisson Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Poisson Model Fit Statistics
Metric Value
Null Deviance 63426.3
Residual Deviance 56536.1
AIC 63891.1
Observations 5000.0

Interpretation

Exercise: Holding other variables constant, individuals who exercised in the past 30 days are expected to report fewer mentally unhealthy days compared to those who did not exercise.

Smoking: After adjustment, current smokers report a higher rate of mentally unhealthy days than former/never smokers. The rate ratio and 95% confidence interval quantify the strength and precision of this association.

Income: Higher income categories are associated with fewer mentally unhealthy days, consistent with the well-established socioeconomic gradient in mental health.

Visualizing Predicted Rates

pred_exercise <- ggpredict(mod_pois, terms = "exercise")
pred_smoker   <- ggpredict(mod_pois, terms = "smoker")
pred_income   <- ggpredict(mod_pois, terms = "income_cat [1:8]")

p1 <- plot(pred_exercise) +
  labs(title = "Predicted Mentally Unhealthy Days by Exercise Status",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p2 <- plot(pred_smoker) +
  labs(title = "Predicted Mentally Unhealthy Days by Smoking Status",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p3 <- plot(pred_income) +
  labs(title = "Predicted Mentally Unhealthy Days by Income",
       x = "Income Category (1 = Lowest, 8 = Highest)",
       y = "Predicted Count") +
  theme_minimal(base_size = 13)

p1

p2

p3


Part 5: Offsets — Modeling Rates When Exposure Time Varies

The Concept

In many epidemiologic studies, subjects are observed for different amounts of time. For example, in a cohort study, some subjects are followed for 5 years and others for only 2 years. We do not want to compare raw counts; we want to compare rates (events per unit of person-time).

The Poisson model for rates is:

\[ \log\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots \]

Rearranging:

\[ \log(\mu_i) = \underbrace{\log(t_i)}_{\text{offset}} + \beta_0 + \beta_1 X_{1i} + \cdots \]

The term \(\log(t_i)\) is added to the linear predictor with a fixed coefficient of 1. This is called an offset. It adjusts for differences in exposure time so that the model estimates rates rather than raw counts.

Example with Simulated Data

set.seed(1220)

mortality <- tibble(
  region       = rep(c("Urban", "Rural"), each = 50),
  person_years = round(runif(100, 500, 5000)),
  age_group    = sample(c("Young", "Middle", "Old"), 100, replace = TRUE,
                        prob = c(0.3, 0.4, 0.3)),
  baseline_rate = case_when(
    age_group == "Young"  ~ 0.002,
    age_group == "Middle" ~ 0.005,
    age_group == "Old"    ~ 0.015
  )
) |>
  mutate(
    region_effect = ifelse(region == "Urban", 1.3, 1),
    expected = person_years * baseline_rate * region_effect,
    deaths   = rpois(n(), lambda = expected)
  )

mod_offset <- glm(
  deaths ~ region + age_group + offset(log(person_years)),
  data   = mortality,
  family = poisson(link = "log")
)

mod_offset |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Poisson Model with Offset: Mortality Rates by Region and Age**")
Poisson Model with Offset: Mortality Rates by Region and Age
Characteristic IRR 95% CI p-value
region


    Rural
    Urban 1.27 1.17, 1.39 <0.001
age_group


    Middle
    Old 3.19 2.89, 3.53 <0.001
    Young 0.36 0.30, 0.43 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Without the offset, the model would compare raw death counts, unfairly penalizing regions or groups with more person-time of observation. With the offset, the model compares death rates, which is the scientifically meaningful comparison.

In R Syntax

# With offset: models rates (events per person-time)
glm(deaths ~ exposure + age + offset(log(person_years)),
    family = poisson, data = mortality)

# Without offset: models raw counts (misleading if person-time differs)
glm(deaths ~ exposure + age,
    family = poisson, data = mortality)

Part 6: Overdispersion

What Is Overdispersion?

The Poisson distribution assumes that \(\text{Var}(Y) = E(Y) = \mu\). In practice, count data almost always have more variance than the mean, a phenomenon called overdispersion. Common causes include:

  • Unmeasured heterogeneity: important predictors are missing from the model
  • Excess zeros: more zero counts than the Poisson predicts
  • Clustering: non-independent observations within groups

Why It Matters

If overdispersion is ignored, the model underestimates standard errors, producing p-values that are too small and confidence intervals that are too narrow. You may declare effects “significant” when they are not. This is a serious problem.

Diagnosing Overdispersion

disp_ratio <- mod_pois$deviance / mod_pois$df.residual

tibble(
  Metric = c("Residual Deviance", "Residual df", "Dispersion Ratio (Deviance/df)"),
  Value  = c(
    round(mod_pois$deviance, 1),
    mod_pois$df.residual,
    round(disp_ratio, 2)
  )
) |>
  kable(caption = "Overdispersion Check") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Overdispersion Check
Metric Value
Residual Deviance 56536.10
Residual df 4994.00
Dispersion Ratio (Deviance/df) 11.32

Rule of thumb: If the dispersion ratio (residual deviance / residual df) is much greater than 1, the data are overdispersed. Values between 1 and 1.5 are often tolerable; values above 2 strongly suggest the need for correction.

check_overdispersion(mod_pois)
## # Overdispersion test
## 
##        dispersion ratio =    16.007
##   Pearson's Chi-Squared = 79936.737
##                 p-value =   < 0.001

Fix 1: Quasi-Poisson

The quasi-Poisson model relaxes the mean-variance equality by estimating a dispersion parameter \(\phi\):

\[ \text{Var}(Y) = \phi \cdot \mu \]

If \(\phi > 1\), the data are overdispersed. The quasi-Poisson model inflates the standard errors by \(\sqrt{\phi}\), correcting the confidence intervals and p-values without changing the point estimates.

mod_quasi <- glm(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data    = brfss,
  family  = quasipoisson(link = "log")
)

mod_quasi |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Quasi-Poisson Regression: Corrected Standard Errors**")
Quasi-Poisson Regression: Corrected Standard Errors
Characteristic IRR 95% CI p-value
exercise


    No
    Yes 0.73 0.65, 0.82 <0.001
smoker


    Former/Never
    Current 1.21 1.08, 1.35 <0.001
sex


    Male
    Female 1.52 1.37, 1.68 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 0.98 0.98, 0.98 <0.001
income_cat 0.90 0.88, 0.92 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
se_pois  <- tidy(mod_pois)  |> dplyr::select(term, estimate, std.error) |> rename(SE_Poisson = std.error)
se_quasi <- tidy(mod_quasi) |> dplyr::select(term, std.error) |> rename(SE_QuasiPoisson = std.error)

se_pois |>
  left_join(se_quasi, by = "term") |>
  mutate(
    Inflation_Factor = round(SE_QuasiPoisson / SE_Poisson, 2),
    estimate = round(estimate, 4),
    SE_Poisson = round(SE_Poisson, 4),
    SE_QuasiPoisson = round(SE_QuasiPoisson, 4)
  ) |>
  kable(caption = "Standard Error Comparison: Poisson vs. Quasi-Poisson") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Standard Error Comparison: Poisson vs. Quasi-Poisson
term estimate SE_Poisson SE_QuasiPoisson Inflation_Factor
(Intercept) 3.1920 0.0326 0.1303 4
exerciseYes -0.3144 0.0142 0.0568 4
smokerCurrent 0.1896 0.0140 0.0559 4
sexFemale 0.4163 0.0133 0.0533 4
age -0.0208 0.0004 0.0016 4
income_cat -0.1093 0.0030 0.0120 4

Notice that the point estimates are identical between Poisson and quasi-Poisson. Only the standard errors change. The inflation factor is the same for every coefficient and equals \(\sqrt{\hat\phi}\).

Fix 2: Negative Binomial Regression

The negative binomial model is a more flexible alternative that adds a parameter \(\theta\) to model the extra variation:

\[ \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} \]

As \(\theta \to \infty\), the negative binomial converges to the Poisson. The smaller \(\theta\) is, the more overdispersion is present.

mod_nb <- glm.nb(
  menthlth_days ~ exercise + smoker + sex + age + income_cat,
  data = brfss
)

mod_nb |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Negative Binomial Regression: Rate Ratios for Mentally Unhealthy Days**")
Negative Binomial Regression: Rate Ratios for Mentally Unhealthy Days
Characteristic IRR 95% CI p-value
exercise


    No
    Yes 0.69 0.59, 0.82 <0.001
smoker


    Former/Never
    Current 1.24 1.06, 1.46 0.008
sex


    Male
    Female 1.56 1.35, 1.81 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 0.98 0.97, 0.98 <0.001
income_cat 0.88 0.85, 0.91 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
tibble(
  Metric = c("Theta (θ)", "2 × Log-Likelihood (NB)", "2 × Log-Likelihood (Poisson)", "AIC (NB)", "AIC (Poisson)"),
  Value  = c(
    round(mod_nb$theta, 3),
    round(-2 * logLik(mod_nb), 1),
    round(-2 * logLik(mod_pois), 1),
    round(AIC(mod_nb), 1),
    round(AIC(mod_pois), 1)
  )
) |>
  kable(caption = "Negative Binomial vs. Poisson: Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Negative Binomial vs. Poisson: Model Comparison
Metric Value
Theta (θ) 0.155
2 × Log-Likelihood (NB) 19864.300
2 × Log-Likelihood (Poisson) 63879.100
AIC (NB) 19878.300
AIC (Poisson) 63891.100

A substantially lower AIC for the negative binomial model confirms that it provides a better fit when overdispersion is present.

Comparing All Three Models Side by Side

tbl_merge(
  list(
    tbl_regression(mod_pois,  exponentiate = TRUE) |> modify_header(label ~ "**Variable**"),
    tbl_regression(mod_quasi, exponentiate = TRUE),
    tbl_regression(mod_nb,    exponentiate = TRUE)
  ),
  tab_spanner = c("**Poisson**", "**Quasi-Poisson**", "**Negative Binomial**")
)
Variable
Poisson
Quasi-Poisson
Negative Binomial
IRR 95% CI p-value IRR 95% CI p-value IRR 95% CI p-value
exercise








    No


    Yes 0.73 0.71, 0.75 <0.001 0.73 0.65, 0.82 <0.001 0.69 0.59, 0.82 <0.001
smoker








    Former/Never


    Current 1.21 1.18, 1.24 <0.001 1.21 1.08, 1.35 <0.001 1.24 1.06, 1.46 0.008
sex








    Male


    Female 1.52 1.48, 1.56 <0.001 1.52 1.37, 1.68 <0.001 1.56 1.35, 1.81 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 0.98 0.98, 0.98 <0.001 0.98 0.98, 0.98 <0.001 0.98 0.97, 0.98 <0.001
income_cat 0.90 0.89, 0.90 <0.001 0.90 0.88, 0.92 <0.001 0.88 0.85, 0.91 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Key takeaway: The rate ratio point estimates are very similar across models. The main differences are in the standard errors and confidence intervals. The Poisson model is overconfident (SEs too small). The quasi-Poisson and negative binomial both correct for this, with the negative binomial providing a more formal likelihood-based approach.


Part 7: Choosing Your Count Model

Decision Flowchart

  1. Is the outcome a count? If yes, start with Poisson regression.
  2. Check the dispersion ratio (residual deviance / df). Is it close to 1?
    • Yes (\(\approx\) 1): The Poisson model is appropriate. Report it.
    • No (much > 1): Overdispersion is present. Go to step 3.
  3. Choose a correction:
    • Quasi-Poisson if you want a simple fix that inflates standard errors
    • Negative binomial if you want a full probability model (preferred for formal inference and AIC comparison)
  4. Are there excessive zeros? Consider zero-inflated Poisson or zero-inflated negative binomial (beyond the scope of this course, but good to know about).

When to Use an Offset

Scenario Offset? Example
All subjects observed for the same time No Days of distress in past 30 (everyone has 30 days)
Subjects observed for different times Yes Deaths over variable follow-up in a cohort
Comparing areas with different populations Yes Cancer cases per county (offset = log(population))

Summary

Key Concepts

Concept Key Point
Outcome Counts (non-negative integers)
Distribution Poisson: \(E(Y) = \text{Var}(Y) = \mu\)
Link function Log: \(\log(\mu) = X^T \beta\)
Effect measure Rate ratio = \(e^{\beta}\)
Offsets Use offset(log(person_time)) when exposure time varies
Overdispersion Variance > Mean; check deviance/df ratio
Quasi-Poisson Inflates SEs; same point estimates
Negative binomial Full model for overdispersion; compare via AIC
In R glm(y ~ x, family = poisson), glm.nb()

Connecting to What We Know

From Logistic Regression Poisson Analogue
Binary outcome (0/1) Count outcome (0, 1, 2, …)
Logit link Log link
Odds ratio (\(e^\beta\)) Rate ratio (\(e^\beta\))
Deviance and AIC Deviance and AIC
glm(..., family = binomial) glm(..., family = poisson)

References

  • Kleinbaum, D. G., & Klein, M. (2010). Logistic Regression: A Self-Learning Text (3rd ed.). Springer. (Poisson regression chapter)
  • Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
  • Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley. (Chapter 9: Loglinear models for contingency tables)
  • Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.

No lab activity for this lecture.