Throughout this course we have modeled continuous outcomes (mentally unhealthy days, BMI, blood pressure) and binary outcomes (frequent mental distress, yes/no). But many epidemiologic outcomes are counts: the number of hospital visits in a year, new infections per county per week, falls per nursing home resident, or mentally unhealthy days in the past 30 days. Counts are non-negative integers (\(0, 1, 2, \ldots\)) and are often right-skewed with many zeros and a long tail. Linear regression is inappropriate for counts because it can predict negative values and assumes constant variance.
Poisson regression is the standard generalized linear model for count outcomes. Like logistic regression, it is a member of the GLM family, but it uses a log link instead of a logit link, and it models the mean count rather than the probability of an event.
In this lecture we will:
Textbook reference: Kleinbaum & Klein, Logistic Regression: A Self-Learning Text (3rd ed.), Poisson regression chapter
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(MASS)
library(plotly)
library(performance)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the BRFSS 2020 dataset. In the logistic regression
lectures, we modeled frequent mental distress (a binary
outcome). Today we return to the original count: number of
mentally unhealthy days in the past 30 days
(menthlth_days). This is a bounded count (0 to 30),
right-skewed, and has many zeros, making it a good candidate for Poisson
regression.
Research question:
What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days reported per month among U.S. adults?
brfss <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)
glimpse(brfss)## Rows: 5,000
## Columns: 10
## $ fmd <fct> No, Yes, No, Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ menthlth_days <dbl> 2, 30, 0, 28, 0, 20, 0, 1, 0, 0, 0, 0, 0, 2, 10, 0, 30, …
## $ physhlth_days <dbl> 0, 0, 14, 21, 0, 4, 0, 0, 30, 0, 5, 0, 0, 0, 0, 0, 0, 0,…
## $ sleep_hrs <dbl> 8, 6, 6, 7, 8, 8, 8, 8, 3, 8, 6, 7, 6, 6, 8, 7, 9, 7, 6,…
## $ age <dbl> 63, 67, 63, 76, 38, 50, 75, 33, 58, 33, 61, 29, 34, 77, …
## $ sex <fct> Male, Male, Female, Male, Male, Female, Female, Male, Ma…
## $ bmi <dbl> 25.83, 26.63, 30.23, 25.85, 23.06, 31.00, 30.99, 33.89, …
## $ exercise <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes,…
## $ income_cat <dbl> 5, 4, 5, 8, 6, 3, 5, 7, 4, 7, 2, 4, 1, 8, 7, 2, 5, 5, 7,…
## $ smoker <fct> Current, Current, Former/Never, Former/Never, Former/Nev…
brfss |>
tbl_summary(
include = c(menthlth_days, exercise, smoker, sex, age, income_cat),
type = list(
c(menthlth_days, age, income_cat) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd}); Median: {median}"
),
label = list(
menthlth_days ~ "Mentally unhealthy days (0-30)",
exercise ~ "Exercise in past 30 days",
smoker ~ "Smoking status",
sex ~ "Sex",
age ~ "Age (years)",
income_cat ~ "Income category (1-8)"
)
) |>
bold_labels()| Characteristic | N = 5,0001 |
|---|---|
| Mentally unhealthy days (0-30) | 5 (9); Median: 0 |
| Exercise in past 30 days | 3,673 (73%) |
| Smoking status | |
| Former/Never | 3,280 (66%) |
| Current | 1,720 (34%) |
| Sex | |
| Male | 2,701 (54%) |
| Female | 2,299 (46%) |
| Age (years) | 56 (16); Median: 59 |
| Income category (1-8) | 5.85 (2.11); Median: 6.00 |
| 1 Mean (SD); Median: Median; n (%) | |
p <- ggplot(brfss, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
labs(
title = "Distribution of Mentally Unhealthy Days (Past 30)",
subtitle = "BRFSS 2020 — Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p)Interpretation: Notice the extreme right skew and the large spike at zero. Many respondents report zero mentally unhealthy days, while a smaller group reports many days. This distribution is far from normal, confirming that linear regression would be a poor fit. The mean and variance of this variable will help us assess whether Poisson regression is appropriate or whether we need to account for overdispersion.
tibble(
Statistic = c("Mean", "Variance", "Variance/Mean Ratio"),
Value = c(
round(mean(brfss$menthlth_days, na.rm = TRUE), 2),
round(var(brfss$menthlth_days, na.rm = TRUE), 2),
round(var(brfss$menthlth_days, na.rm = TRUE) /
mean(brfss$menthlth_days, na.rm = TRUE), 2)
)
) |>
kable(caption = "Mean-Variance Relationship for Mentally Unhealthy Days") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Statistic | Value |
|---|---|
| Mean | 4.69 |
| Variance | 79.79 |
| Variance/Mean Ratio | 17.02 |
If the variance-to-mean ratio is close to 1, the Poisson assumption is reasonable. If it is much greater than 1, we have overdispersion and will need to consider alternatives. Keep this ratio in mind as we proceed.
What happens if we model a count outcome with ordinary linear regression?
lm_count <- lm(menthlth_days ~ age + sex + exercise + smoker, data = brfss)
augmented <- augment(lm_count, data = brfss)
p <- ggplot(augmented, aes(x = .fitted)) +
geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white", alpha = 0.8) +
geom_vline(xintercept = 0, color = "red", linewidth = 1, linetype = "dashed") +
labs(
title = "Fitted Values from a Linear Regression on Count Data",
subtitle = "Red line marks zero — negative predictions are impossible for counts",
x = "Predicted Mentally Unhealthy Days",
y = "Frequency"
) +
theme_minimal(base_size = 13)
ggplotly(p)Three problems with using linear regression for count outcomes:
Negative predictions: The model can predict negative counts, which are impossible.
Non-constant variance: For count data, the variance typically increases with the mean. Linear regression assumes constant variance.
Non-normal residuals: Count data, especially with many zeros, produce skewed residuals that violate the normality assumption.
The residual plots confirm these problems. The residuals-vs-fitted plot shows a clear fan shape (non-constant variance), and the Q-Q plot shows substantial departure from normality. Poisson regression addresses all three issues.
If \(Y\) is a count, we assume \(Y \sim \text{Poisson}(\mu)\) where:
\[ \Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \]
The parameter \(\mu\) is both the mean and the variance of the distribution: \(E(Y) = \text{Var}(Y) = \mu\). This mean-variance equality is a defining property that we will revisit when we discuss overdispersion.
poisson_df <- tibble(
y = rep(0:30, 4),
mu = rep(c(1, 3, 7, 15), each = 31)
) |>
mutate(
prob = dpois(y, lambda = mu),
mu_label = paste0("μ = ", mu)
)
p <- ggplot(poisson_df, aes(x = y, y = prob, fill = mu_label)) +
geom_col(alpha = 0.7, position = "identity", width = 0.8) +
facet_wrap(~ mu_label, scales = "free_y") +
labs(
title = "Poisson Distributions with Different Means",
subtitle = "Notice: as μ increases, the distribution becomes more symmetric and the variance grows",
x = "Count (y)",
y = "Probability"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
ggplotly(p)Key observations:
Poisson regression is a generalized linear model with a log link:
\[ \log(\mu_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \]
Equivalently:
\[ \mu_i = \exp(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi}) \]
The log link guarantees that the predicted mean \(\mu_i\) is always positive, solving the problem of negative predictions.
| Feature | Linear Regression | Logistic Regression | Poisson Regression |
|---|---|---|---|
| Outcome | Continuous | Binary (0/1) | Count (0, 1, 2, …) |
| Distribution | Normal | Binomial | Poisson |
| Link function | Identity | Logit | Log |
| Model | \(\mu = X\beta\) | \(\log\frac{p}{1-p} = X\beta\) | \(\log(\mu) = X\beta\) |
| Effect measure | Mean difference | Odds ratio (\(e^\beta\)) | Rate ratio (\(e^\beta\)) |
Exponentiating a Poisson regression coefficient gives a rate ratio (RR):
\[ \text{RR} = e^{\beta_1} = \frac{E(Y \mid X_1 = x + 1)}{E(Y \mid X_1 = x)} \]
Example interpretation: If \(e^{\beta_{\text{smoker}}} = 1.45\), then current smokers are expected to report 1.45 times as many mentally unhealthy days as former/never smokers, holding other variables constant. That is a 45% higher rate.
Let us start with a single predictor to build intuition.
mod_pois_simple <- glm(
menthlth_days ~ smoker,
data = brfss,
family = poisson(link = "log")
)
mod_pois_simple |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Unadjusted Poisson Regression: Smoking and Mentally Unhealthy Days**")| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| smoker | |||
| Former/Never | — | — | |
| Current | 1.68 | 1.63, 1.72 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
tidy_simple <- tidy(mod_pois_simple, exponentiate = TRUE, conf.int = TRUE)
rr_smoker <- tidy_simple |> filter(term == "smokerCurrent")Interpretation: Without adjusting for other variables, current smokers report approximately 1.68 times as many mentally unhealthy days as former/never smokers (RR = 1.68; 95% CI: 1.63, 1.72).
Now we fit the full model with all predictors.
mod_pois <- glm(
menthlth_days ~ exercise + smoker + sex + age + income_cat,
data = brfss,
family = poisson(link = "log")
)
mod_pois |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Adjusted Poisson Regression: Rate Ratios for Mentally Unhealthy Days**")| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| exercise | |||
| No | — | — | |
| Yes | 0.73 | 0.71, 0.75 | <0.001 |
| smoker | |||
| Former/Never | — | — | |
| Current | 1.21 | 1.18, 1.24 | <0.001 |
| sex | |||
| Male | — | — | |
| Female | 1.52 | 1.48, 1.56 | <0.001 |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 0.98 | 0.98, 0.98 | <0.001 |
| income_cat | 0.90 | 0.89, 0.90 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
glance_pois <- glance(mod_pois)
tibble(
Metric = c("Null Deviance", "Residual Deviance", "AIC", "Observations"),
Value = c(
round(glance_pois$null.deviance, 1),
round(glance_pois$deviance, 1),
round(glance_pois$AIC, 1),
glance_pois$nobs
)
) |>
kable(caption = "Poisson Model Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Null Deviance | 63426.3 |
| Residual Deviance | 56536.1 |
| AIC | 63891.1 |
| Observations | 5000.0 |
Exercise: Holding other variables constant, individuals who exercised in the past 30 days are expected to report fewer mentally unhealthy days compared to those who did not exercise.
Smoking: After adjustment, current smokers report a higher rate of mentally unhealthy days than former/never smokers. The rate ratio and 95% confidence interval quantify the strength and precision of this association.
Income: Higher income categories are associated with fewer mentally unhealthy days, consistent with the well-established socioeconomic gradient in mental health.
pred_exercise <- ggpredict(mod_pois, terms = "exercise")
pred_smoker <- ggpredict(mod_pois, terms = "smoker")
pred_income <- ggpredict(mod_pois, terms = "income_cat [1:8]")
p1 <- plot(pred_exercise) +
labs(title = "Predicted Mentally Unhealthy Days by Exercise Status",
y = "Predicted Count") +
theme_minimal(base_size = 13)
p2 <- plot(pred_smoker) +
labs(title = "Predicted Mentally Unhealthy Days by Smoking Status",
y = "Predicted Count") +
theme_minimal(base_size = 13)
p3 <- plot(pred_income) +
labs(title = "Predicted Mentally Unhealthy Days by Income",
x = "Income Category (1 = Lowest, 8 = Highest)",
y = "Predicted Count") +
theme_minimal(base_size = 13)
p1In many epidemiologic studies, subjects are observed for different amounts of time. For example, in a cohort study, some subjects are followed for 5 years and others for only 2 years. We do not want to compare raw counts; we want to compare rates (events per unit of person-time).
The Poisson model for rates is:
\[ \log\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots \]
Rearranging:
\[ \log(\mu_i) = \underbrace{\log(t_i)}_{\text{offset}} + \beta_0 + \beta_1 X_{1i} + \cdots \]
The term \(\log(t_i)\) is added to the linear predictor with a fixed coefficient of 1. This is called an offset. It adjusts for differences in exposure time so that the model estimates rates rather than raw counts.
set.seed(1220)
mortality <- tibble(
region = rep(c("Urban", "Rural"), each = 50),
person_years = round(runif(100, 500, 5000)),
age_group = sample(c("Young", "Middle", "Old"), 100, replace = TRUE,
prob = c(0.3, 0.4, 0.3)),
baseline_rate = case_when(
age_group == "Young" ~ 0.002,
age_group == "Middle" ~ 0.005,
age_group == "Old" ~ 0.015
)
) |>
mutate(
region_effect = ifelse(region == "Urban", 1.3, 1),
expected = person_years * baseline_rate * region_effect,
deaths = rpois(n(), lambda = expected)
)
mod_offset <- glm(
deaths ~ region + age_group + offset(log(person_years)),
data = mortality,
family = poisson(link = "log")
)
mod_offset |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Poisson Model with Offset: Mortality Rates by Region and Age**")| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| region | |||
| Rural | — | — | |
| Urban | 1.27 | 1.17, 1.39 | <0.001 |
| age_group | |||
| Middle | — | — | |
| Old | 3.19 | 2.89, 3.53 | <0.001 |
| Young | 0.36 | 0.30, 0.43 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
Without the offset, the model would compare raw death counts, unfairly penalizing regions or groups with more person-time of observation. With the offset, the model compares death rates, which is the scientifically meaningful comparison.
The Poisson distribution assumes that \(\text{Var}(Y) = E(Y) = \mu\). In practice, count data almost always have more variance than the mean, a phenomenon called overdispersion. Common causes include:
If overdispersion is ignored, the model underestimates standard errors, producing p-values that are too small and confidence intervals that are too narrow. You may declare effects “significant” when they are not. This is a serious problem.
disp_ratio <- mod_pois$deviance / mod_pois$df.residual
tibble(
Metric = c("Residual Deviance", "Residual df", "Dispersion Ratio (Deviance/df)"),
Value = c(
round(mod_pois$deviance, 1),
mod_pois$df.residual,
round(disp_ratio, 2)
)
) |>
kable(caption = "Overdispersion Check") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Residual Deviance | 56536.10 |
| Residual df | 4994.00 |
| Dispersion Ratio (Deviance/df) | 11.32 |
Rule of thumb: If the dispersion ratio (residual deviance / residual df) is much greater than 1, the data are overdispersed. Values between 1 and 1.5 are often tolerable; values above 2 strongly suggest the need for correction.
## # Overdispersion test
##
## dispersion ratio = 16.007
## Pearson's Chi-Squared = 79936.737
## p-value = < 0.001
The quasi-Poisson model relaxes the mean-variance equality by estimating a dispersion parameter \(\phi\):
\[ \text{Var}(Y) = \phi \cdot \mu \]
If \(\phi > 1\), the data are overdispersed. The quasi-Poisson model inflates the standard errors by \(\sqrt{\phi}\), correcting the confidence intervals and p-values without changing the point estimates.
mod_quasi <- glm(
menthlth_days ~ exercise + smoker + sex + age + income_cat,
data = brfss,
family = quasipoisson(link = "log")
)
mod_quasi |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Quasi-Poisson Regression: Corrected Standard Errors**")| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| exercise | |||
| No | — | — | |
| Yes | 0.73 | 0.65, 0.82 | <0.001 |
| smoker | |||
| Former/Never | — | — | |
| Current | 1.21 | 1.08, 1.35 | <0.001 |
| sex | |||
| Male | — | — | |
| Female | 1.52 | 1.37, 1.68 | <0.001 |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 0.98 | 0.98, 0.98 | <0.001 |
| income_cat | 0.90 | 0.88, 0.92 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
se_pois <- tidy(mod_pois) |> dplyr::select(term, estimate, std.error) |> rename(SE_Poisson = std.error)
se_quasi <- tidy(mod_quasi) |> dplyr::select(term, std.error) |> rename(SE_QuasiPoisson = std.error)
se_pois |>
left_join(se_quasi, by = "term") |>
mutate(
Inflation_Factor = round(SE_QuasiPoisson / SE_Poisson, 2),
estimate = round(estimate, 4),
SE_Poisson = round(SE_Poisson, 4),
SE_QuasiPoisson = round(SE_QuasiPoisson, 4)
) |>
kable(caption = "Standard Error Comparison: Poisson vs. Quasi-Poisson") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | SE_Poisson | SE_QuasiPoisson | Inflation_Factor |
|---|---|---|---|---|
| (Intercept) | 3.1920 | 0.0326 | 0.1303 | 4 |
| exerciseYes | -0.3144 | 0.0142 | 0.0568 | 4 |
| smokerCurrent | 0.1896 | 0.0140 | 0.0559 | 4 |
| sexFemale | 0.4163 | 0.0133 | 0.0533 | 4 |
| age | -0.0208 | 0.0004 | 0.0016 | 4 |
| income_cat | -0.1093 | 0.0030 | 0.0120 | 4 |
Notice that the point estimates are identical between Poisson and quasi-Poisson. Only the standard errors change. The inflation factor is the same for every coefficient and equals \(\sqrt{\hat\phi}\).
The negative binomial model is a more flexible alternative that adds a parameter \(\theta\) to model the extra variation:
\[ \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} \]
As \(\theta \to \infty\), the negative binomial converges to the Poisson. The smaller \(\theta\) is, the more overdispersion is present.
mod_nb <- glm.nb(
menthlth_days ~ exercise + smoker + sex + age + income_cat,
data = brfss
)
mod_nb |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Negative Binomial Regression: Rate Ratios for Mentally Unhealthy Days**")| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| exercise | |||
| No | — | — | |
| Yes | 0.69 | 0.59, 0.82 | <0.001 |
| smoker | |||
| Former/Never | — | — | |
| Current | 1.24 | 1.06, 1.46 | 0.008 |
| sex | |||
| Male | — | — | |
| Female | 1.56 | 1.35, 1.81 | <0.001 |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 0.98 | 0.97, 0.98 | <0.001 |
| income_cat | 0.88 | 0.85, 0.91 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
tibble(
Metric = c("Theta (θ)", "2 × Log-Likelihood (NB)", "2 × Log-Likelihood (Poisson)", "AIC (NB)", "AIC (Poisson)"),
Value = c(
round(mod_nb$theta, 3),
round(-2 * logLik(mod_nb), 1),
round(-2 * logLik(mod_pois), 1),
round(AIC(mod_nb), 1),
round(AIC(mod_pois), 1)
)
) |>
kable(caption = "Negative Binomial vs. Poisson: Model Comparison") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Theta (θ) | 0.155 |
| 2 × Log-Likelihood (NB) | 19864.300 |
| 2 × Log-Likelihood (Poisson) | 63879.100 |
| AIC (NB) | 19878.300 |
| AIC (Poisson) | 63891.100 |
A substantially lower AIC for the negative binomial model confirms that it provides a better fit when overdispersion is present.
tbl_merge(
list(
tbl_regression(mod_pois, exponentiate = TRUE) |> modify_header(label ~ "**Variable**"),
tbl_regression(mod_quasi, exponentiate = TRUE),
tbl_regression(mod_nb, exponentiate = TRUE)
),
tab_spanner = c("**Poisson**", "**Quasi-Poisson**", "**Negative Binomial**")
)| Variable |
Poisson
|
Quasi-Poisson
|
Negative Binomial
|
||||||
|---|---|---|---|---|---|---|---|---|---|
| IRR | 95% CI | p-value | IRR | 95% CI | p-value | IRR | 95% CI | p-value | |
| exercise | |||||||||
| No | — | — | — | — | — | — | |||
| Yes | 0.73 | 0.71, 0.75 | <0.001 | 0.73 | 0.65, 0.82 | <0.001 | 0.69 | 0.59, 0.82 | <0.001 |
| smoker | |||||||||
| Former/Never | — | — | — | — | — | — | |||
| Current | 1.21 | 1.18, 1.24 | <0.001 | 1.21 | 1.08, 1.35 | <0.001 | 1.24 | 1.06, 1.46 | 0.008 |
| sex | |||||||||
| Male | — | — | — | — | — | — | |||
| Female | 1.52 | 1.48, 1.56 | <0.001 | 1.52 | 1.37, 1.68 | <0.001 | 1.56 | 1.35, 1.81 | <0.001 |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 0.98 | 0.98, 0.98 | <0.001 | 0.98 | 0.98, 0.98 | <0.001 | 0.98 | 0.97, 0.98 | <0.001 |
| income_cat | 0.90 | 0.89, 0.90 | <0.001 | 0.90 | 0.88, 0.92 | <0.001 | 0.88 | 0.85, 0.91 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||||||||
Key takeaway: The rate ratio point estimates are very similar across models. The main differences are in the standard errors and confidence intervals. The Poisson model is overconfident (SEs too small). The quasi-Poisson and negative binomial both correct for this, with the negative binomial providing a more formal likelihood-based approach.
- Is the outcome a count? If yes, start with Poisson regression.
- Check the dispersion ratio (residual deviance / df). Is it close to 1?
- Yes (\(\approx\) 1): The Poisson model is appropriate. Report it.
- No (much > 1): Overdispersion is present. Go to step 3.
- Choose a correction:
- Quasi-Poisson if you want a simple fix that inflates standard errors
- Negative binomial if you want a full probability model (preferred for formal inference and AIC comparison)
- Are there excessive zeros? Consider zero-inflated Poisson or zero-inflated negative binomial (beyond the scope of this course, but good to know about).
| Scenario | Offset? | Example |
|---|---|---|
| All subjects observed for the same time | No | Days of distress in past 30 (everyone has 30 days) |
| Subjects observed for different times | Yes | Deaths over variable follow-up in a cohort |
| Comparing areas with different populations | Yes | Cancer cases per county (offset = log(population)) |
| Concept | Key Point |
|---|---|
| Outcome | Counts (non-negative integers) |
| Distribution | Poisson: \(E(Y) = \text{Var}(Y) = \mu\) |
| Link function | Log: \(\log(\mu) = X^T \beta\) |
| Effect measure | Rate ratio = \(e^{\beta}\) |
| Offsets | Use offset(log(person_time)) when exposure time
varies |
| Overdispersion | Variance > Mean; check deviance/df ratio |
| Quasi-Poisson | Inflates SEs; same point estimates |
| Negative binomial | Full model for overdispersion; compare via AIC |
| In R | glm(y ~ x, family = poisson),
glm.nb() |
| From Logistic Regression | Poisson Analogue |
|---|---|
| Binary outcome (0/1) | Count outcome (0, 1, 2, …) |
| Logit link | Log link |
| Odds ratio (\(e^\beta\)) | Rate ratio (\(e^\beta\)) |
| Deviance and AIC | Deviance and AIC |
glm(..., family = binomial) |
glm(..., family = poisson) |
No lab activity for this lecture.