Over the last three lectures we have built up the machinery of logistic regression: the logit link, odds ratios, multiple predictors, Wald and likelihood ratio tests, interactions, goodness-of-fit, discrimination, and the polytomous and ordinal extensions. Today we put it all together.
This is an applications lecture. We will work through a complete analytic workflow, from research question to publication-ready table, the way you would for a manuscript or a thesis chapter.
Today’s roadmap: 1. A principled model-building strategy 2. Worked example: building a final model 3. Reporting results for publication 4. Common pitfalls and how to avoid them
There is no single “correct” way to build a regression model. But there is a sensible workflow that most epidemiologists follow:
| Step | Phase | Tools |
|---|---|---|
| 1 | Define the research question | DAG, prior literature, study design |
| 2 | Univariable screening | Simple glm() per predictor; tbl_uvregression() |
| 3 | Build the multivariable model | glm() with all candidates; tbl_regression() |
| 4 | Assess confounding | 10% change-in-estimate rule; compare crude vs adjusted |
| 5 | Test interactions |
|
| 6 | Check fit and assumptions | HL test, calibration, ROC/AUC, Cook’s D, VIF |
| 7 | Report and interpret | tbl_regression(); narrative interpretation |
Predictive modeling. Goal: maximize out-of-sample accuracy. Use stepwise selection, LASSO, cross-validation, AUC. The OR is a means to an end.
Etiologic / explanatory modeling. Goal: estimate the unbiased causal effect of an exposure on an outcome. Use a DAG to choose covariates. Confounders are included; mediators and colliders are excluded. The OR for the exposure is the answer.
Most public health research is etiologic. Today we adopt that mindset.
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(ResourceSelection)
library(pROC)
library(performance)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
brfss_logistic <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)Among U.S. adults, is regular physical activity associated with lower odds of frequent mental distress (FMD), after adjusting for demographic, behavioral, and socioeconomic confounders? Does the association differ by sex?
Exposure: exercise (Yes / No)
Outcome: fmd (Yes / No) Candidate
confounders: age, sex, BMI, sleep hours, income category,
smoking status Effect modifier of interest: sex
brfss_logistic |>
tbl_summary(
by = fmd,
include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
label = list(
exercise ~ "Exercise (past 30 days)",
age ~ "Age (years)",
sex ~ "Sex",
bmi ~ "BMI (kg/m²)",
sleep_hrs ~ "Sleep (hours)",
income_cat ~ "Income category",
smoker ~ "Smoking status"
)
) |>
add_p() |>
add_overall() |>
bold_labels() |>
modify_header(label ~ "**Characteristic**") |>
modify_caption("**Table 1.** Sample characteristics by frequent mental distress status.")| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value2 |
|---|---|---|---|---|
| Exercise (past 30 days) | 3,673 (73%) | 3,192 (75%) | 481 (64%) | <0.001 |
| Age (years) | 56 (16) | 57 (16) | 50 (16) | <0.001 |
| Sex | <0.001 | |||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| BMI (kg/m²) | 28.5 (6.3) | 28.4 (6.2) | 29.3 (7.0) | 0.001 |
| Sleep (hours) | 7.00 (1.48) | 7.09 (1.40) | 6.51 (1.83) | <0.001 |
| Income category | <0.001 | |||
| 1 | 228 (4.6%) | 164 (3.9%) | 64 (8.5%) | |
| 2 | 243 (4.9%) | 170 (4.0%) | 73 (9.6%) | |
| 3 | 360 (7.2%) | 279 (6.6%) | 81 (11%) | |
| 4 | 480 (9.6%) | 389 (9.2%) | 91 (12%) | |
| 5 | 542 (11%) | 454 (11%) | 88 (12%) | |
| 6 | 726 (15%) | 625 (15%) | 101 (13%) | |
| 7 | 884 (18%) | 775 (18%) | 109 (14%) | |
| 8 | 1,537 (31%) | 1,387 (33%) | 150 (20%) | |
| Smoking status | <0.001 | |||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| 1 n (%); Mean (SD) | ||||
| 2 Pearson’s Chi-squared test; Wilcoxon rank sum test | ||||
brfss_logistic |>
tbl_uvregression(
method = glm,
y = fmd,
method.args = list(family = binomial),
exponentiate = TRUE,
include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker)
) |>
bold_labels() |>
modify_caption("**Table 2.** Crude (unadjusted) odds ratios for frequent mental distress.")| Characteristic | N | OR | 95% CI | p-value |
|---|---|---|---|---|
| exercise | 5,000 | |||
| No | — | — | ||
| Yes | 0.57 | 0.49, 0.68 | <0.001 | |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 5,000 | 0.97 | 0.97, 0.98 | <0.001 |
| sex | 5,000 | |||
| Male | — | — | ||
| Female | 1.71 | 1.47, 2.00 | <0.001 | |
| bmi | 5,000 | 1.02 | 1.01, 1.03 | <0.001 |
| sleep_hrs | 5,000 | 0.77 | 0.73, 0.81 | <0.001 |
| income_cat | 5,000 | 0.82 | 0.79, 0.85 | <0.001 |
| smoker | 5,000 | |||
| Former/Never | — | — | ||
| Current | 1.96 | 1.68, 2.29 | <0.001 | |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | ||||
mod_full <- glm(
fmd ~ exercise + age + sex + bmi + sleep_hrs + income_cat + smoker,
data = brfss_logistic,
family = binomial
)
mod_full |>
tbl_regression(
exponentiate = TRUE,
label = list(
exercise ~ "Exercise",
age ~ "Age",
sex ~ "Sex",
bmi ~ "BMI",
sleep_hrs ~ "Sleep hours",
income_cat ~ "Income category",
smoker ~ "Smoker"
)
) |>
bold_labels() |>
bold_p() |>
modify_caption("**Table 3.** Adjusted odds ratios for frequent mental distress.")| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise | |||
| No | — | — | |
| Yes | 0.63 | 0.52, 0.75 | <0.001 |
| Age | 0.98 | 0.97, 0.98 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.66 | 1.41, 1.96 | <0.001 |
| BMI | 1.01 | 1.00, 1.02 | 0.083 |
| Sleep hours | 0.82 | 0.77, 0.86 | <0.001 |
| Income category | 0.85 | 0.82, 0.89 | <0.001 |
| Smoker | |||
| Former/Never | — | — | |
| Current | 1.27 | 1.07, 1.51 | 0.007 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
The classic 10% change-in-estimate rule: a covariate is a confounder of the exposure-outcome relationship if including it changes the exposure OR by more than 10%.
mod_crude <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)
crude_or <- exp(coef(mod_crude)["exerciseYes"])
adj_or <- exp(coef(mod_full)["exerciseYes"])
pct_change <- 100 * (crude_or - adj_or) / crude_or
tibble(
Quantity = c("Crude OR (exercise)",
"Adjusted OR (exercise)",
"% change in estimate"),
Value = c(round(crude_or, 3),
round(adj_or, 3),
paste0(round(pct_change, 1), "%"))
) |>
kable(caption = "Change-in-Estimate for Exercise") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Crude OR (exercise) | 0.574 |
| Adjusted OR (exercise) | 0.626 |
| % change in estimate | -9.1% |
Interpretation. A change > 10% suggests meaningful confounding by the included covariates. The adjusted OR is the more reliable estimate.
mod_int <- glm(
fmd ~ exercise * sex + age + bmi + sleep_hrs + income_cat + smoker,
data = brfss_logistic,
family = binomial
)
anova(mod_full, mod_int, test = "LRT") |>
kable(digits = 3, caption = "LR Test: Exercise × Sex Interaction") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 4992 | 3867.216 | NA | NA | NA |
| 4991 | 3867.208 | 1 | 0.008 | 0.93 |
ggpredict(mod_int, terms = c("exercise", "sex")) |>
plot() +
labs(title = "Predicted Probability of FMD by Exercise and Sex",
x = "Exercise", y = "Predicted Probability") +
theme_minimal()If the interaction is meaningful, report ORs within each level of sex:
mod_male <- glm(
fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
data = filter(brfss_logistic, sex == "Male"),
family = binomial
)
mod_female <- glm(
fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
data = filter(brfss_logistic, sex == "Female"),
family = binomial
)
bind_rows(
tidy(mod_male, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "exerciseYes") |> mutate(stratum = "Male"),
tidy(mod_female, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "exerciseYes") |> mutate(stratum = "Female")
) |>
dplyr::select(stratum, estimate, conf.low, conf.high, p.value) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(col.names = c("Sex", "OR (Exercise)", "Lower", "Upper", "p"),
caption = "Stratum-Specific ORs for Exercise") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Sex | OR (Exercise) | Lower | Upper | p |
|---|---|---|---|---|
| Male | 0.637 | 0.491 | 0.831 | 0.000785 |
| Female | 0.614 | 0.482 | 0.783 | 8.06e-05 |
list(
`McFadden R²` = round(performance::r2_mcfadden(mod_full)$R2, 3),
`HL p-value` = round(hoslem.test(
x = as.numeric(brfss_logistic$fmd) - 1,
y = fitted(mod_full), g = 10)$p.value, 3),
`AUC` = round(as.numeric(auc(roc(
brfss_logistic$fmd, fitted(mod_full),
levels = c("No", "Yes"), direction = "<", quiet = TRUE))), 3),
`Max VIF` = round(max(car::vif(mod_full)), 2)
)## $`McFadden R²`
## McFadden's R2
## 0.09
##
## $`HL p-value`
## [1] 0.011
##
## $AUC
## [1] 0.719
##
## $`Max VIF`
## [1] 1.13
tidy(mod_full, exponentiate = TRUE, conf.int = TRUE) |>
filter(term != "(Intercept)") |>
ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
color = "steelblue", linewidth = 0.8) +
scale_x_log10() +
labs(title = "Adjusted Odds Ratios for Frequent Mental Distress",
x = "Adjusted OR (log scale)", y = NULL) +
theme_minimal()In adjusted models, current smokers had X.XX times the odds of frequent mental distress compared with former or never smokers (aOR = X.XX; 95% CI: X.XX, X.XX). Each additional hour of sleep was associated with X% lower odds of FMD (aOR = X.XX; 95% CI: X.XX, X.XX). After adjustment for demographic, behavioral, and socioeconomic covariates, regular exercise remained protective (aOR = X.XX; 95% CI: X.XX, X.XX). The likelihood ratio test for an exercise × sex interaction was not statistically significant (p = X.XX), and stratum-specific ORs were similar in magnitude, suggesting that the protective association does not differ meaningfully by sex.
| Pitfall | Fix |
|---|---|
| Reporting p-values without ORs and CIs | Always show point estimates with 95% CIs |
| Stepwise selection on small samples | Use a DAG; pre-specify covariates |
| Adjusting for mediators | Draw a DAG; do not adjust on the causal pathway |
| Adjusting for colliders | Avoid conditioning on common effects of exposure and outcome |
| Ignoring the rare-events rule (EPV < 10) | Aim for ≥ 10 events per predictor; use penalized methods if not |
| Interpreting OR as relative risk when outcome is common | Use log-binomial or modified Poisson when prevalence > ~10% |
| Forgetting to check linearity in the logit | Restricted cubic splines or LOWESS-of-logit plots |
| Reporting only the p-value of an interaction | Always show stratum-specific estimates |
| Step | Key Action |
|---|---|
| 1. Question | Anchor your model in a clear research question and DAG |
| 2. Screen | Univariable models inform but do not select |
| 3. Build | Include confounders identified a priori |
| 4. Confound | Compare crude vs adjusted; 10% rule of thumb |
| 5. Modify | Plan interactions; report stratum-specific estimates |
| 6. Check | Fit, calibration, discrimination, diagnostics |
| 7. Report | ORs + CIs, narrative interpretation, full transparency |
No lab activity for this lecture — you have practiced this workflow in the previous three labs.