Introduction

Over the last three lectures we have built up the machinery of logistic regression: the logit link, odds ratios, multiple predictors, Wald and likelihood ratio tests, interactions, goodness-of-fit, discrimination, and the polytomous and ordinal extensions. Today we put it all together.

This is an applications lecture. We will work through a complete analytic workflow, from research question to publication-ready table, the way you would for a manuscript or a thesis chapter.

Today’s roadmap: 1. A principled model-building strategy 2. Worked example: building a final model 3. Reporting results for publication 4. Common pitfalls and how to avoid them

Part 1: A Model-Building Strategy

There is no single “correct” way to build a regression model. But there is a sensible workflow that most epidemiologists follow:

Seven-Step Logistic Regression Workflow
Step	Phase	Tools
1	Define the research question	DAG, prior literature, study design
2	Univariable screening	Simple glm() per predictor; tbl_uvregression()
3	Build the multivariable model	glm() with all candidates; tbl_regression()
4	Assess confounding	10% change-in-estimate rule; compare crude vs adjusted
5	Test interactions	operator; LR test; ggpredict()
6	Check fit and assumptions	HL test, calibration, ROC/AUC, Cook’s D, VIF
7	Report and interpret	tbl_regression(); narrative interpretation

Two Schools of Thought

Predictive modeling. Goal: maximize out-of-sample accuracy. Use stepwise selection, LASSO, cross-validation, AUC. The OR is a means to an end.

Etiologic / explanatory modeling. Goal: estimate the unbiased causal effect of an exposure on an outcome. Use a DAG to choose covariates. Confounders are included; mediators and colliders are excluded. The OR for the exposure is the answer.

Most public health research is etiologic. Today we adopt that mindset.

Part 2: Worked Example

Setup

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(ResourceSelection)
library(pROC)
library(performance)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)

Research Question

Among U.S. adults, is regular physical activity associated with lower odds of frequent mental distress (FMD), after adjusting for demographic, behavioral, and socioeconomic confounders? Does the association differ by sex?

Exposure: exercise (Yes / No) Outcome: fmd (Yes / No) Candidate confounders: age, sex, BMI, sleep hours, income category, smoking status Effect modifier of interest: sex

Step 1: Descriptive Table (Table 1)

brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    label = list(
      exercise   ~ "Exercise (past 30 days)",
      age        ~ "Age (years)",
      sex        ~ "Sex",
      bmi        ~ "BMI (kg/m²)",
      sleep_hrs  ~ "Sleep (hours)",
      income_cat ~ "Income category",
      smoker     ~ "Smoking status"
    )
  ) |>
  add_p() |>
  add_overall() |>
  bold_labels() |>
  modify_header(label ~ "**Characteristic**") |>
  modify_caption("**Table 1.** Sample characteristics by frequent mental distress status.")

**Table 1.** Sample characteristics by frequent mental distress status.
Characteristic	Overall N = 5,000¹	No N = 4,243¹	Yes N = 757¹	p-value²
Exercise (past 30 days)	3,673 (73%)	3,192 (75%)	481 (64%)	<0.001
Age (years)	56 (16)	57 (16)	50 (16)	<0.001
Sex				<0.001
Male	2,701 (54%)	2,378 (56%)	323 (43%)
Female	2,299 (46%)	1,865 (44%)	434 (57%)
BMI (kg/m²)	28.5 (6.3)	28.4 (6.2)	29.3 (7.0)	0.001
Sleep (hours)	7.00 (1.48)	7.09 (1.40)	6.51 (1.83)	<0.001
Income category				<0.001
1	228 (4.6%)	164 (3.9%)	64 (8.5%)
2	243 (4.9%)	170 (4.0%)	73 (9.6%)
3	360 (7.2%)	279 (6.6%)	81 (11%)
4	480 (9.6%)	389 (9.2%)	91 (12%)
5	542 (11%)	454 (11%)	88 (12%)
6	726 (15%)	625 (15%)	101 (13%)
7	884 (18%)	775 (18%)	109 (14%)
8	1,537 (31%)	1,387 (33%)	150 (20%)
Smoking status				<0.001
Former/Never	3,280 (66%)	2,886 (68%)	394 (52%)
Current	1,720 (34%)	1,357 (32%)	363 (48%)
¹ n (%); Mean (SD)
² Pearson’s Chi-squared test; Wilcoxon rank sum test

Step 2: Univariable Screening

brfss_logistic |>
  tbl_uvregression(
    method = glm,
    y = fmd,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker)
  ) |>
  bold_labels() |>
  modify_caption("**Table 2.** Crude (unadjusted) odds ratios for frequent mental distress.")

**Table 2.** Crude (unadjusted) odds ratios for frequent mental distress.
Characteristic	N	OR	95% CI	p-value
exercise	5,000
No		—	—
Yes		0.57	0.49, 0.68	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	5,000	0.97	0.97, 0.98	<0.001
sex	5,000
Male		—	—
Female		1.71	1.47, 2.00	<0.001
bmi	5,000	1.02	1.01, 1.03	<0.001
sleep_hrs	5,000	0.77	0.73, 0.81	<0.001
income_cat	5,000	0.82	0.79, 0.85	<0.001
smoker	5,000
Former/Never		—	—
Current		1.96	1.68, 2.29	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Step 3: Build the Multivariable Model

mod_full <- glm(
  fmd ~ exercise + age + sex + bmi + sleep_hrs + income_cat + smoker,
  data = brfss_logistic,
  family = binomial
)

mod_full |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise   ~ "Exercise",
      age        ~ "Age",
      sex        ~ "Sex",
      bmi        ~ "BMI",
      sleep_hrs  ~ "Sleep hours",
      income_cat ~ "Income category",
      smoker     ~ "Smoker"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("**Table 3.** Adjusted odds ratios for frequent mental distress.")

**Table 3.** Adjusted odds ratios for frequent mental distress.
Characteristic	OR	95% CI	p-value
Exercise
No	—	—
Yes	0.63	0.52, 0.75	<0.001
Age	0.98	0.97, 0.98	<0.001
Sex
Male	—	—
Female	1.66	1.41, 1.96	<0.001
BMI	1.01	1.00, 1.02	0.083
Sleep hours	0.82	0.77, 0.86	<0.001
Income category	0.85	0.82, 0.89	<0.001
Smoker
Former/Never	—	—
Current	1.27	1.07, 1.51	0.007
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Step 4: Assess Confounding

The classic 10% change-in-estimate rule: a covariate is a confounder of the exposure-outcome relationship if including it changes the exposure OR by more than 10%.

mod_crude <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)

crude_or <- exp(coef(mod_crude)["exerciseYes"])
adj_or   <- exp(coef(mod_full)["exerciseYes"])
pct_change <- 100 * (crude_or - adj_or) / crude_or

tibble(
  Quantity = c("Crude OR (exercise)",
               "Adjusted OR (exercise)",
               "% change in estimate"),
  Value    = c(round(crude_or, 3),
               round(adj_or, 3),
               paste0(round(pct_change, 1), "%"))
) |>
  kable(caption = "Change-in-Estimate for Exercise") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Change-in-Estimate for Exercise
Quantity	Value
Crude OR (exercise)	0.574
Adjusted OR (exercise)	0.626
% change in estimate	-9.1%

Interpretation. A change > 10% suggests meaningful confounding by the included covariates. The adjusted OR is the more reliable estimate.

Step 5: Test for Effect Modification by Sex

mod_int <- glm(
  fmd ~ exercise * sex + age + bmi + sleep_hrs + income_cat + smoker,
  data = brfss_logistic,
  family = binomial
)

anova(mod_full, mod_int, test = "LRT") |>
  kable(digits = 3, caption = "LR Test: Exercise × Sex Interaction") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Exercise × Sex Interaction
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4992	3867.216	NA	NA	NA
4991	3867.208	1	0.008	0.93

ggpredict(mod_int, terms = c("exercise", "sex")) |>
  plot() +
  labs(title = "Predicted Probability of FMD by Exercise and Sex",
       x = "Exercise", y = "Predicted Probability") +
  theme_minimal()

Stratum-Specific Estimates

If the interaction is meaningful, report ORs within each level of sex:

mod_male <- glm(
  fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
  data = filter(brfss_logistic, sex == "Male"),
  family = binomial
)

mod_female <- glm(
  fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
  data = filter(brfss_logistic, sex == "Female"),
  family = binomial
)

bind_rows(
  tidy(mod_male,   exponentiate = TRUE, conf.int = TRUE) |>
    filter(term == "exerciseYes") |> mutate(stratum = "Male"),
  tidy(mod_female, exponentiate = TRUE, conf.int = TRUE) |>
    filter(term == "exerciseYes") |> mutate(stratum = "Female")
) |>
  dplyr::select(stratum, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Sex", "OR (Exercise)", "Lower", "Upper", "p"),
        caption = "Stratum-Specific ORs for Exercise") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Stratum-Specific ORs for Exercise
Sex	OR (Exercise)	Lower	Upper	p
Male	0.637	0.491	0.831	0.000785
Female	0.614	0.482	0.783	8.06e-05

Step 6: Check Fit and Assumptions

list(
  `McFadden R²`   = round(performance::r2_mcfadden(mod_full)$R2, 3),
  `HL p-value`    = round(hoslem.test(
                      x = as.numeric(brfss_logistic$fmd) - 1,
                      y = fitted(mod_full), g = 10)$p.value, 3),
  `AUC`           = round(as.numeric(auc(roc(
                      brfss_logistic$fmd, fitted(mod_full),
                      levels = c("No", "Yes"), direction = "<", quiet = TRUE))), 3),
  `Max VIF`       = round(max(car::vif(mod_full)), 2)
)

## $`McFadden R²`
## McFadden's R2 
##          0.09 
## 
## $`HL p-value`
## [1] 0.011
## 
## $AUC
## [1] 0.719
## 
## $`Max VIF`
## [1] 1.13

Step 7: Forest Plot of Final Model

tidy(mod_full, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  color = "steelblue", linewidth = 0.8) +
  scale_x_log10() +
  labs(title = "Adjusted Odds Ratios for Frequent Mental Distress",
       x = "Adjusted OR (log scale)", y = NULL) +
  theme_minimal()

Part 3: Reporting Results

A Publication-Ready Paragraph

In adjusted models, current smokers had X.XX times the odds of frequent mental distress compared with former or never smokers (aOR = X.XX; 95% CI: X.XX, X.XX). Each additional hour of sleep was associated with X% lower odds of FMD (aOR = X.XX; 95% CI: X.XX, X.XX). After adjustment for demographic, behavioral, and socioeconomic covariates, regular exercise remained protective (aOR = X.XX; 95% CI: X.XX, X.XX). The likelihood ratio test for an exercise × sex interaction was not statistically significant (p = X.XX), and stratum-specific ORs were similar in magnitude, suggesting that the protective association does not differ meaningfully by sex.

What Every Logistic Regression Paper Should Report

Sample size and outcome prevalence
Table 1 stratified by outcome
Crude and adjusted ORs with 95% CIs (not just p-values)
Which covariates were included and why (DAG, prior literature, change-in-estimate)
Tests of effect modification that were planned
Goodness-of-fit and discrimination metrics (HL or calibration plot, AUC)
Software and packages with versions

Part 4: Common Pitfalls

Common Pitfalls in Logistic Regression
Pitfall	Fix
Reporting p-values without ORs and CIs	Always show point estimates with 95% CIs
Stepwise selection on small samples	Use a DAG; pre-specify covariates
Adjusting for mediators	Draw a DAG; do not adjust on the causal pathway
Adjusting for colliders	Avoid conditioning on common effects of exposure and outcome
Ignoring the rare-events rule (EPV < 10)	Aim for ≥ 10 events per predictor; use penalized methods if not
Interpreting OR as relative risk when outcome is common	Use log-binomial or modified Poisson when prevalence > ~10%
Forgetting to check linearity in the logit	Restricted cubic splines or LOWESS-of-logit plots
Reporting only the p-value of an interaction	Always show stratum-specific estimates

Summary

Step	Key Action
1. Question	Anchor your model in a clear research question and DAG
2. Screen	Univariable models inform but do not select
3. Build	Include confounders identified a priori
4. Confound	Compare crude vs adjusted; 10% rule of thumb
5. Modify	Plan interactions; report stratum-specific estimates
6. Check	Fit, calibration, discrimination, diagnostics
7. Report	ORs + CIs, narrative interpretation, full transparency

Next Lecture (April 23)

Poisson regression for count outcomes
Survival analysis: an introduction

References

Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapters 22-23.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
Vittinghoff, E., et al. (2012). Regression Methods in Biostatistics (2nd ed.). Springer.
Greenland, S. (1989). Modeling and variable selection in epidemiologic analysis. American Journal of Public Health, 79(3), 340-349.

No lab activity for this lecture — you have practiced this workflow in the previous three labs.

Logistic Regression Applications

EPI 553 — Principles of Statistical Inference II

Muntasir Masum

April 21, 2026