Introduction

Over the last three lectures we have built up the machinery of logistic regression: the logit link, odds ratios, multiple predictors, Wald and likelihood ratio tests, interactions, goodness-of-fit, discrimination, and the polytomous and ordinal extensions. Today we put it all together.

This is an applications lecture. We will work through a complete analytic workflow, from research question to publication-ready table, the way you would for a manuscript or a thesis chapter.

Today’s roadmap: 1. A principled model-building strategy 2. Worked example: building a final model 3. Reporting results for publication 4. Common pitfalls and how to avoid them


Part 1: A Model-Building Strategy

There is no single “correct” way to build a regression model. But there is a sensible workflow that most epidemiologists follow:

Seven-Step Logistic Regression Workflow
Step Phase Tools
1 Define the research question DAG, prior literature, study design
2 Univariable screening Simple glm() per predictor; tbl_uvregression()
3 Build the multivariable model glm() with all candidates; tbl_regression()
4 Assess confounding 10% change-in-estimate rule; compare crude vs adjusted
5 Test interactions
  • operator; LR test; ggpredict()
6 Check fit and assumptions HL test, calibration, ROC/AUC, Cook’s D, VIF
7 Report and interpret tbl_regression(); narrative interpretation

Two Schools of Thought

Predictive modeling. Goal: maximize out-of-sample accuracy. Use stepwise selection, LASSO, cross-validation, AUC. The OR is a means to an end.

Etiologic / explanatory modeling. Goal: estimate the unbiased causal effect of an exposure on an outcome. Use a DAG to choose covariates. Confounders are included; mediators and colliders are excluded. The OR for the exposure is the answer.

Most public health research is etiologic. Today we adopt that mindset.


Part 2: Worked Example

Setup

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(ResourceSelection)
library(pROC)
library(performance)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Logistic Regression/brfss_logistic_2020.rds"
)

Research Question

Among U.S. adults, is regular physical activity associated with lower odds of frequent mental distress (FMD), after adjusting for demographic, behavioral, and socioeconomic confounders? Does the association differ by sex?

Exposure: exercise (Yes / No) Outcome: fmd (Yes / No) Candidate confounders: age, sex, BMI, sleep hours, income category, smoking status Effect modifier of interest: sex

Step 1: Descriptive Table (Table 1)

brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    label = list(
      exercise   ~ "Exercise (past 30 days)",
      age        ~ "Age (years)",
      sex        ~ "Sex",
      bmi        ~ "BMI (kg/m²)",
      sleep_hrs  ~ "Sleep (hours)",
      income_cat ~ "Income category",
      smoker     ~ "Smoking status"
    )
  ) |>
  add_p() |>
  add_overall() |>
  bold_labels() |>
  modify_header(label ~ "**Characteristic**") |>
  modify_caption("**Table 1.** Sample characteristics by frequent mental distress status.")
Table 1. Sample characteristics by frequent mental distress status.
Characteristic Overall
N = 5,000
1
No
N = 4,243
1
Yes
N = 757
1
p-value2
Exercise (past 30 days) 3,673 (73%) 3,192 (75%) 481 (64%) <0.001
Age (years) 56 (16) 57 (16) 50 (16) <0.001
Sex


<0.001
    Male 2,701 (54%) 2,378 (56%) 323 (43%)
    Female 2,299 (46%) 1,865 (44%) 434 (57%)
BMI (kg/m²) 28.5 (6.3) 28.4 (6.2) 29.3 (7.0) 0.001
Sleep (hours) 7.00 (1.48) 7.09 (1.40) 6.51 (1.83) <0.001
Income category


<0.001
    1 228 (4.6%) 164 (3.9%) 64 (8.5%)
    2 243 (4.9%) 170 (4.0%) 73 (9.6%)
    3 360 (7.2%) 279 (6.6%) 81 (11%)
    4 480 (9.6%) 389 (9.2%) 91 (12%)
    5 542 (11%) 454 (11%) 88 (12%)
    6 726 (15%) 625 (15%) 101 (13%)
    7 884 (18%) 775 (18%) 109 (14%)
    8 1,537 (31%) 1,387 (33%) 150 (20%)
Smoking status


<0.001
    Former/Never 3,280 (66%) 2,886 (68%) 394 (52%)
    Current 1,720 (34%) 1,357 (32%) 363 (48%)
1 n (%); Mean (SD)
2 Pearson’s Chi-squared test; Wilcoxon rank sum test

Step 2: Univariable Screening

brfss_logistic |>
  tbl_uvregression(
    method = glm,
    y = fmd,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    include = c(exercise, age, sex, bmi, sleep_hrs, income_cat, smoker)
  ) |>
  bold_labels() |>
  modify_caption("**Table 2.** Crude (unadjusted) odds ratios for frequent mental distress.")
Table 2. Crude (unadjusted) odds ratios for frequent mental distress.
Characteristic N OR 95% CI p-value
exercise 5,000


    No

    Yes
0.57 0.49, 0.68 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 5,000 0.97 0.97, 0.98 <0.001
sex 5,000


    Male

    Female
1.71 1.47, 2.00 <0.001
bmi 5,000 1.02 1.01, 1.03 <0.001
sleep_hrs 5,000 0.77 0.73, 0.81 <0.001
income_cat 5,000 0.82 0.79, 0.85 <0.001
smoker 5,000


    Former/Never

    Current
1.96 1.68, 2.29 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Step 3: Build the Multivariable Model

mod_full <- glm(
  fmd ~ exercise + age + sex + bmi + sleep_hrs + income_cat + smoker,
  data = brfss_logistic,
  family = binomial
)

mod_full |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise   ~ "Exercise",
      age        ~ "Age",
      sex        ~ "Sex",
      bmi        ~ "BMI",
      sleep_hrs  ~ "Sleep hours",
      income_cat ~ "Income category",
      smoker     ~ "Smoker"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("**Table 3.** Adjusted odds ratios for frequent mental distress.")
Table 3. Adjusted odds ratios for frequent mental distress.
Characteristic OR 95% CI p-value
Exercise


    No
    Yes 0.63 0.52, 0.75 <0.001
Age 0.98 0.97, 0.98 <0.001
Sex


    Male
    Female 1.66 1.41, 1.96 <0.001
BMI 1.01 1.00, 1.02 0.083
Sleep hours 0.82 0.77, 0.86 <0.001
Income category 0.85 0.82, 0.89 <0.001
Smoker


    Former/Never
    Current 1.27 1.07, 1.51 0.007
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Step 4: Assess Confounding

The classic 10% change-in-estimate rule: a covariate is a confounder of the exposure-outcome relationship if including it changes the exposure OR by more than 10%.

mod_crude <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)

crude_or <- exp(coef(mod_crude)["exerciseYes"])
adj_or   <- exp(coef(mod_full)["exerciseYes"])
pct_change <- 100 * (crude_or - adj_or) / crude_or

tibble(
  Quantity = c("Crude OR (exercise)",
               "Adjusted OR (exercise)",
               "% change in estimate"),
  Value    = c(round(crude_or, 3),
               round(adj_or, 3),
               paste0(round(pct_change, 1), "%"))
) |>
  kable(caption = "Change-in-Estimate for Exercise") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Change-in-Estimate for Exercise
Quantity Value
Crude OR (exercise) 0.574
Adjusted OR (exercise) 0.626
% change in estimate -9.1%

Interpretation. A change > 10% suggests meaningful confounding by the included covariates. The adjusted OR is the more reliable estimate.

Step 5: Test for Effect Modification by Sex

mod_int <- glm(
  fmd ~ exercise * sex + age + bmi + sleep_hrs + income_cat + smoker,
  data = brfss_logistic,
  family = binomial
)

anova(mod_full, mod_int, test = "LRT") |>
  kable(digits = 3, caption = "LR Test: Exercise × Sex Interaction") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
LR Test: Exercise × Sex Interaction
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
4992 3867.216 NA NA NA
4991 3867.208 1 0.008 0.93
ggpredict(mod_int, terms = c("exercise", "sex")) |>
  plot() +
  labs(title = "Predicted Probability of FMD by Exercise and Sex",
       x = "Exercise", y = "Predicted Probability") +
  theme_minimal()

Stratum-Specific Estimates

If the interaction is meaningful, report ORs within each level of sex:

mod_male <- glm(
  fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
  data = filter(brfss_logistic, sex == "Male"),
  family = binomial
)

mod_female <- glm(
  fmd ~ exercise + age + bmi + sleep_hrs + income_cat + smoker,
  data = filter(brfss_logistic, sex == "Female"),
  family = binomial
)

bind_rows(
  tidy(mod_male,   exponentiate = TRUE, conf.int = TRUE) |>
    filter(term == "exerciseYes") |> mutate(stratum = "Male"),
  tidy(mod_female, exponentiate = TRUE, conf.int = TRUE) |>
    filter(term == "exerciseYes") |> mutate(stratum = "Female")
) |>
  dplyr::select(stratum, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Sex", "OR (Exercise)", "Lower", "Upper", "p"),
        caption = "Stratum-Specific ORs for Exercise") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Stratum-Specific ORs for Exercise
Sex OR (Exercise) Lower Upper p
Male 0.637 0.491 0.831 0.000785
Female 0.614 0.482 0.783 8.06e-05

Step 6: Check Fit and Assumptions

list(
  `McFadden R²`   = round(performance::r2_mcfadden(mod_full)$R2, 3),
  `HL p-value`    = round(hoslem.test(
                      x = as.numeric(brfss_logistic$fmd) - 1,
                      y = fitted(mod_full), g = 10)$p.value, 3),
  `AUC`           = round(as.numeric(auc(roc(
                      brfss_logistic$fmd, fitted(mod_full),
                      levels = c("No", "Yes"), direction = "<", quiet = TRUE))), 3),
  `Max VIF`       = round(max(car::vif(mod_full)), 2)
)
## $`McFadden R²`
## McFadden's R2 
##          0.09 
## 
## $`HL p-value`
## [1] 0.011
## 
## $AUC
## [1] 0.719
## 
## $`Max VIF`
## [1] 1.13

Step 7: Forest Plot of Final Model

tidy(mod_full, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  color = "steelblue", linewidth = 0.8) +
  scale_x_log10() +
  labs(title = "Adjusted Odds Ratios for Frequent Mental Distress",
       x = "Adjusted OR (log scale)", y = NULL) +
  theme_minimal()


Part 3: Reporting Results

A Publication-Ready Paragraph

In adjusted models, current smokers had X.XX times the odds of frequent mental distress compared with former or never smokers (aOR = X.XX; 95% CI: X.XX, X.XX). Each additional hour of sleep was associated with X% lower odds of FMD (aOR = X.XX; 95% CI: X.XX, X.XX). After adjustment for demographic, behavioral, and socioeconomic covariates, regular exercise remained protective (aOR = X.XX; 95% CI: X.XX, X.XX). The likelihood ratio test for an exercise × sex interaction was not statistically significant (p = X.XX), and stratum-specific ORs were similar in magnitude, suggesting that the protective association does not differ meaningfully by sex.

What Every Logistic Regression Paper Should Report

  1. Sample size and outcome prevalence
  2. Table 1 stratified by outcome
  3. Crude and adjusted ORs with 95% CIs (not just p-values)
  4. Which covariates were included and why (DAG, prior literature, change-in-estimate)
  5. Tests of effect modification that were planned
  6. Goodness-of-fit and discrimination metrics (HL or calibration plot, AUC)
  7. Software and packages with versions

Part 4: Common Pitfalls

Common Pitfalls in Logistic Regression
Pitfall Fix
Reporting p-values without ORs and CIs Always show point estimates with 95% CIs
Stepwise selection on small samples Use a DAG; pre-specify covariates
Adjusting for mediators Draw a DAG; do not adjust on the causal pathway
Adjusting for colliders Avoid conditioning on common effects of exposure and outcome
Ignoring the rare-events rule (EPV < 10) Aim for ≥ 10 events per predictor; use penalized methods if not
Interpreting OR as relative risk when outcome is common Use log-binomial or modified Poisson when prevalence > ~10%
Forgetting to check linearity in the logit Restricted cubic splines or LOWESS-of-logit plots
Reporting only the p-value of an interaction Always show stratum-specific estimates

Summary

Step Key Action
1. Question Anchor your model in a clear research question and DAG
2. Screen Univariable models inform but do not select
3. Build Include confounders identified a priori
4. Confound Compare crude vs adjusted; 10% rule of thumb
5. Modify Plan interactions; report stratum-specific estimates
6. Check Fit, calibration, discrimination, diagnostics
7. Report ORs + CIs, narrative interpretation, full transparency

Next Lecture (April 23)

  • Poisson regression for count outcomes
  • Survival analysis: an introduction

References

  • Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapters 22-23.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Vittinghoff, E., et al. (2012). Regression Methods in Biostatistics (2nd ed.). Springer.
  • Greenland, S. (1989). Modeling and variable selection in epidemiologic analysis. American Journal of Public Health, 79(3), 340-349.

No lab activity for this lecture — you have practiced this workflow in the previous three labs.