Introduction

Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.
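A minimal sketch of the out-of-range problem, using simulated data rather than the BRFSS sample. The names `x` and `y`, the seed, and the true slope of 2 are illustrative assumptions. A linear model fit to a 0/1 outcome produces fitted values below 0 and above 1, while the logistic fit stays inside the unit interval.

```r
# Simulated binary outcome whose true log-odds rise linearly with x (assumed values)
set.seed(553)
x <- seq(-4, 4, length.out = 500)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Linear probability model versus logistic regression on the same data
linear_fit   <- lm(y ~ x)
logistic_fit <- glm(y ~ x, family = binomial(link = "logit"))

linear_pred   <- fitted(linear_fit)    # not constrained to [0, 1]
logistic_pred <- fitted(logistic_fit)  # always strictly between 0 and 1

range(linear_pred)
range(logistic_pred)
```

Comparing the two ranges shows why the linear fit cannot be read as a probability at the extremes of the predictor.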

Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:

Understand why linear regression fails for binary outcomes
Learn the logistic model and the logit transformation
Connect logistic regression to the odds ratio, the primary measure of effect in epidemiology
Fit and interpret a simple logistic regression model in R
Extend to multiple logistic regression (introduced today, continued next lecture)

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)

Why/When/Watch-out

Why use logistic regression?

The outcome variable is binary (0/1, yes/no, disease/no disease)
We want to estimate the probability of an event occurring
We need adjusted odds ratios that control for confounders
We want to identify risk factors for a dichotomous health outcome

When to use it?

Cross-sectional studies: prevalence of a condition
Case-control studies: odds of exposure given disease status
Cohort studies: incidence of disease (when outcome is rare, OR approximates RR)

Watch out for:

Separation: when a predictor perfectly predicts the outcome, maximum likelihood (ML) estimation fails because the coefficient estimates do not converge to finite values
Small sample sizes: logistic regression needs roughly 10-20 events per predictor
Multicollinearity: same issues as in linear regression
Linearity in the logit: the relationship between continuous predictors and the log-odds must be linear
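The events-per-predictor rule of thumb can be turned into a quick check. The sketch below uses the FMD counts reported later in this lab (757 with FMD, 4,243 without) and assumes the common convention that the smaller of the two outcome groups is the limiting count; the variable names are illustrative.

```r
# Outcome counts taken from the lab's analytic sample
events     <- 757   # FMD = Yes
non_events <- 4243  # FMD = No

# The less frequent outcome category limits how many predictors are supportable
limiting_count <- min(events, non_events)

# Approximate ceiling on predictors at 10 and at 20 events per predictor
max_predictors <- floor(limiting_count / c(10, 20))
max_predictors
```

By this rough guide, the models fit in this lab (one to three predictors) are comfortably within the limit.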

Setup and Data

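# Data import and wrangling (tidyverse already attaches dplyr)
library(tidyverse)
library(haven)
library(janitor)
# Tables and model summaries
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
# Regression helpers, predicted values, and plots
library(car)
library(ggeffects)
library(plotly)
# Compact gtsummary tables; theme_gtsummary_compact() applies the theme itself
options(gtsummary.use_ftExtra = TRUE)
theme_gtsummary_compact()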

Part 7: In-Class Lab Activity

EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0-30)	Numeric
`physhlth_days`	Physically unhealthy days (0-30)	Numeric
`sleep_hrs`	Hours of sleep per night	Numeric
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

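# readRDS() reads R's native .rds format, so it is pointed at the .rds file named in
# the instructions; the relative path assumes the file sits in the project directory.
brfss_logistic <- readRDS("brfss_logistic_2020.rds")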

Task 1: Explore the Binary Outcome (15 points)

1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.

tibble(Metric = c("N", "FMD Cases", "FMD Prevalence"),
       Value  = c(nrow(brfss_logistic), 
                  sum(brfss_logistic$fmd == "Yes"),
                  paste0(round(100 * mean(brfss_logistic$fmd == "Yes"), 1), "%"))) |>
  kable(caption = "Analytic Dataset Overview") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Overview
Metric	Value
N	5000
FMD Cases	757
FMD Prevalence	15.1%
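Task 1a asks for counts and percentages both with and without FMD. The base R sketch below shows one way to lay that out. It rebuilds the outcome vector from the group sizes reported in this lab (4,243 No, 757 Yes) instead of reading the data file, and the object names are illustrative. With the real data, the same `table()` call on `brfss_logistic$fmd` would apply, and `janitor::tabyl()` offers a similar one-line summary.

```r
# Rebuild the outcome from the reported group sizes (stand-in for brfss_logistic$fmd)
fmd <- factor(rep(c("No", "Yes"), times = c(4243, 757)), levels = c("No", "Yes"))

# Counts and percentages for both outcome levels
fmd_counts <- table(fmd)
freq_table <- data.frame(
  fmd     = names(fmd_counts),
  n       = as.integer(fmd_counts),
  percent = round(100 * as.numeric(prop.table(fmd_counts)), 1)
)
freq_table
```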

1b. (5 pts) Create a descriptive summary table of at least 4 predictors, stratified by FMD status. Use tbl_summary().

brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(physhlth_days, sleep_hrs, age, sex),
    type = list(
      c(physhlth_days, sleep_hrs, age) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs     ~ "Sleep hours",
      age           ~ "Age (years)",
      sex           ~ "Sex"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels()

Characteristic	Overall N = 5,000¹	No N = 4,243¹	Yes N = 757¹	p-value²
Physical unhealthy days	4 (9)	3 (8)	10 (13)	<0.001
Sleep hours	7.00 (1.48)	7.09 (1.40)	6.51 (1.83)	<0.001
Age (years)	56 (16)	57 (16)	50 (16)	<0.001
Sex				<0.001
Male	2,701 (54%)	2,378 (56%)	323 (43%)
Female	2,299 (46%)	1,865 (44%)	434 (57%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.

# Bars are scaled to 100% so each one shows the proportion with and without FMD
ggplot(brfss_logistic, aes(x = exercise, fill = fmd)) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Blues", name = "FMD") +
  labs(
    title = "Frequent Mental Distress by Exercise",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Exercised in past 30 days",
    y = "Proportion of respondents"
  ) +
  theme_minimal(base_size = 13)

Task 2: Simple Logistic Regression (20 points)

2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.

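# Recode the outcome to 0/1 so that 1 = frequent mental distress
brfss_logistic <- brfss_logistic |>
  mutate(fmd_num = as.numeric(fmd == "Yes"))

# Simple logistic regression; "No" is the reference level for exercise
model_exer <- glm(fmd_num ~ exercise, data = brfss_logistic,
                  family = binomial(link = "logit"))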

tidy(model_exer, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1.337	0.068	-19.769	0	-1.471	-1.206
exerciseYes	-0.555	0.083	-6.655	0	-0.718	-0.391
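Written out as an equation, the fitted model in the table above is shown below. The indicator x (1 for respondents who exercised, 0 otherwise) is notation introduced here for illustration.

```latex
\[
\mathrm{logit}(\hat{p}) = \ln\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = -1.337 - 0.555\,x
\]
```

The intercept is the estimated log-odds of FMD among non-exercisers, and the slope is the difference in log-odds between exercisers and non-exercisers.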

2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.

tidy(model_exer, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.263	0.068	-19.769	0	0.230	0.299
exerciseYes	0.574	0.083	-6.655	0	0.488	0.676

2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.

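Respondents who exercised in the past 30 days had 0.574 times the odds of frequent mental distress compared with those who did not exercise (OR = 0.574; 95% CI: 0.488, 0.676), a reduction of about 43% in the odds. This is an odds ratio, not a risk ratio: FMD is not rare in this sample (15.1% prevalence), so the OR should not be read as a 43% reduction in risk. The confidence interval excludes 1.0, so the crude association is statistically significant.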
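The exponentiation in 2b, and the difference between an odds ratio and a ratio of probabilities, can be checked by hand from the two log-odds coefficients reported in 2a. This is a sketch using the rounded coefficients from the table. The object names are illustrative.

```r
# Log-odds coefficients from the simple model (rounded values from the 2a table)
b0 <- -1.337  # intercept: log-odds of FMD among non-exercisers
b1 <- -0.555  # exerciseYes: difference in log-odds for exercisers

# Odds ratio for exercise
or_exercise <- exp(b1)

# Predicted probability of FMD in each group via the inverse logit
p_no_exercise <- plogis(b0)
p_exercise    <- plogis(b0 + b1)

# Ratio of the two predicted probabilities, for comparison with the odds ratio
prob_ratio <- p_exercise / p_no_exercise

round(c(OR = or_exercise, p_no = p_no_exercise, p_yes = p_exercise,
        prob_ratio = prob_ratio), 3)
```

The odds ratio sits further from 1 than the ratio of predicted probabilities, which is the usual pattern when the outcome is not rare.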

2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).

model_age <- glm(fmd ~ age, data = brfss_logistic,
               family = binomial(link = "logit"))

ggpredict(model_age, terms = "age [18:80]") |>
  plot() +
  labs(title = "Predicted Probability of Frequent Mental Distress by Age",
       x = "Age (years)", y = "Predicted Probability of FMD") +
  theme_minimal()
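The curve plotted above is the inverse logit of the linear predictor evaluated at each age. The sketch below shows that computation with base R on simulated data. The simulated coefficients, the seed, and the object names are assumptions, not estimates from the BRFSS sample.

```r
# Simulated stand-in for the lab data: FMD probability declines with age (assumed values)
set.seed(553)
sim <- data.frame(age = sample(18:80, 1000, replace = TRUE))
sim$fmd <- rbinom(nrow(sim), size = 1, prob = plogis(-0.8 - 0.02 * sim$age))

fit_age <- glm(fmd ~ age, data = sim, family = binomial(link = "logit"))

# Predicted probability at selected ages, computed two equivalent ways
age_grid     <- data.frame(age = c(20, 40, 60, 80))
log_odds     <- predict(fit_age, newdata = age_grid, type = "link")
prob_manual  <- unname(plogis(log_odds))
prob_predict <- unname(predict(fit_age, newdata = age_grid, type = "response"))

cbind(age_grid, prob_manual, prob_predict)
```

The two probability columns agree, since `type = "response"` applies the inverse logit to the `type = "link"` values.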

Task 3: Comparing Predictors (20 points)

3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.

model_1 <- glm(fmd ~ sleep_hrs, data = brfss_logistic,
               family = binomial(link = "logit"))

model_2 <- glm(fmd ~ bmi, data = brfss_logistic,
               family = binomial(link = "logit"))

model_3 <- glm(fmd ~ income_cat, data = brfss_logistic,
               family = binomial(link = "logit"))

3b. (10 pts) Create a table comparing the odds ratios from all three models.

library(modelsummary)
modelsummary(list(model_1, model_2, model_3), exponentiate = TRUE)

	(1)	(2)	(3)
(Intercept)	1.101	0.094	0.531
	(0.203)	(0.017)	(0.054)
sleep_hrs	0.765
	(0.021)
bmi		1.023
		(0.006)
income_cat			0.821
			(0.015)
Num.Obs.	5000	5000	5000
AIC	4156.3	4241.7	4134.1
BIC	4169.4	4254.7	4147.2
Log.Lik.	-2076.174	-2118.852	-2065.073
F	96.736	14.023	123.259
RMSE	0.35	0.36	0.35

3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.

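Sleep has the strongest crude association with FMD on a per-unit basis: its odds ratio (0.765 per additional hour of sleep) is the furthest from 1.0, the null value that indicates no association. This comparison depends on scale, because each OR describes a one-unit change in a predictor measured in different units (hours, BMI points, income categories). Judged by model fit instead, the income model has the lowest AIC (4134.1 versus 4156.3 for sleep and 4241.7 for BMI).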
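Because odds ratios are multiplicative, distance from the null is more fairly compared on the log scale, where 0.5 and 2.0 are equally far from 1.0. The sketch below applies that to the three crude ORs from the comparison table. The object names are illustrative, and the ranking remains per unit of each predictor.

```r
# Crude odds ratios from the three simple models
crude_or <- c(sleep_hrs = 0.765, bmi = 1.023, income_cat = 0.821)

# Absolute log-OR: symmetric distance from the null value of OR = 1
log_distance <- abs(log(crude_or))

# Predictors ordered from strongest to weakest per-unit association
sort(log_distance, decreasing = TRUE)
strongest <- names(which.max(log_distance))
strongest
```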

Task 4: Introduction to Multiple Logistic Regression (20 points)

4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.

multi <- glm(fmd ~ sleep_hrs + bmi + income_cat,
             data = brfss_logistic,
             family = binomial(link = "logit"))

4b. (5 pts) Report the adjusted odds ratios using tbl_regression().

multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      sleep_hrs  ~ "Sleep hours (per hour)",
      bmi ~ "BMI",
      income_cat ~ "Income category (per unit)"
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Sleep hours (per hour)	0.78	0.74, 0.82	<0.001
BMI	1.02	1.01, 1.03	0.003
Income category (per unit)	0.83	0.80, 0.86	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.

crude_sleep <- tidy(model_1, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "sleep_hrs") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

adj_sleep <- tidy(multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "sleep_hrs") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted")

bind_rows(crude_sleep, adj_sleep) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
        caption = "Crude vs. Adjusted Odds Ratios") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Crude vs. Adjusted Odds Ratios
Predictor	OR	95% CI Lower	95% CI Upper	Type
sleep_hrs	0.765	0.725	0.807	Crude
sleep_hrs	0.783	0.743	0.825	Adjusted

4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?

Meaningful confounding does not appear to be present for sleep hours: the adjusted OR differs from the crude OR by about 2.4% ((0.783 - 0.765) / 0.765), well below the conventional 10% change-in-estimate threshold. The OR increased from 0.765 (crude) to 0.783 (adjusted), moving slightly closer to the null value of 1.0. This indicates that the association between sleep hours and frequent mental distress is slightly weaker after adjusting for BMI and income, so the crude estimate modestly overstated it.
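The change-in-estimate comparison can be written as a short calculation. This sketch uses the crude and adjusted ORs from the table in 4c and takes the crude OR as the denominator. The 10% cutoff is the rule of thumb applied in this answer, and the object names are illustrative.

```r
# Crude and adjusted odds ratios for sleep hours (from the 4c table)
crude_or    <- 0.765
adjusted_or <- 0.783

# Percent change in the estimate, relative to the crude OR
pct_change <- 100 * (adjusted_or - crude_or) / crude_or

# Flag confounding under the 10% change-in-estimate rule of thumb
exceeds_threshold <- abs(pct_change) >= 10

round(pct_change, 1)
exceeds_threshold
```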

Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.

End of Lab Activity

Logistic Regression — Part 1

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Josh Macera

April 13, 2026