Introduction

Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.

Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:

Understand why linear regression fails for binary outcomes
Learn the logistic model and the logit transformation
Connect logistic regression to the odds ratio, the primary measure of effect in epidemiology
Fit and interpret a simple logistic regression model in R
Extend to multiple logistic regression (introduced today, continued next lecture)

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)

Why/When/Watch-out

Why use logistic regression?

The outcome variable is binary (0/1, yes/no, disease/no disease)
We want to estimate the probability of an event occurring
We need adjusted odds ratios that control for confounders
We want to identify risk factors for a dichotomous health outcome

When to use it?

Cross-sectional studies: prevalence of a condition
Case-control studies: odds of exposure given disease status
Cohort studies: incidence of disease (when outcome is rare, OR approximates RR)

Watch out for:

Separation: when a predictor perfectly predicts the outcome, ML estimation fails
Small sample sizes: logistic regression needs roughly 10-20 events per predictor
Multicollinearity: same issues as in linear regression
Linearity in the logit: the relationship between continuous predictors and the log-odds must be linear

Part 7: In-Class Lab Activity

EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0-30)	Numeric
`physhlth_days`	Physically unhealthy days (0-30)	Numeric
`sleep_hrs`	Hours of sleep per night	Numeric
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

============================================================================

TASK 1: Explore the Binary Outcome (15 points)

============================================================================

# TASK 1a: Frequency table of FMD (Base R version)

# Create frequency table
fmd_counts <- table(brfss_logistic$fmd)
fmd_percent <- prop.table(fmd_counts) * 100

# Create data frame for display
fmd_result <- data.frame(
  `Frequent Mental Distress` = names(fmd_counts),
  Frequency = as.numeric(fmd_counts),
  `Percentage (%)` = round(as.numeric(fmd_percent), 1)
)

# Display table
knitr::kable(fmd_result, 
             caption = "Table 1: Distribution of Frequent Mental Distress")

Table 1: Distribution of Frequent Mental Distress
Frequent.Mental.Distress	Frequency	Percentage….
No	4243	84.9
Yes	757	15.1

1b. Descriptive summary stratified by FMD (5 pts)

brfss_logistic |>
  dplyr::select(fmd, physhlth_days, sleep_hrs, age, sex, bmi, exercise, income_cat, smoker) |>
  tbl_summary(
    by = fmd,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs ~ "Sleep hours (per night)",
      age ~ "Age (years)",
      sex ~ "Sex",
      bmi ~ "BMI",
      exercise ~ "Exercise (past 30 days)",
      income_cat ~ "Income category (1-8)",
      smoker ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels() |>
  modify_caption("Table 2: Participant Characteristics by FMD Status")

Table 2: Participant Characteristics by FMD Status
Characteristic	Overall N = 5,000¹	No N = 4,243¹	Yes N = 757¹
Physical unhealthy days	4.3 (9.1)	3.2 (7.9)	10.4 (12.5)
Sleep hours (per night)	7.0 (1.5)	7.1 (1.4)	6.5 (1.8)
Age (years)	56.4 (16.2)	57.4 (16.1)	50.5 (15.7)
Sex
Male	2,701 (54%)	2,378 (56%)	323 (43%)
Female	2,299 (46%)	1,865 (44%)	434 (57%)
BMI	28.5 (6.3)	28.4 (6.2)	29.3 (7.0)
Exercise (past 30 days)	3,673 (73%)	3,192 (75%)	481 (64%)
Income category (1-8)
1	228 (4.6%)	164 (3.9%)	64 (8.5%)
2	243 (4.9%)	170 (4.0%)	73 (9.6%)
3	360 (7.2%)	279 (6.6%)	81 (11%)
4	480 (9.6%)	389 (9.2%)	91 (12%)
5	542 (11%)	454 (11%)	88 (12%)
6	726 (15%)	625 (15%)	101 (13%)
7	884 (18%)	775 (18%)	109 (14%)
8	1,537 (31%)	1,387 (33%)	150 (20%)
Smoking status
Former/Never	3,280 (66%)	2,886 (68%)	394 (52%)
Current	1,720 (34%)	1,357 (32%)	363 (48%)
¹ Mean (SD); n (%)

1c. Bar chart: FMD by exercise status (5 pts)

# Option 1: By exercise
plot_exercise <- brfss_logistic |>
  group_by(exercise, fmd) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(exercise) |>
  mutate(proportion = count / sum(count)) |>
  filter(fmd == "Yes") |>
  ggplot(aes(x = exercise, y = proportion, fill = exercise)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
  labs(
    title = "Proportion with Frequent Mental Distress by Exercise Status",
    x = "Exercised in Past 30 Days",
    y = "Proportion with FMD"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(plot_exercise)

# Option 2: By smoking status (alternative)
plot_smoker <- brfss_logistic |>
  group_by(smoker, fmd) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(smoker) |>
  mutate(proportion = count / sum(count)) |>
  filter(fmd == "Yes") |>
  ggplot(aes(x = smoker, y = proportion, fill = smoker)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
  labs(
    title = "Proportion with Frequent Mental Distress by Smoking Status",
    x = "Smoking Status",
    y = "Proportion with FMD"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(plot_smoker)

============================================================================

TASK 2: Simple Logistic Regression (20 points)

============================================================================

# 2a. Fit simple logistic regression: FMD ~ exercise (5 pts)
mod_exercise <- glm(fmd ~ exercise, 
                    data = brfss_logistic, 
                    family = binomial(link = "logit"))

# Coefficients on log-odds scale
log_odds_coef <- tidy(mod_exercise, conf.int = FALSE, exponentiate = FALSE)

log_odds_coef |>
  kable(
    digits = 3,
    col.names = c("Term", "Estimate (log-odds)", "Std. Error", "z-statistic", "p-value"),
    caption = "Table 3: Logistic Regression Coefficients (Log-Odds Scale)"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 3: Logistic Regression Coefficients (Log-Odds Scale)
Term	Estimate (log-odds)	Std. Error	z-statistic	p-value
(Intercept)	-1.337	0.068	-19.769	0
exerciseYes	-0.555	0.083	-6.655	0

2b. Exponentiate to get odds ratios with 95% CI (5 pts)

# Task 2b 

# Fit the model
mod_exercise <- glm(fmd ~ exercise, 
                    data = brfss_logistic, 
                    family = binomial(link = "logit"))

# Get odds ratios using broom::tidy
or_results <- broom::tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE)

# Create clean table using dplyr::select
or_table <- or_results |>
  dplyr::mutate(
    across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
    p.value = round(p.value, 4)
  ) |>
  dplyr::select(Term = term, OR = estimate, `CI Lower` = conf.low, 
                `CI Upper` = conf.high, `p-value` = p.value)

# Display the table
or_table |>
  knitr::kable(
    caption = "Table 4: Odds Ratios for Exercise and FMD"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 4: Odds Ratios for Exercise and FMD
Term	OR	CI Lower	CI Upper	p-value
(Intercept)	0.263	0.230	0.299	0
exerciseYes	0.574	0.488	0.676	0

2c. Interpret the odds ratio (5 pts)

The odds of frequent mental distress among individuals who exercised in the past 30 days are 0.574 times the odds among those who did not exercise (95% CI: 0.488 to 0.676, p < 0.001).

Since the odds ratio is less than 1 and the confidence interval does not include 1, this indicates that exercise is statistically significantly associated with lower odds of frequent mental distress. Specifically, individuals who exercise have approximately 42.6% lower odds of experiencing frequent mental distress compared to those who do not exercise (calculated as 1 - 0.574 = 0.426, or 42.6%).

2d. Plot predicted probability for continuous predictor (5 pts)

# Using age as the continuous predictor
mod_age <- glm(fmd ~ age, 
               data = brfss_logistic, 
               family = binomial(link = "logit"))

# Create prediction plot
pred_age <- ggpredict(mod_age, terms = "age [18:80]")

plot_age <- plot(pred_age) +
  labs(
    title = "Predicted Probability of Frequent Mental Distress by Age",
    subtitle = "From simple logistic regression model",
    x = "Age (years)",
    y = "Predicted Probability of FMD"
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.4)) +
  theme_minimal()

print(plot_age)

# Alternative: Using sleep hours
mod_sleep <- glm(fmd ~ sleep_hrs, 
                 data = brfss_logistic, 
                 family = binomial(link = "logit"))

pred_sleep <- ggpredict(mod_sleep, terms = "sleep_hrs [3:12]")

plot_sleep <- plot(pred_sleep) +
  labs(
    title = "Predicted Probability of Frequent Mental Distress by Sleep Hours",
    subtitle = "From simple logistic regression model",
    x = "Sleep Hours (per night)",
    y = "Predicted Probability of FMD"
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  theme_minimal()

print(plot_sleep)

============================================================================

TASK 3: Comparing Predictors (20 points)

============================================================================

# 3a. Fit three separate simple logistic regression models
mod_sleep <- glm(fmd ~ sleep_hrs, data = brfss_logistic, family = binomial)
mod_bmi <- glm(fmd ~ bmi, data = brfss_logistic, family = binomial)
mod_income <- glm(fmd ~ income_cat, data = brfss_logistic, family = binomial)

3b. Create table comparing ORs from all three models

# Function to extract OR, CI, and p-value from a model
get_or_table <- function(model, predictor_name) {
  # Get coefficients and exponentiate
  coef_est <- coef(model)
  coef_ci <- confint(model)
  p_values <- summary(model)$coefficients[, 4]
  
  # Get the predictor coefficient (excluding intercept)
  pred_idx <- which(names(coef_est) != "(Intercept)")
  
  # Calculate OR and CI
  or_value <- exp(coef_est[pred_idx])
  or_ci <- exp(coef_ci[pred_idx, ])
  
  # Create data frame
  data.frame(
    Predictor = predictor_name,
    OR = round(or_value, 3),
    CI_Lower = round(or_ci[1], 3),
    CI_Upper = round(or_ci[2], 3),
    p_value = round(p_values[pred_idx], 4)
  )
}

# Apply function to each model
compare_or <- rbind(
  get_or_table(mod_sleep, "Sleep Hours (per hour)"),
  get_or_table(mod_bmi, "BMI (per unit)"),
  get_or_table(mod_income, "Income Category (per unit)")
)

# Display table
compare_or |>
  knitr::kable(
    col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
    caption = "Table 5: Comparison of Crude Odds Ratios from Simple Logistic Models"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 5: Comparison of Crude Odds Ratios from Simple Logistic Models
	Predictor	OR	95% CI Lower	95% CI Upper	p-value
sleep_hrs	Sleep Hours (per hour)	0.765	0.725	0.807	0e+00
bmi	BMI (per unit)	1.023	1.011	1.034	2e-04
income_cat	Income Category (per unit)	0.821	0.793	0.850	0e+00

3c. Which predictor has the strongest crude association? (5 pts)

# ============================================================================
# Task 3c: Determine which predictor has the strongest crude association
# ============================================================================

# Extract beta coefficients (log-odds) from each model
sleep_beta <- coef(mod_sleep)["sleep_hrs"]
bmi_beta <- coef(mod_bmi)["bmi"]
income_beta <- coef(mod_income)["income_cat"]

# Calculate absolute values
sleep_abs <- abs(sleep_beta)
bmi_abs <- abs(bmi_beta)
income_abs <- abs(income_beta)

# Get p-values
sleep_p <- summary(mod_sleep)$coefficients["sleep_hrs", 4]
bmi_p <- summary(mod_bmi)$coefficients["bmi", 4]
income_p <- summary(mod_income)$coefficients["income_cat", 4]

# Create comparison data frame
beta_comparison <- data.frame(
  Predictor = c("Sleep Hours", "BMI", "Income Category"),
  OR = c(exp(sleep_beta), exp(bmi_beta), exp(income_beta)),
  Beta = round(c(sleep_beta, bmi_beta, income_beta), 4),
  Abs_Beta = round(c(sleep_abs, bmi_abs, income_abs), 4),
  p_value = round(c(sleep_p, bmi_p, income_p), 4)
)

# Sort by absolute beta (largest first)
beta_comparison <- beta_comparison[order(-beta_comparison$Abs_Beta), ]

# Display the table
beta_comparison |>
  knitr::kable(
    col.names = c("Predictor", "OR", "Beta (log-odds)", "|Beta|", "p-value"),
    caption = "Table 6: Comparison of Effect Magnitudes (Absolute Beta Coefficients)"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 6: Comparison of Effect Magnitudes (Absolute Beta Coefficients)
	Predictor	OR	Beta (log-odds)	\|Beta\|	p-value
sleep_hrs	Sleep Hours	0.7652127	-0.2676	0.2676	0e+00
income_cat	Income Category	0.8213139	-0.1968	0.1968	0e+00
bmi	BMI	1.0225243	0.0223	0.0223	2e-04

# Identify the strongest predictor
strongest <- beta_comparison$Predictor[1]
strongest_beta <- beta_comparison$Abs_Beta[1]

# Print the answer
cat("\n========================================\n")

## 
## ========================================

cat("Task 3c Answer:\n")

## Task 3c Answer:

cat("========================================\n")

## ========================================

cat("The predictor with the strongest crude association with FMD is:", strongest, "\n")

## The predictor with the strongest crude association with FMD is: Sleep Hours

cat("Absolute beta coefficient:", strongest_beta, "\n\n")

## Absolute beta coefficient: 0.2676

cat("Justification:\n")

## Justification:

cat("- Sleep hours |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Sleep Hours"], "\n")

## - Sleep hours |β| = 0.2676

cat("- BMI |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "BMI"], "\n")

## - BMI |β| = 0.0223

cat("- Income Category |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Income Category"], "\n")

## - Income Category |β| = 0.1968

cat("\nSleep hours has the largest absolute beta coefficient, meaning each additional\n")

## 
## Sleep hours has the largest absolute beta coefficient, meaning each additional

cat("hour of sleep produces the largest change in the log-odds of FMD compared to\n")

## hour of sleep produces the largest change in the log-odds of FMD compared to

cat("a one-unit change in BMI or income category.\n")

## a one-unit change in BMI or income category.

============================================================================

TASK 4: Introduction to Multiple Logistic Regression (20 points)

============================================================================

# 4a. Fit multiple logistic regression with at least 3 predictors (5 pts)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))

# 4b. Report adjusted odds ratios using tbl_regression() (5 pts)
tbl_multi <- mod_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise ~ "Exercise (past 30 days)",
      age ~ "Age (per year)",
      sex ~ "Sex",
      smoker ~ "Smoking status",
      sleep_hrs ~ "Sleep hours (per hour)",
      income_cat ~ "Income category (per unit)"
    ),
    ci = TRUE
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("Table 7: Multiple Logistic Regression - Adjusted Odds Ratios")

tbl_multi

Table 7: Multiple Logistic Regression - Adjusted Odds Ratios
Characteristic	OR	95% CI	p-value
Exercise (past 30 days)
No	—	—
Yes	0.61	0.51, 0.73	<0.001
Age (per year)	0.97	0.97, 0.98	<0.001
Sex
Male	—	—
Female	1.67	1.42, 1.96	<0.001
Smoking status
Former/Never	—	—
Current	1.25	1.05, 1.48	0.012
Sleep hours (per hour)	0.81	0.77, 0.86	<0.001
Income category (per unit)	0.85	0.82, 0.89	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4c. Compare crude OR vs adjusted OR for one predictor (5 pts)

# ============================================================================
# Task 4c: Compare crude OR vs adjusted OR for one predictor
# ============================================================================

# First, make sure models are fit
# Simple model (crude)
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)

# Multiple model (adjusted)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic, family = binomial)

# Get crude OR for exercise
crude_or <- exp(coef(mod_exercise)["exerciseYes"])
crude_ci <- exp(confint(mod_exercise)["exerciseYes", ])
crude_p <- summary(mod_exercise)$coefficients["exerciseYes", 4]

# Get adjusted OR for exercise
adj_or <- exp(coef(mod_multi)["exerciseYes"])
adj_ci <- exp(confint(mod_multi)["exerciseYes", ])
adj_p <- summary(mod_multi)$coefficients["exerciseYes", 4]

# Create comparison table
comparison_exercise <- data.frame(
  Model = c("Crude (Unadjusted)", "Adjusted"),
  OR = round(c(crude_or, adj_or), 3),
  CI_Lower = round(c(crude_ci[1], adj_ci[1]), 3),
  CI_Upper = round(c(crude_ci[2], adj_ci[2]), 3),
  p_value = round(c(crude_p, adj_p), 4)
)

# Display the table
comparison_exercise |>
  knitr::kable(
    col.names = c("Model", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
    caption = "Table 8: Crude vs. Adjusted Odds Ratios for Exercise"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 8: Crude vs. Adjusted Odds Ratios for Exercise
Model	OR	95% CI Lower	95% CI Upper	p-value
Crude (Unadjusted)	0.574	0.488	0.676	0
Adjusted	0.613	0.514	0.731	0

# Calculate percent change
percent_change <- ((adj_or - crude_or) / crude_or) * 100

# Print interpretation for 4d
cat("\n========================================\n")

## 
## ========================================

cat("Task 4c & 4d: Confounding Assessment\n")

## Task 4c & 4d: Confounding Assessment

cat("========================================\n")

## ========================================

cat("Crude OR:", round(crude_or, 3), "\n")

## Crude OR: 0.574

cat("Adjusted OR:", round(adj_or, 3), "\n")

## Adjusted OR: 0.613

cat("Percent change:", round(percent_change, 1), "%\n\n")

## Percent change: 6.8 %

if (abs(adj_or - crude_or) > 0.01) {
  if (adj_or < crude_or) {
    cat("The adjusted OR is LOWER than the crude OR.\n")
    cat("This indicates that confounding variables were inflating the association.\n")
    cat("After controlling for age, sex, smoking, sleep, and income,\n")
    cat("the protective effect of exercise attenuated (moved toward 1).\n")
  } else if (adj_or > crude_or) {
    cat("The adjusted OR is HIGHER than the crude OR.\n")
    cat("This indicates that confounding variables were masking the association.\n")
  }
} else {
  cat("The crude and adjusted ORs are similar, suggesting little confounding.\n")
}

## The adjusted OR is HIGHER than the crude OR.
## This indicates that confounding variables were masking the association.

4d. Assess confounding (5 pts)

Crude OR: 0.574 Adjusted OR: 0.613 Percent change: 6.8 %

The adjusted OR is HIGHER than the crude OR. This indicates that confounding variables were masking the association.

============================================================================

BONUS: Additional Model Diagnostics and Visualizations

============================================================================

# Likelihood Ratio Test comparing full model to null
null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)

lr_test <- anova(null_mod, mod_multi, test = "Chisq")

lr_test |>
  kable(
    digits = 3,
    caption = "Table 10: Likelihood Ratio Test - Full Model vs. Null"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 10: Likelihood Ratio Test - Full Model vs. Null
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4999	4251.299	NA	NA	NA
4993	3870.197	6	381.101	0

End of Lab Activity

Logistic Regression — Part 1

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Fizza Zaheer

April 12, 2026

Introduction

Why/When/Watch-out

Why use logistic regression?

When to use it?

Watch out for:

Part 7: In-Class Lab Activity

Instructions

Data

============================================================================

TASK 1: Explore the Binary Outcome (15 points)

============================================================================

1b. Descriptive summary stratified by FMD (5 pts)

1c. Bar chart: FMD by exercise status (5 pts)

============================================================================

TASK 2: Simple Logistic Regression (20 points)

============================================================================

2b. Exponentiate to get odds ratios with 95% CI (5 pts)

2c. Interpret the odds ratio (5 pts)

2d. Plot predicted probability for continuous predictor (5 pts)

============================================================================

TASK 3: Comparing Predictors (20 points)

============================================================================

3b. Create table comparing ORs from all three models

3c. Which predictor has the strongest crude association? (5 pts)

============================================================================

TASK 4: Introduction to Multiple Logistic Regression (20 points)

============================================================================

4c. Compare crude OR vs adjusted OR for one predictor (5 pts)

4d. Assess confounding (5 pts)

============================================================================

BONUS: Additional Model Diagnostics and Visualizations

============================================================================