Introduction

Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.

Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:

  1. Understand why linear regression fails for binary outcomes
  2. Learn the logistic model and the logit transformation
  3. Connect logistic regression to the odds ratio, the primary measure of effect in epidemiology
  4. Fit and interpret a simple logistic regression model in R
  5. Extend to multiple logistic regression (introduced today, continued next lecture)

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)


Why/When/Watch-out

Why use logistic regression?

  • The outcome variable is binary (0/1, yes/no, disease/no disease)
  • We want to estimate the probability of an event occurring
  • We need adjusted odds ratios that control for confounders
  • We want to identify risk factors for a dichotomous health outcome

When to use it?

  • Cross-sectional studies: prevalence of a condition
  • Case-control studies: odds of exposure given disease status
  • Cohort studies: incidence of disease (when outcome is rare, OR approximates RR)

Watch out for:

  • Separation: when a predictor perfectly predicts the outcome, ML estimation fails
  • Small sample sizes: logistic regression needs roughly 10-20 events per predictor
  • Multicollinearity: same issues as in linear regression
  • Linearity in the logit: the relationship between continuous predictors and the log-odds must be linear

Part 7: In-Class Lab Activity

EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable Description Type
fmd Frequent mental distress (No/Yes) Factor (outcome)
menthlth_days Mentally unhealthy days (0-30) Numeric
physhlth_days Physically unhealthy days (0-30) Numeric
sleep_hrs Hours of sleep per night Numeric
age Age in years Numeric
sex Male / Female Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income category (1-8) Numeric
smoker Former/Never vs. Current Factor

============================================================================

TASK 1: Explore the Binary Outcome (15 points)

============================================================================

# TASK 1a: Frequency table of FMD (Base R version)

# Create frequency table
fmd_counts <- table(brfss_logistic$fmd)
fmd_percent <- prop.table(fmd_counts) * 100

# Create data frame for display
fmd_result <- data.frame(
  `Frequent Mental Distress` = names(fmd_counts),
  Frequency = as.numeric(fmd_counts),
  `Percentage (%)` = round(as.numeric(fmd_percent), 1)
)

# Display table
knitr::kable(fmd_result, 
             caption = "Table 1: Distribution of Frequent Mental Distress")
Table 1: Distribution of Frequent Mental Distress
Frequent.Mental.Distress Frequency Percentage….
No 4243 84.9
Yes 757 15.1

1b. Descriptive summary stratified by FMD (5 pts)

brfss_logistic |>
  dplyr::select(fmd, physhlth_days, sleep_hrs, age, sex, bmi, exercise, income_cat, smoker) |>
  tbl_summary(
    by = fmd,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs ~ "Sleep hours (per night)",
      age ~ "Age (years)",
      sex ~ "Sex",
      bmi ~ "BMI",
      exercise ~ "Exercise (past 30 days)",
      income_cat ~ "Income category (1-8)",
      smoker ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels() |>
  modify_caption("Table 2: Participant Characteristics by FMD Status")
Table 2: Participant Characteristics by FMD Status
Characteristic Overall
N = 5,000
1
No
N = 4,243
1
Yes
N = 757
1
p-value
Physical unhealthy days 4.3 (9.1) 3.2 (7.9) 10.4 (12.5)
Sleep hours (per night) 7.0 (1.5) 7.1 (1.4) 6.5 (1.8)
Age (years) 56.4 (16.2) 57.4 (16.1) 50.5 (15.7)
Sex



    Male 2,701 (54%) 2,378 (56%) 323 (43%)
    Female 2,299 (46%) 1,865 (44%) 434 (57%)
BMI 28.5 (6.3) 28.4 (6.2) 29.3 (7.0)
Exercise (past 30 days) 3,673 (73%) 3,192 (75%) 481 (64%)
Income category (1-8)



    1 228 (4.6%) 164 (3.9%) 64 (8.5%)
    2 243 (4.9%) 170 (4.0%) 73 (9.6%)
    3 360 (7.2%) 279 (6.6%) 81 (11%)
    4 480 (9.6%) 389 (9.2%) 91 (12%)
    5 542 (11%) 454 (11%) 88 (12%)
    6 726 (15%) 625 (15%) 101 (13%)
    7 884 (18%) 775 (18%) 109 (14%)
    8 1,537 (31%) 1,387 (33%) 150 (20%)
Smoking status



    Former/Never 3,280 (66%) 2,886 (68%) 394 (52%)
    Current 1,720 (34%) 1,357 (32%) 363 (48%)
1 Mean (SD); n (%)

1c. Bar chart: FMD by exercise status (5 pts)

# Option 1: By exercise
plot_exercise <- brfss_logistic |>
  group_by(exercise, fmd) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(exercise) |>
  mutate(proportion = count / sum(count)) |>
  filter(fmd == "Yes") |>
  ggplot(aes(x = exercise, y = proportion, fill = exercise)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
  labs(
    title = "Proportion with Frequent Mental Distress by Exercise Status",
    x = "Exercised in Past 30 Days",
    y = "Proportion with FMD"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(plot_exercise)

# Option 2: By smoking status (alternative)
plot_smoker <- brfss_logistic |>
  group_by(smoker, fmd) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(smoker) |>
  mutate(proportion = count / sum(count)) |>
  filter(fmd == "Yes") |>
  ggplot(aes(x = smoker, y = proportion, fill = smoker)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
  labs(
    title = "Proportion with Frequent Mental Distress by Smoking Status",
    x = "Smoking Status",
    y = "Proportion with FMD"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(plot_smoker)

============================================================================

TASK 2: Simple Logistic Regression (20 points)

============================================================================

# 2a. Fit simple logistic regression: FMD ~ exercise (5 pts)
mod_exercise <- glm(fmd ~ exercise, 
                    data = brfss_logistic, 
                    family = binomial(link = "logit"))

# Coefficients on log-odds scale
log_odds_coef <- tidy(mod_exercise, conf.int = FALSE, exponentiate = FALSE)

log_odds_coef |>
  kable(
    digits = 3,
    col.names = c("Term", "Estimate (log-odds)", "Std. Error", "z-statistic", "p-value"),
    caption = "Table 3: Logistic Regression Coefficients (Log-Odds Scale)"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 3: Logistic Regression Coefficients (Log-Odds Scale)
Term Estimate (log-odds) Std. Error z-statistic p-value
(Intercept) -1.337 0.068 -19.769 0
exerciseYes -0.555 0.083 -6.655 0

2b. Exponentiate to get odds ratios with 95% CI (5 pts)

# Task 2b 

# Fit the model
mod_exercise <- glm(fmd ~ exercise, 
                    data = brfss_logistic, 
                    family = binomial(link = "logit"))

# Get odds ratios using broom::tidy
or_results <- broom::tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE)

# Create clean table using dplyr::select
or_table <- or_results |>
  dplyr::mutate(
    across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
    p.value = round(p.value, 4)
  ) |>
  dplyr::select(Term = term, OR = estimate, `CI Lower` = conf.low, 
                `CI Upper` = conf.high, `p-value` = p.value)

# Display the table
or_table |>
  knitr::kable(
    caption = "Table 4: Odds Ratios for Exercise and FMD"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 4: Odds Ratios for Exercise and FMD
Term OR CI Lower CI Upper p-value
(Intercept) 0.263 0.230 0.299 0
exerciseYes 0.574 0.488 0.676 0

2c. Interpret the odds ratio (5 pts)

The odds of frequent mental distress among individuals who exercised in the past 30 days are 0.574 times the odds among those who did not exercise (95% CI: 0.488 to 0.676, p < 0.001).

Since the odds ratio is less than 1 and the confidence interval does not include 1, this indicates that exercise is statistically significantly associated with lower odds of frequent mental distress. Specifically, individuals who exercise have approximately 42.6% lower odds of experiencing frequent mental distress compared to those who do not exercise (calculated as 1 - 0.574 = 0.426, or 42.6%).

2d. Plot predicted probability for continuous predictor (5 pts)

# Using age as the continuous predictor
mod_age <- glm(fmd ~ age, 
               data = brfss_logistic, 
               family = binomial(link = "logit"))

# Create prediction plot
pred_age <- ggpredict(mod_age, terms = "age [18:80]")

plot_age <- plot(pred_age) +
  labs(
    title = "Predicted Probability of Frequent Mental Distress by Age",
    subtitle = "From simple logistic regression model",
    x = "Age (years)",
    y = "Predicted Probability of FMD"
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.4)) +
  theme_minimal()

print(plot_age)

# Alternative: Using sleep hours
mod_sleep <- glm(fmd ~ sleep_hrs, 
                 data = brfss_logistic, 
                 family = binomial(link = "logit"))

pred_sleep <- ggpredict(mod_sleep, terms = "sleep_hrs [3:12]")

plot_sleep <- plot(pred_sleep) +
  labs(
    title = "Predicted Probability of Frequent Mental Distress by Sleep Hours",
    subtitle = "From simple logistic regression model",
    x = "Sleep Hours (per night)",
    y = "Predicted Probability of FMD"
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  theme_minimal()

print(plot_sleep)

============================================================================

TASK 3: Comparing Predictors (20 points)

============================================================================

# 3a. Fit three separate simple logistic regression models
mod_sleep <- glm(fmd ~ sleep_hrs, data = brfss_logistic, family = binomial)
mod_bmi <- glm(fmd ~ bmi, data = brfss_logistic, family = binomial)
mod_income <- glm(fmd ~ income_cat, data = brfss_logistic, family = binomial)

3b. Create table comparing ORs from all three models

# Function to extract OR, CI, and p-value from a model
get_or_table <- function(model, predictor_name) {
  # Get coefficients and exponentiate
  coef_est <- coef(model)
  coef_ci <- confint(model)
  p_values <- summary(model)$coefficients[, 4]
  
  # Get the predictor coefficient (excluding intercept)
  pred_idx <- which(names(coef_est) != "(Intercept)")
  
  # Calculate OR and CI
  or_value <- exp(coef_est[pred_idx])
  or_ci <- exp(coef_ci[pred_idx, ])
  
  # Create data frame
  data.frame(
    Predictor = predictor_name,
    OR = round(or_value, 3),
    CI_Lower = round(or_ci[1], 3),
    CI_Upper = round(or_ci[2], 3),
    p_value = round(p_values[pred_idx], 4)
  )
}

# Apply function to each model
compare_or <- rbind(
  get_or_table(mod_sleep, "Sleep Hours (per hour)"),
  get_or_table(mod_bmi, "BMI (per unit)"),
  get_or_table(mod_income, "Income Category (per unit)")
)

# Display table
compare_or |>
  knitr::kable(
    col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
    caption = "Table 5: Comparison of Crude Odds Ratios from Simple Logistic Models"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 5: Comparison of Crude Odds Ratios from Simple Logistic Models
Predictor OR 95% CI Lower 95% CI Upper p-value
sleep_hrs Sleep Hours (per hour) 0.765 0.725 0.807 0e+00
bmi BMI (per unit) 1.023 1.011 1.034 2e-04
income_cat Income Category (per unit) 0.821 0.793 0.850 0e+00

3c. Which predictor has the strongest crude association? (5 pts)

# ============================================================================
# Task 3c: Determine which predictor has the strongest crude association
# ============================================================================

# Extract beta coefficients (log-odds) from each model
sleep_beta <- coef(mod_sleep)["sleep_hrs"]
bmi_beta <- coef(mod_bmi)["bmi"]
income_beta <- coef(mod_income)["income_cat"]

# Calculate absolute values
sleep_abs <- abs(sleep_beta)
bmi_abs <- abs(bmi_beta)
income_abs <- abs(income_beta)

# Get p-values
sleep_p <- summary(mod_sleep)$coefficients["sleep_hrs", 4]
bmi_p <- summary(mod_bmi)$coefficients["bmi", 4]
income_p <- summary(mod_income)$coefficients["income_cat", 4]

# Create comparison data frame
beta_comparison <- data.frame(
  Predictor = c("Sleep Hours", "BMI", "Income Category"),
  OR = c(exp(sleep_beta), exp(bmi_beta), exp(income_beta)),
  Beta = round(c(sleep_beta, bmi_beta, income_beta), 4),
  Abs_Beta = round(c(sleep_abs, bmi_abs, income_abs), 4),
  p_value = round(c(sleep_p, bmi_p, income_p), 4)
)

# Sort by absolute beta (largest first)
beta_comparison <- beta_comparison[order(-beta_comparison$Abs_Beta), ]

# Display the table
beta_comparison |>
  knitr::kable(
    col.names = c("Predictor", "OR", "Beta (log-odds)", "|Beta|", "p-value"),
    caption = "Table 6: Comparison of Effect Magnitudes (Absolute Beta Coefficients)"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 6: Comparison of Effect Magnitudes (Absolute Beta Coefficients)
Predictor OR Beta (log-odds) &#124;Beta&#124; p-value
sleep_hrs Sleep Hours 0.7652127 -0.2676 0.2676 0e+00
income_cat Income Category 0.8213139 -0.1968 0.1968 0e+00
bmi BMI 1.0225243 0.0223 0.0223 2e-04
# Identify the strongest predictor
strongest <- beta_comparison$Predictor[1]
strongest_beta <- beta_comparison$Abs_Beta[1]

# Print the answer
cat("\n========================================\n")
## 
## ========================================
cat("Task 3c Answer:\n")
## Task 3c Answer:
cat("========================================\n")
## ========================================
cat("The predictor with the strongest crude association with FMD is:", strongest, "\n")
## The predictor with the strongest crude association with FMD is: Sleep Hours
cat("Absolute beta coefficient:", strongest_beta, "\n\n")
## Absolute beta coefficient: 0.2676
cat("Justification:\n")
## Justification:
cat("- Sleep hours |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Sleep Hours"], "\n")
## - Sleep hours |β| = 0.2676
cat("- BMI |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "BMI"], "\n")
## - BMI |β| = 0.0223
cat("- Income Category |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Income Category"], "\n")
## - Income Category |β| = 0.1968
cat("\nSleep hours has the largest absolute beta coefficient, meaning each additional\n")
## 
## Sleep hours has the largest absolute beta coefficient, meaning each additional
cat("hour of sleep produces the largest change in the log-odds of FMD compared to\n")
## hour of sleep produces the largest change in the log-odds of FMD compared to
cat("a one-unit change in BMI or income category.\n")
## a one-unit change in BMI or income category.

============================================================================

TASK 4: Introduction to Multiple Logistic Regression (20 points)

============================================================================

# 4a. Fit multiple logistic regression with at least 3 predictors (5 pts)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))
# 4b. Report adjusted odds ratios using tbl_regression() (5 pts)
tbl_multi <- mod_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise ~ "Exercise (past 30 days)",
      age ~ "Age (per year)",
      sex ~ "Sex",
      smoker ~ "Smoking status",
      sleep_hrs ~ "Sleep hours (per hour)",
      income_cat ~ "Income category (per unit)"
    ),
    ci = TRUE
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("Table 7: Multiple Logistic Regression - Adjusted Odds Ratios")

tbl_multi
Table 7: Multiple Logistic Regression - Adjusted Odds Ratios
Characteristic OR 95% CI p-value
Exercise (past 30 days)


    No
    Yes 0.61 0.51, 0.73 <0.001
Age (per year) 0.97 0.97, 0.98 <0.001
Sex


    Male
    Female 1.67 1.42, 1.96 <0.001
Smoking status


    Former/Never
    Current 1.25 1.05, 1.48 0.012
Sleep hours (per hour) 0.81 0.77, 0.86 <0.001
Income category (per unit) 0.85 0.82, 0.89 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4c. Compare crude OR vs adjusted OR for one predictor (5 pts)

# ============================================================================
# Task 4c: Compare crude OR vs adjusted OR for one predictor
# ============================================================================

# First, make sure models are fit
# Simple model (crude)
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)

# Multiple model (adjusted)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic, family = binomial)

# Get crude OR for exercise
crude_or <- exp(coef(mod_exercise)["exerciseYes"])
crude_ci <- exp(confint(mod_exercise)["exerciseYes", ])
crude_p <- summary(mod_exercise)$coefficients["exerciseYes", 4]

# Get adjusted OR for exercise
adj_or <- exp(coef(mod_multi)["exerciseYes"])
adj_ci <- exp(confint(mod_multi)["exerciseYes", ])
adj_p <- summary(mod_multi)$coefficients["exerciseYes", 4]

# Create comparison table
comparison_exercise <- data.frame(
  Model = c("Crude (Unadjusted)", "Adjusted"),
  OR = round(c(crude_or, adj_or), 3),
  CI_Lower = round(c(crude_ci[1], adj_ci[1]), 3),
  CI_Upper = round(c(crude_ci[2], adj_ci[2]), 3),
  p_value = round(c(crude_p, adj_p), 4)
)

# Display the table
comparison_exercise |>
  knitr::kable(
    col.names = c("Model", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
    caption = "Table 8: Crude vs. Adjusted Odds Ratios for Exercise"
  ) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 8: Crude vs. Adjusted Odds Ratios for Exercise
Model OR 95% CI Lower 95% CI Upper p-value
Crude (Unadjusted) 0.574 0.488 0.676 0
Adjusted 0.613 0.514 0.731 0
# Calculate percent change
percent_change <- ((adj_or - crude_or) / crude_or) * 100

# Print interpretation for 4d
cat("\n========================================\n")
## 
## ========================================
cat("Task 4c & 4d: Confounding Assessment\n")
## Task 4c & 4d: Confounding Assessment
cat("========================================\n")
## ========================================
cat("Crude OR:", round(crude_or, 3), "\n")
## Crude OR: 0.574
cat("Adjusted OR:", round(adj_or, 3), "\n")
## Adjusted OR: 0.613
cat("Percent change:", round(percent_change, 1), "%\n\n")
## Percent change: 6.8 %
if (abs(adj_or - crude_or) > 0.01) {
  if (adj_or < crude_or) {
    cat("The adjusted OR is LOWER than the crude OR.\n")
    cat("This indicates that confounding variables were inflating the association.\n")
    cat("After controlling for age, sex, smoking, sleep, and income,\n")
    cat("the protective effect of exercise attenuated (moved toward 1).\n")
  } else if (adj_or > crude_or) {
    cat("The adjusted OR is HIGHER than the crude OR.\n")
    cat("This indicates that confounding variables were masking the association.\n")
  }
} else {
  cat("The crude and adjusted ORs are similar, suggesting little confounding.\n")
}
## The adjusted OR is HIGHER than the crude OR.
## This indicates that confounding variables were masking the association.

4d. Assess confounding (5 pts)

Crude OR: 0.574 Adjusted OR: 0.613 Percent change: 6.8 %

The adjusted OR is HIGHER than the crude OR. This indicates that confounding variables were masking the association.

============================================================================

BONUS: Additional Model Diagnostics and Visualizations

============================================================================

# Likelihood Ratio Test comparing full model to null
null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)

lr_test <- anova(null_mod, mod_multi, test = "Chisq")

lr_test |>
  kable(
    digits = 3,
    caption = "Table 10: Likelihood Ratio Test - Full Model vs. Null"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 10: Likelihood Ratio Test - Full Model vs. Null
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
4999 4251.299 NA NA NA
4993 3870.197 6 381.101 0

End of Lab Activity