Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.
Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:
Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)
EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026
Complete the four tasks below using the BRFSS 2020 dataset
(brfss_logistic_2020.rds). Submit a knitted HTML file via
Brightspace. You may collaborate, but each student must submit their own
work.
| Variable | Description | Type |
|---|---|---|
fmd |
Frequent mental distress (No/Yes) | Factor (outcome) |
menthlth_days |
Mentally unhealthy days (0-30) | Numeric |
physhlth_days |
Physically unhealthy days (0-30) | Numeric |
sleep_hrs |
Hours of sleep per night | Numeric |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income category (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
# TASK 1a: Frequency table of FMD (Base R version)
# Create frequency table
fmd_counts <- table(brfss_logistic$fmd)
fmd_percent <- prop.table(fmd_counts) * 100
# Create data frame for display
fmd_result <- data.frame(
`Frequent Mental Distress` = names(fmd_counts),
Frequency = as.numeric(fmd_counts),
`Percentage (%)` = round(as.numeric(fmd_percent), 1)
)
# Display table
knitr::kable(fmd_result,
caption = "Table 1: Distribution of Frequent Mental Distress")| Frequent.Mental.Distress | Frequency | Percentage…. |
|---|---|---|
| No | 4243 | 84.9 |
| Yes | 757 | 15.1 |
brfss_logistic |>
dplyr::select(fmd, physhlth_days, sleep_hrs, age, sex, bmi, exercise, income_cat, smoker) |>
tbl_summary(
by = fmd,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
label = list(
physhlth_days ~ "Physical unhealthy days",
sleep_hrs ~ "Sleep hours (per night)",
age ~ "Age (years)",
sex ~ "Sex",
bmi ~ "BMI",
exercise ~ "Exercise (past 30 days)",
income_cat ~ "Income category (1-8)",
smoker ~ "Smoking status"
)
) |>
add_overall() |>
add_p() |>
bold_labels() |>
modify_caption("Table 2: Participant Characteristics by FMD Status")| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value |
|---|---|---|---|---|
| Physical unhealthy days | 4.3 (9.1) | 3.2 (7.9) | 10.4 (12.5) | |
| Sleep hours (per night) | 7.0 (1.5) | 7.1 (1.4) | 6.5 (1.8) | |
| Age (years) | 56.4 (16.2) | 57.4 (16.1) | 50.5 (15.7) | |
| Sex | ||||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| BMI | 28.5 (6.3) | 28.4 (6.2) | 29.3 (7.0) | |
| Exercise (past 30 days) | 3,673 (73%) | 3,192 (75%) | 481 (64%) | |
| Income category (1-8) | ||||
| 1 | 228 (4.6%) | 164 (3.9%) | 64 (8.5%) | |
| 2 | 243 (4.9%) | 170 (4.0%) | 73 (9.6%) | |
| 3 | 360 (7.2%) | 279 (6.6%) | 81 (11%) | |
| 4 | 480 (9.6%) | 389 (9.2%) | 91 (12%) | |
| 5 | 542 (11%) | 454 (11%) | 88 (12%) | |
| 6 | 726 (15%) | 625 (15%) | 101 (13%) | |
| 7 | 884 (18%) | 775 (18%) | 109 (14%) | |
| 8 | 1,537 (31%) | 1,387 (33%) | 150 (20%) | |
| Smoking status | ||||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| 1 Mean (SD); n (%) | ||||
# Option 1: By exercise
plot_exercise <- brfss_logistic |>
group_by(exercise, fmd) |>
summarise(count = n(), .groups = "drop") |>
group_by(exercise) |>
mutate(proportion = count / sum(count)) |>
filter(fmd == "Yes") |>
ggplot(aes(x = exercise, y = proportion, fill = exercise)) +
geom_col(width = 0.6) +
geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
vjust = -0.5, size = 5) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
labs(
title = "Proportion with Frequent Mental Distress by Exercise Status",
x = "Exercised in Past 30 Days",
y = "Proportion with FMD"
) +
theme_minimal() +
theme(legend.position = "none")
print(plot_exercise)# Option 2: By smoking status (alternative)
plot_smoker <- brfss_logistic |>
group_by(smoker, fmd) |>
summarise(count = n(), .groups = "drop") |>
group_by(smoker) |>
mutate(proportion = count / sum(count)) |>
filter(fmd == "Yes") |>
ggplot(aes(x = smoker, y = proportion, fill = smoker)) +
geom_col(width = 0.6) +
geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
vjust = -0.5, size = 5) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.5)) +
labs(
title = "Proportion with Frequent Mental Distress by Smoking Status",
x = "Smoking Status",
y = "Proportion with FMD"
) +
theme_minimal() +
theme(legend.position = "none")
print(plot_smoker)# 2a. Fit simple logistic regression: FMD ~ exercise (5 pts)
mod_exercise <- glm(fmd ~ exercise,
data = brfss_logistic,
family = binomial(link = "logit"))
# Coefficients on log-odds scale
log_odds_coef <- tidy(mod_exercise, conf.int = FALSE, exponentiate = FALSE)
log_odds_coef |>
kable(
digits = 3,
col.names = c("Term", "Estimate (log-odds)", "Std. Error", "z-statistic", "p-value"),
caption = "Table 3: Logistic Regression Coefficients (Log-Odds Scale)"
) |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Term | Estimate (log-odds) | Std. Error | z-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | -1.337 | 0.068 | -19.769 | 0 |
| exerciseYes | -0.555 | 0.083 | -6.655 | 0 |
# Task 2b
# Fit the model
mod_exercise <- glm(fmd ~ exercise,
data = brfss_logistic,
family = binomial(link = "logit"))
# Get odds ratios using broom::tidy
or_results <- broom::tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE)
# Create clean table using dplyr::select
or_table <- or_results |>
dplyr::mutate(
across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
p.value = round(p.value, 4)
) |>
dplyr::select(Term = term, OR = estimate, `CI Lower` = conf.low,
`CI Upper` = conf.high, `p-value` = p.value)
# Display the table
or_table |>
knitr::kable(
caption = "Table 4: Odds Ratios for Exercise and FMD"
) |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Term | OR | CI Lower | CI Upper | p-value |
|---|---|---|---|---|
| (Intercept) | 0.263 | 0.230 | 0.299 | 0 |
| exerciseYes | 0.574 | 0.488 | 0.676 | 0 |
The odds of frequent mental distress among individuals who exercised in the past 30 days are 0.574 times the odds among those who did not exercise (95% CI: 0.488 to 0.676, p < 0.001).
Since the odds ratio is less than 1 and the confidence interval does not include 1, this indicates that exercise is statistically significantly associated with lower odds of frequent mental distress. Specifically, individuals who exercise have approximately 42.6% lower odds of experiencing frequent mental distress compared to those who do not exercise (calculated as 1 - 0.574 = 0.426, or 42.6%).
# Using age as the continuous predictor
mod_age <- glm(fmd ~ age,
data = brfss_logistic,
family = binomial(link = "logit"))
# Create prediction plot
pred_age <- ggpredict(mod_age, terms = "age [18:80]")
plot_age <- plot(pred_age) +
labs(
title = "Predicted Probability of Frequent Mental Distress by Age",
subtitle = "From simple logistic regression model",
x = "Age (years)",
y = "Predicted Probability of FMD"
) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.4)) +
theme_minimal()
print(plot_age)# Alternative: Using sleep hours
mod_sleep <- glm(fmd ~ sleep_hrs,
data = brfss_logistic,
family = binomial(link = "logit"))
pred_sleep <- ggpredict(mod_sleep, terms = "sleep_hrs [3:12]")
plot_sleep <- plot(pred_sleep) +
labs(
title = "Predicted Probability of Frequent Mental Distress by Sleep Hours",
subtitle = "From simple logistic regression model",
x = "Sleep Hours (per night)",
y = "Predicted Probability of FMD"
) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
theme_minimal()
print(plot_sleep)# Function to extract OR, CI, and p-value from a model
get_or_table <- function(model, predictor_name) {
# Get coefficients and exponentiate
coef_est <- coef(model)
coef_ci <- confint(model)
p_values <- summary(model)$coefficients[, 4]
# Get the predictor coefficient (excluding intercept)
pred_idx <- which(names(coef_est) != "(Intercept)")
# Calculate OR and CI
or_value <- exp(coef_est[pred_idx])
or_ci <- exp(coef_ci[pred_idx, ])
# Create data frame
data.frame(
Predictor = predictor_name,
OR = round(or_value, 3),
CI_Lower = round(or_ci[1], 3),
CI_Upper = round(or_ci[2], 3),
p_value = round(p_values[pred_idx], 4)
)
}
# Apply function to each model
compare_or <- rbind(
get_or_table(mod_sleep, "Sleep Hours (per hour)"),
get_or_table(mod_bmi, "BMI (per unit)"),
get_or_table(mod_income, "Income Category (per unit)")
)
# Display table
compare_or |>
knitr::kable(
col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
caption = "Table 5: Comparison of Crude Odds Ratios from Simple Logistic Models"
) |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | 95% CI Lower | 95% CI Upper | p-value | |
|---|---|---|---|---|---|
| sleep_hrs | Sleep Hours (per hour) | 0.765 | 0.725 | 0.807 | 0e+00 |
| bmi | BMI (per unit) | 1.023 | 1.011 | 1.034 | 2e-04 |
| income_cat | Income Category (per unit) | 0.821 | 0.793 | 0.850 | 0e+00 |
# ============================================================================
# Task 3c: Determine which predictor has the strongest crude association
# ============================================================================
# Extract beta coefficients (log-odds) from each model
sleep_beta <- coef(mod_sleep)["sleep_hrs"]
bmi_beta <- coef(mod_bmi)["bmi"]
income_beta <- coef(mod_income)["income_cat"]
# Calculate absolute values
sleep_abs <- abs(sleep_beta)
bmi_abs <- abs(bmi_beta)
income_abs <- abs(income_beta)
# Get p-values
sleep_p <- summary(mod_sleep)$coefficients["sleep_hrs", 4]
bmi_p <- summary(mod_bmi)$coefficients["bmi", 4]
income_p <- summary(mod_income)$coefficients["income_cat", 4]
# Create comparison data frame
beta_comparison <- data.frame(
Predictor = c("Sleep Hours", "BMI", "Income Category"),
OR = c(exp(sleep_beta), exp(bmi_beta), exp(income_beta)),
Beta = round(c(sleep_beta, bmi_beta, income_beta), 4),
Abs_Beta = round(c(sleep_abs, bmi_abs, income_abs), 4),
p_value = round(c(sleep_p, bmi_p, income_p), 4)
)
# Sort by absolute beta (largest first)
beta_comparison <- beta_comparison[order(-beta_comparison$Abs_Beta), ]
# Display the table
beta_comparison |>
knitr::kable(
col.names = c("Predictor", "OR", "Beta (log-odds)", "|Beta|", "p-value"),
caption = "Table 6: Comparison of Effect Magnitudes (Absolute Beta Coefficients)"
) |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | Beta (log-odds) | |Beta| | p-value | |
|---|---|---|---|---|---|
| sleep_hrs | Sleep Hours | 0.7652127 | -0.2676 | 0.2676 | 0e+00 |
| income_cat | Income Category | 0.8213139 | -0.1968 | 0.1968 | 0e+00 |
| bmi | BMI | 1.0225243 | 0.0223 | 0.0223 | 2e-04 |
# Identify the strongest predictor
strongest <- beta_comparison$Predictor[1]
strongest_beta <- beta_comparison$Abs_Beta[1]
# Print the answer
cat("\n========================================\n")##
## ========================================
## Task 3c Answer:
## ========================================
## The predictor with the strongest crude association with FMD is: Sleep Hours
## Absolute beta coefficient: 0.2676
## Justification:
cat("- Sleep hours |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Sleep Hours"], "\n")## - Sleep hours |β| = 0.2676
## - BMI |β| = 0.0223
cat("- Income Category |β| =", beta_comparison$Abs_Beta[beta_comparison$Predictor == "Income Category"], "\n")## - Income Category |β| = 0.1968
##
## Sleep hours has the largest absolute beta coefficient, meaning each additional
## hour of sleep produces the largest change in the log-odds of FMD compared to
## a one-unit change in BMI or income category.
# 4a. Fit multiple logistic regression with at least 3 predictors (5 pts)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
data = brfss_logistic,
family = binomial(link = "logit"))# 4b. Report adjusted odds ratios using tbl_regression() (5 pts)
tbl_multi <- mod_multi |>
tbl_regression(
exponentiate = TRUE,
label = list(
exercise ~ "Exercise (past 30 days)",
age ~ "Age (per year)",
sex ~ "Sex",
smoker ~ "Smoking status",
sleep_hrs ~ "Sleep hours (per hour)",
income_cat ~ "Income category (per unit)"
),
ci = TRUE
) |>
bold_labels() |>
bold_p() |>
modify_caption("Table 7: Multiple Logistic Regression - Adjusted Odds Ratios")
tbl_multi| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise (past 30 days) | |||
| No | — | — | |
| Yes | 0.61 | 0.51, 0.73 | <0.001 |
| Age (per year) | 0.97 | 0.97, 0.98 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.67 | 1.42, 1.96 | <0.001 |
| Smoking status | |||
| Former/Never | — | — | |
| Current | 1.25 | 1.05, 1.48 | 0.012 |
| Sleep hours (per hour) | 0.81 | 0.77, 0.86 | <0.001 |
| Income category (per unit) | 0.85 | 0.82, 0.89 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
# ============================================================================
# Task 4c: Compare crude OR vs adjusted OR for one predictor
# ============================================================================
# First, make sure models are fit
# Simple model (crude)
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic, family = binomial)
# Multiple model (adjusted)
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
data = brfss_logistic, family = binomial)
# Get crude OR for exercise
crude_or <- exp(coef(mod_exercise)["exerciseYes"])
crude_ci <- exp(confint(mod_exercise)["exerciseYes", ])
crude_p <- summary(mod_exercise)$coefficients["exerciseYes", 4]
# Get adjusted OR for exercise
adj_or <- exp(coef(mod_multi)["exerciseYes"])
adj_ci <- exp(confint(mod_multi)["exerciseYes", ])
adj_p <- summary(mod_multi)$coefficients["exerciseYes", 4]
# Create comparison table
comparison_exercise <- data.frame(
Model = c("Crude (Unadjusted)", "Adjusted"),
OR = round(c(crude_or, adj_or), 3),
CI_Lower = round(c(crude_ci[1], adj_ci[1]), 3),
CI_Upper = round(c(crude_ci[2], adj_ci[2]), 3),
p_value = round(c(crude_p, adj_p), 4)
)
# Display the table
comparison_exercise |>
knitr::kable(
col.names = c("Model", "OR", "95% CI Lower", "95% CI Upper", "p-value"),
caption = "Table 8: Crude vs. Adjusted Odds Ratios for Exercise"
) |>
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | OR | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|
| Crude (Unadjusted) | 0.574 | 0.488 | 0.676 | 0 |
| Adjusted | 0.613 | 0.514 | 0.731 | 0 |
# Calculate percent change
percent_change <- ((adj_or - crude_or) / crude_or) * 100
# Print interpretation for 4d
cat("\n========================================\n")##
## ========================================
## Task 4c & 4d: Confounding Assessment
## ========================================
## Crude OR: 0.574
## Adjusted OR: 0.613
## Percent change: 6.8 %
if (abs(adj_or - crude_or) > 0.01) {
if (adj_or < crude_or) {
cat("The adjusted OR is LOWER than the crude OR.\n")
cat("This indicates that confounding variables were inflating the association.\n")
cat("After controlling for age, sex, smoking, sleep, and income,\n")
cat("the protective effect of exercise attenuated (moved toward 1).\n")
} else if (adj_or > crude_or) {
cat("The adjusted OR is HIGHER than the crude OR.\n")
cat("This indicates that confounding variables were masking the association.\n")
}
} else {
cat("The crude and adjusted ORs are similar, suggesting little confounding.\n")
}## The adjusted OR is HIGHER than the crude OR.
## This indicates that confounding variables were masking the association.
Crude OR: 0.574 Adjusted OR: 0.613 Percent change: 6.8 %
The adjusted OR is HIGHER than the crude OR. This indicates that confounding variables were masking the association.
# Likelihood Ratio Test comparing full model to null
null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)
lr_test <- anova(null_mod, mod_multi, test = "Chisq")
lr_test |>
kable(
digits = 3,
caption = "Table 10: Likelihood Ratio Test - Full Model vs. Null"
) |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 4999 | 4251.299 | NA | NA | NA |
| 4993 | 3870.197 | 6 | 381.101 | 0 |
End of Lab Activity