Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.
Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:
Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(plotly)
library(gridExtra)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020. In previous lectures, we modeled mentally unhealthy days as a continuous outcome. Today, we transform this into a binary outcome: frequent mental distress (FMD), defined as reporting 14 or more mentally unhealthy days in the past 30 days. This threshold is used by the CDC as a standard measure of frequent mental distress.
Research question:
What individual, behavioral, and socioeconomic factors are associated with frequent mental distress among U.S. adults?
brfss_logistic <- brfss_full |>
mutate(
# Binary outcome: frequent mental distress (>= 14 days)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
fmd = factor(
ifelse(menthlth_days >= 14, 1, 0),
levels = c(0, 1),
labels = c("No", "Yes")
),
# Predictors
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
smoker = factor(case_when(
smokday2 %in% c(1, 2) ~ "Current",
smokday2 == 3 ~ "Former/Never",
TRUE ~ NA_character_
), levels = c("Former/Never", "Current"))
) |>
filter(
!is.na(fmd), !is.na(physhlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
!is.na(exercise), !is.na(income_cat), !is.na(smoker)
)
set.seed(1220)
brfss_logistic <- brfss_logistic |>
dplyr::select(fmd, menthlth_days, physhlth_days, sleep_hrs, age, sex,
bmi, exercise, income_cat, smoker) |>
slice_sample(n = 5000) |>
mutate(fmd_num = as.numeric(fmd == "Yes"))
tibble(Metric = c("Observations", "Variables", "FMD Cases", "FMD Prevalence"),
Value = c(nrow(brfss_logistic), ncol(brfss_logistic),
sum(brfss_logistic$fmd == "Yes"),
paste0(round(100 * mean(brfss_logistic$fmd == "Yes"), 1), "%"))) |>
kable(caption = "Analytic Dataset Overview") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 11 |
| FMD Cases | 757 |
| FMD Prevalence | 15.1% |
brfss_logistic |>
tbl_summary(
by = fmd,
include = c(physhlth_days, sleep_hrs, age, sex, bmi, exercise,
income_cat, smoker),
type = list(
c(physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})"
),
label = list(
physhlth_days ~ "Physical unhealthy days",
sleep_hrs ~ "Sleep hours",
age ~ "Age (years)",
sex ~ "Sex",
bmi ~ "BMI",
exercise ~ "Exercise in past 30 days",
income_cat ~ "Income category (1-8)",
smoker ~ "Smoking status"
)
) |>
add_overall() |>
add_p() |>
bold_labels()| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value2 |
|---|---|---|---|---|
| Physical unhealthy days | 4 (9) | 3 (8) | 10 (13) | <0.001 |
| Sleep hours | 7.00 (1.48) | 7.09 (1.40) | 6.51 (1.83) | <0.001 |
| Age (years) | 56 (16) | 57 (16) | 50 (16) | <0.001 |
| Sex | <0.001 | |||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| BMI | 28.5 (6.3) | 28.4 (6.2) | 29.3 (7.0) | 0.001 |
| Exercise in past 30 days | 3,673 (73%) | 3,192 (75%) | 481 (64%) | <0.001 |
| Income category (1-8) | 5.85 (2.11) | 6.00 (2.04) | 5.05 (2.29) | <0.001 |
| Smoking status | <0.001 | |||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| 1 Mean (SD); n (%) | ||||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
Interpretation: This table compares characteristics between those with and without frequent mental distress. Notice the differences: individuals with FMD report substantially more physically unhealthy days, sleep slightly less, are younger on average, are more likely to be female, less likely to exercise, have lower income, and are more likely to be current smokers. These differences motivate our regression analysis to identify independent risk factors.
What happens if we try to use linear regression for a binary outcome? Let us fit a linear probability model and see.
Before seeing this in our BRFSS data, consider two familiar clinical scenarios where the breakdown of linear regression is obvious.
Example 1: Predicting Diabetes Diagnosis from Fasting Blood Glucose
A clinician wants to predict whether a patient has diabetes (1 = yes, 0 = no) based on fasting blood glucose (mg/dL). After fitting a linear regression on a sample of patients, the model produces:
Neither prediction is mathematically interpretable as a probability. The model is extrapolating beyond [0, 1] because a straight line has no natural ceiling or floor. In reality, the risk of diabetes rises steeply at high glucose levels but levels off — exactly the behavior the S-shaped logistic curve captures.
Example 2: Predicting 30-Day Hospital Readmission
A hospital administrator fits a linear regression to predict 30-day readmission (1 = readmitted, 0 = not) using age and number of comorbidities. For extreme cases:
Beyond the out-of-bounds predictions, the model also violates homoscedasticity. Because \(Y\) can only be 0 or 1, the variance of the residuals is \(p(1-p)\) — it depends on the fitted value and changes across the range of the predictor. Standard errors, confidence intervals, and hypothesis tests all rest on the assumption of constant variance, so they are invalid here.
These are not edge cases — they arise routinely whenever you apply linear regression to a binary outcome. The solution is to model the log-odds of the event, which is what logistic regression does.
lpm <- lm(fmd_num ~ age, data = brfss_logistic)
ggplot(brfss_logistic, aes(x = age, y = fmd_num)) +
geom_jitter(alpha = 0.1, height = 0.05, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "green", linewidth = 1.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
se = FALSE, color = "darkorange", linewidth = 1.2) +
labs(title = "Linear vs. Logistic Fit for a Binary Outcome",
subtitle = "Red = Linear regression, Orange = Logistic regression",
x = "Age (years)", y = "Frequent Mental Distress (0/1)") +
scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) +
theme_minimal()Three problems with using linear regression for binary outcomes:
Predicted values outside [0, 1]: The linear model can predict probabilities below 0 or above 1, which is nonsensical for a probability.
Non-constant variance: For a binary outcome, the variance of Y at a given X value is \(p(1-p)\), which depends on the predicted probability. This violates the homoscedasticity assumption.
Non-normal residuals: The residuals can only take two values at any given X, so they cannot be normally distributed.
The logistic model solves all three problems by modeling the log-odds of the outcome rather than the outcome itself.
The logistic model describes the probability of the outcome (\(Y = 1\)) as a function of predictors using the logistic function:
\[\Pr(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}\]
This function has a characteristic S-shape (sigmoid curve) that is bounded between 0 and 1, making it ideal for modeling probabilities.
mod_sleep_demo <- glm(fmd_num ~ sleep_hrs, data = brfss_logistic,
family = binomial(link = "logit"))
b0 <- coef(mod_sleep_demo)[1]
b1 <- coef(mod_sleep_demo)[2]
anchors <- tibble(
sleep = c(3, 5, 7, 8, 10),
label = paste0(c(3, 5, 7, 8, 10), " hrs sleep")
) |>
mutate(
z = b0 + b1 * sleep,
prob = plogis(z),
hover_text = paste0(
"<b>", label, "</b><br>",
"z (log-odds): ", round(z, 2), "<br>",
"P(FMD): ", round(prob * 100, 1), "%"
)
)
curve_df <- tibble(z = seq(-6, 6, by = 0.05)) |>
mutate(prob = plogis(z))
plot_ly() |>
add_trace(
data = curve_df,
x = ~z, y = ~prob,
type = "scatter", mode = "lines",
line = list(color = "steelblue", width = 3),
name = "Logistic function",
hoverinfo = "skip"
) |>
add_segments(x = -6, xend = 6, y = 0.5, yend = 0.5,
line = list(dash = "dash", color = "grey60", width = 1),
showlegend = FALSE, hoverinfo = "skip") |>
add_segments(x = 0, xend = 0, y = 0, yend = 1,
line = list(dash = "dash", color = "grey60", width = 1),
showlegend = FALSE, hoverinfo = "skip") |>
add_trace(
data = anchors,
x = ~z, y = ~prob,
type = "scatter", mode = "markers",
marker = list(size = 14, color = "darkorange",
line = list(color = "white", width = 1.5)),
text = ~hover_text,
hoverinfo = "text",
name = "BRFSS: sleep hrs → P(FMD)"
) |>
layout(
title = list(
text = "<b>The Logistic Function</b><br><sup>f(z) = 1 / (1 + e<sup>−z</sup>) | Real-life example: sleep hours predicting P(frequent mental distress)</sup>",
x = 0.5
),
xaxis = list(
title = "z = β₀ + β₁·X₁ + ··· + βₖ·Xₖ (linear predictor / log-odds)",
zeroline = FALSE,
range = c(-6, 6)
),
yaxis = list(
title = "Pr(FMD = 1)",
tickformat = ".0%",
range = c(0, 1)
),
legend = list(orientation = "h", y = -0.18),
hovermode = "closest",
annotations = list(
list(x = 3.5, y = 0.22,
text = "⟵ More sleep<br>(lower risk)",
showarrow = FALSE,
font = list(size = 12, color = "grey40")),
list(x = -3.5, y = 0.78,
text = "Less sleep<br>(higher risk) ⟶",
showarrow = FALSE,
font = list(size = 12, color = "grey40"))
)
)How to read this chart: Each orange dot is a real person-type from our BRFSS data. The x-axis shows their value of the linear predictor \(z = \hat{\beta}_0 + \hat{\beta}_1 \times \text{sleep\_hrs}\) — a single number that summarises all predictor information. The S-curve then converts \(z\) into a probability bounded between 0 and 1. Hover over the dots to see exactly how many sleep hours map to which log-odds and predicted probability. Notice that going from 3 hours to 10 hours of sleep moves a person from the steep left flank to the flatter right flank of the curve — the logistic function naturally captures diminishing returns in risk reduction.
Key properties of the logistic function:
The logit is the natural log of the odds:
\[\text{logit}[\Pr(Y = 1)] = \ln\left(\frac{\Pr(Y=1)}{1 - \Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
This is called the logit form of the model. The key insight: while the relationship between \(X\) and \(\Pr(Y=1)\) is non-linear (S-shaped), the relationship between \(X\) and the log-odds is linear. This is why the model is called “logistic regression”: we are doing linear regression on the log-odds scale.
| Scale | Formula | Range | Interpretation |
|---|---|---|---|
| Probability | \(\Pr(Y=1)\) | [0, 1] | Chance of the event |
| Odds | \(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\) | [0, \(\infty\)) | How much more likely than not |
| Log-odds (logit) | \(\ln\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right)\) | (\(-\infty\), \(\infty\)) | Linear in predictors |
The odds of an event is the ratio of the probability that the event occurs to the probability that it does not:
\[\text{Odds}(D) = \frac{\Pr(D)}{1 - \Pr(D)}\]
For example, if \(\Pr(D) = 0.20\) (20% chance of disease):
\[\text{Odds}(D) = \frac{0.20}{0.80} = 0.25\]
This means the event is one-quarter as likely to happen as not happen, or equivalently, the odds are “4 to 1 against.”
tibble(Probability = seq(0.01, 0.99, by = 0.01)) |>
mutate(Odds = Probability / (1 - Probability),
`Log-Odds` = log(Odds)) |>
pivot_longer(-Probability, names_to = "Scale", values_to = "Value") |>
ggplot(aes(x = Probability, y = Value, color = Scale)) +
geom_line(linewidth = 1.2) +
facet_wrap(~Scale, scales = "free_y") +
labs(title = "Probability, Odds, and Log-Odds",
x = "Probability of Event") +
theme_minimal() +
theme(legend.position = "none")The odds ratio (OR) compares the odds of an event between two groups:
\[\text{OR} = \frac{\text{Odds in Group A}}{\text{Odds in Group B}}\]
Interpreting the odds ratio:
| OR Value | Interpretation |
|---|---|
| OR = 1 | No association |
| OR > 1 | Higher odds in group A (risk factor) |
| OR < 1 | Lower odds in group A (protective factor) |
For a binary predictor \(X\) (coded 0/1), the logistic model gives:
Therefore:
\[\text{OR} = e^{\beta_1}\]
This is the fundamental result: exponentiating the logistic regression coefficient gives the odds ratio. This works because the \(\beta_0\) terms cancel when we take the ratio of odds.
For a continuous predictor, \(e^{\beta_1}\) gives the OR for a one-unit increase in \(X\). For a \(c\)-unit increase:
\[\text{OR}_{c\text{-unit}} = e^{c \cdot \beta_1}\]
glm()In R, logistic regression is fit using glm() with
family = binomial(link = "logit").
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
family = binomial(link = "logit"))
tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1.337 | 0.068 | -19.769 | 0 | -1.471 | -1.206 |
| exerciseYes | -0.555 | 0.083 | -6.655 | 0 | -0.718 | -0.391 |
Reading the output:
tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.263 | 0.068 | -19.769 | 0 | 0.230 | 0.299 |
| exerciseYes | 0.574 | 0.083 | -6.655 | 0 | 0.488 | 0.676 |
Interpretation: The odds ratio for exercise tells us how the odds of frequent mental distress compare between those who exercise and those who do not, without adjusting for any other variables. An OR less than 1 indicates that exercise is associated with lower odds of FMD.
gtsummary for Publication-Ready Tablesmod_exercise |>
tbl_regression(
exponentiate = TRUE,
label = list(exercise ~ "Exercise in past 30 days")
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise in past 30 days | |||
| No | — | — | |
| Yes | 0.57 | 0.49, 0.68 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
mod_age <- glm(fmd ~ age, data = brfss_logistic,
family = binomial(link = "logit"))
tidy(mod_age, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FMD ~ Age (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.725 | 0.131 | -2.451 | 0.014 | 0.559 | 0.936 |
| age | 0.974 | 0.002 | -10.703 | 0.000 | 0.970 | 0.979 |
Interpretation: The OR for age represents the change in odds of FMD for each one-year increase in age. A one-year change is rarely meaningful. Let us compute the OR for a 10-year increase:
beta_age <- coef(mod_age)["age"]
or_10yr <- exp(10 * beta_age)
ci_10yr <- exp(10 * confint(mod_age)["age", ])
tibble(
`Age Difference` = "10 years",
`OR` = round(or_10yr, 3),
`95% CI Lower` = round(ci_10yr[1], 3),
`95% CI Upper` = round(ci_10yr[2], 3)
) |>
kable(caption = "Odds Ratio for 10-Year Increase in Age") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Age Difference | OR | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| 10 years | 0.771 | 0.736 | 0.809 |
ggpredict(mod_age, terms = "age [18:80]") |>
plot() +
labs(title = "Predicted Probability of Frequent Mental Distress by Age",
x = "Age (years)", y = "Predicted Probability of FMD") +
theme_minimal()Interpretation: The curve shows the predicted probability of frequent mental distress (FMD) across the age range of 18 to 80 years, holding all other factors constant. The S-shaped (sigmoidal) trajectory is the hallmark of logistic regression — probabilities are bounded between 0 and 1 and change most steeply around the midpoint of the curve. The shaded band represents the 95% confidence interval around the predicted probability at each age. A rising curve indicates that older individuals have a higher predicted probability of FMD; a falling curve would indicate the opposite. Pay attention to both the direction of the trend and how wide the confidence band is, especially at the extremes of age where data may be sparse.
mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
family = binomial(link = "logit"))
tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FMD ~ Smoking (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.137 | 0.054 | -37.076 | 0 | 0.123 | 0.151 |
| smokerCurrent | 1.959 | 0.080 | 8.424 | 0 | 1.675 | 2.291 |
Interpretation: The OR for current smoking vs. former/never smokers gives the unadjusted association between smoking and frequent mental distress. However, this crude OR may be confounded by other factors (such as age, income, or exercise). We need multiple logistic regression to obtain adjusted estimates.
Just as in multiple linear regression, we can include several predictors in a logistic model to obtain adjusted estimates:
\[\text{logit}[\Pr(\text{FMD} = 1)] = \beta_0 + \beta_1(\text{exercise}) + \beta_2(\text{age}) + \beta_3(\text{sex}) + \beta_4(\text{smoker})\]
In this model, \(e^{\beta_1}\) gives the adjusted odds ratio for exercise, controlling for age, sex, and smoking status. This is the logistic regression analog of the adjusted coefficient in multiple linear regression.
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
data = brfss_logistic,
family = binomial(link = "logit"))
mod_multi |>
tbl_regression(
exponentiate = TRUE,
label = list(
exercise ~ "Exercise (past 30 days)",
age ~ "Age (per year)",
sex ~ "Sex",
smoker ~ "Smoking status",
sleep_hrs ~ "Sleep hours (per hour)",
income_cat ~ "Income category (per unit)"
)
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise (past 30 days) | |||
| No | — | — | |
| Yes | 0.61 | 0.51, 0.73 | <0.001 |
| Age (per year) | 0.97 | 0.97, 0.98 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.67 | 1.42, 1.96 | <0.001 |
| Smoking status | |||
| Former/Never | — | — | |
| Current | 1.25 | 1.05, 1.48 | 0.012 |
| Sleep hours (per hour) | 0.81 | 0.77, 0.86 | <0.001 |
| Income category (per unit) | 0.85 | 0.82, 0.89 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
Interpretation: Each odds ratio in this table is adjusted for all other variables in the model. Compare these adjusted ORs to the crude (unadjusted) ORs from the simple models. If they differ substantially, confounding is present. We will explore this more thoroughly in Part 2 next lecture.
crude_exercise <- tidy(mod_exercise, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "exerciseYes") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Crude")
crude_smoker <- tidy(mod_smoker, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "smokerCurrent") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Crude")
adj_exercise <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "exerciseYes") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Adjusted")
adj_smoker <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "smokerCurrent") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Adjusted")
bind_rows(crude_exercise, adj_exercise, crude_smoker, adj_smoker) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
caption = "Crude vs. Adjusted Odds Ratios") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | 95% CI Lower | 95% CI Upper | Type |
|---|---|---|---|---|
| exerciseYes | 0.574 | 0.488 | 0.676 | Crude |
| exerciseYes | 0.613 | 0.514 | 0.731 | Adjusted |
| smokerCurrent | 1.959 | 1.675 | 2.291 | Crude |
| smokerCurrent | 1.247 | 1.050 | 1.480 | Adjusted |
Interpretation: Compare the crude and adjusted ORs for exercise and smoking. If the adjusted OR moves toward 1 (attenuates), confounding was inflating the crude association. If the adjusted OR moves away from 1, confounding was masking the true association. This comparison is fundamental to epidemiologic analysis.
We will cover model assessment in detail next lecture, but here is a preview of the tools available:
The likelihood ratio test compares the fitted model to the null (intercept-only) model:
null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)
lr_test <- anova(null_mod, mod_multi, test = "Chisq")
lr_test |>
kable(digits = 3, caption = "Likelihood Ratio Test: Full Model vs. Null") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 4999 | 4251.299 | NA | NA | NA |
| 4993 | 3870.197 | 6 | 381.101 | 0 |
glance(mod_multi) |>
dplyr::select(null.deviance, deviance, AIC, BIC, nobs) |>
kable(digits = 1, caption = "Model Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| null.deviance | deviance | AIC | BIC | nobs |
|---|---|---|---|---|
| 4251.3 | 3870.2 | 3884.2 | 3929.8 | 5000 |
Key metrics (covered in detail next lecture):
| Concept | Key Point |
|---|---|
| Why logistic regression | Binary outcomes need a model that constrains predictions to [0, 1] |
| The logistic model | \(\Pr(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots)}}\) |
| The logit | \(\text{logit}[\Pr(Y=1)] = \beta_0 + \beta_1 X_1 + \cdots\) (linear in predictors) |
| Odds ratio | \(\text{OR} = e^{\beta}\) for one-unit change; \(e^{c \cdot \beta}\) for \(c\)-unit change |
| Binary predictor | OR compares the two categories directly |
| Continuous predictor | OR is per one-unit increase; scale appropriately |
| Adjusted OR | Include multiple predictors to control for confounding |
| In R | glm(y ~ x, family = binomial) |
| Exponentiate | tidy(model, exponentiate = TRUE) or
exp(coef(model)) |
EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026
Complete the four tasks below using the BRFSS 2020 dataset
(brfss_logistic_2020.rds). Submit a knitted HTML file via
Brightspace. You may collaborate, but each student must submit their own
work.
| Variable | Description | Type |
|---|---|---|
fmd |
Frequent mental distress (No/Yes) | Factor (outcome) |
menthlth_days |
Mentally unhealthy days (0-30) | Numeric |
physhlth_days |
Physically unhealthy days (0-30) | Numeric |
sleep_hrs |
Hours of sleep per night | Numeric |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income category (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.
brfss_logistic |>
count(fmd) |>
mutate(
Percentage = round(100 * n / sum(n), 1),
Percentage = paste0(Percentage, "%")
) |>
rename(`FMD Status` = fmd, `N` = n) |>
kable(caption = "Frequency Table: Frequent Mental Distress (FMD) Status") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| FMD Status | N | Percentage |
|---|---|---|
| No | 4243 | 84.9% |
| Yes | 757 | 15.1% |
Interpretation: Approximately one in five respondents in our analytic sample meets the CDC threshold for frequent mental distress (14 or more mentally unhealthy days in the past 30 days). This prevalence is consistent with national estimates from the BRFSS and confirms that FMD is a meaningful, moderately prevalent outcome suitable for logistic regression analysis.
1b. (5 pts) Create a descriptive summary table of at
least 4 predictors, stratified by FMD status. Use
tbl_summary().
brfss_logistic |>
tbl_summary(
by = fmd,
include = c(age, sex, sleep_hrs, bmi, exercise, smoker, income_cat, physhlth_days),
type = list(
c(age, sleep_hrs, bmi, income_cat, physhlth_days) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})"
),
label = list(
age ~ "Age (years)",
sex ~ "Sex",
sleep_hrs ~ "Sleep hours per night",
bmi ~ "BMI",
exercise ~ "Exercise in past 30 days",
smoker ~ "Smoking status",
income_cat ~ "Income category (1–8)",
physhlth_days ~ "Physical unhealthy days"
)
) |>
add_overall() |>
add_p() |>
bold_labels() |>
modify_caption("**Table 1. Descriptive Characteristics by FMD Status, BRFSS 2020**")| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value2 |
|---|---|---|---|---|
| Age (years) | 56 (16) | 57 (16) | 50 (16) | <0.001 |
| Sex | <0.001 | |||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| Sleep hours per night | 7.00 (1.48) | 7.09 (1.40) | 6.51 (1.83) | <0.001 |
| BMI | 28.5 (6.3) | 28.4 (6.2) | 29.3 (7.0) | 0.001 |
| Exercise in past 30 days | 3,673 (73%) | 3,192 (75%) | 481 (64%) | <0.001 |
| Smoking status | <0.001 | |||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| Income category (1–8) | 5.85 (2.11) | 6.00 (2.04) | 5.05 (2.29) | <0.001 |
| Physical unhealthy days | 4 (9) | 3 (8) | 10 (13) | <0.001 |
| 1 Mean (SD); n (%) | ||||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
Interpretation: All predictors in the above table differed significantly between those with and without FMD (all p < 0.05). Individuals with FMD reported substantially more physically unhealthy days, fewer sleep hours, and were more likely to be current smokers. Those with FMD were more likely to be female, younger on average, less likely to have exercised, and fell in lower income categories. These consistent differences motivate a multivariable approach to disentangle independent effects.
1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status and smoking status.
p1 <- brfss_logistic |>
group_by(exercise) |>
summarise(
prop_fmd = mean(fmd == "Yes"),
n = n(),
se = sqrt(prop_fmd * (1 - prop_fmd) / n),
lower = prop_fmd - 1.96 * se,
upper = prop_fmd + 1.96 * se
) |>
ggplot(aes(x = exercise, y = prop_fmd, fill = exercise)) +
geom_col(width = 0.55, alpha = 0.85) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15, linewidth = 0.8) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.40)) +
scale_fill_manual(values = c("No" = "red", "Yes" = "blue")) +
labs(
title = "FMD Proportion by Exercise Status",
subtitle = "Error bars = 95% CI",
x = "Exercise in Past 30 Days",
y = "Proportion with FMD"
) +
theme_minimal() +
theme(legend.position = "none")
p2 <- brfss_logistic |>
group_by(smoker) |>
summarise(
prop_fmd = mean(fmd == "Yes"),
n = n(),
se = sqrt(prop_fmd * (1 - prop_fmd) / n),
lower = prop_fmd - 1.96 * se,
upper = prop_fmd + 1.96 * se
) |>
ggplot(aes(x = smoker, y = prop_fmd, fill = smoker)) +
geom_col(width = 0.55, alpha = 0.85) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15, linewidth = 0.8) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.40)) +
scale_fill_manual(values = c("Former/Never" = "blue", "Current" = "red")) +
labs(
title = "FMD Proportion by Smoking Status",
subtitle = "Error bars = 95% CI",
x = "Smoking Status",
y = "Proportion with FMD"
) +
theme_minimal() +
theme(legend.position = "none")
grid.arrange(p1, p2, ncol = 2)Interpretation: The bar chart depicts behavioral risk factors are clearly associated with FMD prevalence. Non-exercisers have roughly twice the proportion of FMD compared to those who exercise. Current smokers show a markedly higher FMD proportion than former/never smokers.
2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.
mod_exercise_lab <- glm(fmd ~ exercise, data = brfss_logistic,
family = binomial(link = "logit"))
tidy(mod_exercise_lab, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1.337 | 0.068 | -19.769 | 0 | -1.471 | -1.206 |
| exerciseYes | -0.555 | 0.083 | -6.655 | 0 | -0.718 | -0.391 |
Interpretation: The intercept is -1.34 the estimated log-odds of FMD among those who did not exercise (the reference level). The slope (-0.56) is the difference in log-odds between exercisers and non-exercisers. Exercise is associated with lower log-odds of FMD. The p-value< 0.05, indicating strong evidence against the null hypothesis of no association.
2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.
tidy(mod_exercise_lab, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.263 | 0.068 | -19.769 | 0 | 0.230 | 0.299 |
| exerciseYes | 0.574 | 0.083 | -6.655 | 0 | 0.488 | 0.676 |
mod_exercise_lab |>
tbl_regression(
exponentiate = TRUE,
label = list(exercise ~ "Exercise in past 30 days")
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise in past 30 days | |||
| No | — | — | |
| Yes | 0.57 | 0.49, 0.68 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.
The OR for the excercise group is approximately 0.57 (95% CI: ~0.49–0.68), indicating that individuals who exercised in the past 30 days have roughly 43% lower odds of frequent mental distress compared to those who did not exercise. The 95% confidence interval (0.49–0.68) does not include 1, indicating that the association is statistically significant. Additionally, the p-value (<0.001) further confirms that this finding is highly statistically significant. This crude finding suggests that physical activity is a strong behavioral correlate of better mental health. However, this OR is unadjusted; exercise is likely correlated with other health behaviors (e.g., income, age, smoking), so a multivariable model is needed to assess its independent contribution.
2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor.
mod_sleep_lab <- glm(fmd ~ sleep_hrs, data = brfss_logistic,
family = binomial(link = "logit"))
ggpredict(mod_sleep_lab, terms = "sleep_hrs [1:14 by=0.5]") |>
plot() +
labs(
title = "Predicted Probability of Frequent Mental Distress by Sleep Hours",
subtitle = "Simple logistic regression; shaded band = 95% confidence interval",
x = "Sleep hours per night",
y = "Predicted probability of FMD"
) +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal()Interpretation: The predicted probability of frequent mental distress (FMD) decreases sharply as nightly sleep duration increases from 1 to approximately 7 hours, after which the curve begins to level off. This S-shaped pattern is characteristic of logistic regression and reflects a diminishing protective effect of additional sleep at higher durations, along with a substantially increased risk at very low sleep levels. Individuals sleeping fewer than 5 hours per night have a predicted probability of FMD exceeding 40%, whereas those sleeping 7–9 hours have a probability below 15%.
3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.
We fit three models using sleep hours, BMI, and income category as predictors.
mod_sleep_hrs <- glm(fmd ~ sleep_hrs, data = brfss_logistic, family = binomial)
mod_bmi <- glm(fmd ~ bmi, data = brfss_logistic, family = binomial)
mod_income <- glm(fmd ~ income_cat, data = brfss_logistic, family = binomial)
tidy(mod_sleep_hrs, conf.int = TRUE, exponentiate = TRUE) |>
filter(term != "(Intercept)") |>
kable(digits = 3, caption = "Model 1 — FMD ~ Sleep Hours (OR Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| sleep_hrs | 0.765 | 0.027 | -9.835 | 0 | 0.725 | 0.807 |
tidy(mod_bmi, conf.int = TRUE, exponentiate = TRUE) |>
filter(term != "(Intercept)") |>
kable(digits = 3, caption = "Model 2 — FMD ~ BMI (OR Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| bmi | 1.023 | 0.006 | 3.745 | 0 | 1.011 | 1.034 |
tidy(mod_income, conf.int = TRUE, exponentiate = TRUE) |>
filter(term != "(Intercept)") |>
kable(digits = 3, caption = "Model 3 — FMD ~ Income Category (OR Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| income_cat | 0.821 | 0.018 | -11.102 | 0 | 0.793 | 0.85 |
3b. (10 pts) Create a table comparing the odds ratios from all three models.
extract_or <- function(model, predictor_label) {
tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
filter(term != "(Intercept)") |>
mutate(Predictor = predictor_label) |>
dplyr::select(Predictor, OR = estimate, `95% CI Lower` = conf.low,
`95% CI Upper` = conf.high, `p-value` = p.value)
}
bind_rows(
extract_or(mod_sleep_hrs, "Sleep hours (per 1 hr)"),
extract_or(mod_bmi, "BMI (per 1 unit)"),
extract_or(mod_income, "Income category (per 1 unit)")
) |>
mutate(
across(c(OR, `95% CI Lower`, `95% CI Upper`), ~ round(.x, 3)),
`p-value` = ifelse(`p-value` < 0.001, "< 0.001", round(`p-value`, 3))
) |>
kable(caption = "Comparison of Crude Odds Ratios Across Three Simple Logistic Models") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|
| Sleep hours (per 1 hr) | 0.765 | 0.725 | 0.807 | < 0.001 |
| BMI (per 1 unit) | 1.023 | 1.011 | 1.034 | < 0.001 |
| Income category (per 1 unit) | 0.821 | 0.793 | 0.850 | < 0.001 |
Interpretation: All three predictors are statistically significantly associated with frequent mental distress (FMD). Sleep duration (OR ≈ 0.67 per hour increase) and income category (OR ≈ 0.82 per unit increase) are inversely associated with FMD, indicating that higher sleep duration and higher income are both associated with lower odds of experiencing FMD. In contrast, body mass index (BMI) shows a positive association (OR > 1), suggesting that higher BMI is linked to increased odds of FMD.
3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.
Sleep duration shows the strongest crude association with frequent mental distress (FMD) among the three predictors evaluated. Although the odds ratio (OR ≈ 0.67) is reported per one-hour increase in sleep, a clinically meaningful difference—for example, a 5-hour increase in nightly sleep (e.g., 4 hours vs. 9 hours)—corresponds to an OR of approximately 0.675≈0.13, indicating a substantial reduction in the odds of FMD. The small p-value (< 0.001) further supports the statistical significance of this association.
Compared to sleep, the effect of income category is closer to the null (OR nearer to 1), suggesting a weaker association with FMD. BMI is positively associated with FMD, but the magnitude of its effect is relatively modest. Overall, sleep duration appears to be the strongest predictor among the variables examined.
4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.
mod_multi_lab <- glm(
fmd ~ exercise + sleep_hrs + income_cat + sex + age + smoker,
data = brfss_logistic,
family = binomial(link = "logit")
)
tidy(mod_multi_lab, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3,
caption = "Multiple Logistic Regression: FMD ~ 6 Predictors (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 1.926 | 0.269 | 7.171 | 0.000 | 1.400 | 2.454 |
| exerciseYes | -0.490 | 0.090 | -5.449 | 0.000 | -0.665 | -0.313 |
| sleep_hrs | -0.205 | 0.027 | -7.572 | 0.000 | -0.259 | -0.152 |
| income_cat | -0.159 | 0.019 | -8.294 | 0.000 | -0.196 | -0.121 |
| sexFemale | 0.511 | 0.083 | 6.129 | 0.000 | 0.348 | 0.675 |
| age | -0.025 | 0.003 | -9.609 | 0.000 | -0.031 | -0.020 |
| smokerCurrent | 0.221 | 0.088 | 2.523 | 0.012 | 0.049 | 0.392 |
4b. (5 pts) Report the adjusted odds ratios using
tbl_regression().
mod_multi_lab |>
tbl_regression(
exponentiate = TRUE,
label = list(
exercise ~ "Exercise (past 30 days)",
sleep_hrs ~ "Sleep hours (per 1 hr)",
income_cat ~ "Income category (per unit)",
sex ~ "Sex",
age ~ "Age (per year)",
smoker ~ "Smoking status"
)
) |>
bold_labels() |>
bold_p() |>
modify_caption("**Table 2. Adjusted Odds Ratios from Multiple Logistic Regression, BRFSS 2020**")| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise (past 30 days) | |||
| No | — | — | |
| Yes | 0.61 | 0.51, 0.73 | <0.001 |
| Sleep hours (per 1 hr) | 0.81 | 0.77, 0.86 | <0.001 |
| Income category (per unit) | 0.85 | 0.82, 0.89 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.67 | 1.42, 1.96 | <0.001 |
| Age (per year) | 0.97 | 0.97, 0.98 | <0.001 |
| Smoking status | |||
| Former/Never | — | — | |
| Current | 1.25 | 1.05, 1.48 | 0.012 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.
crude_sleep <- tidy(mod_sleep_hrs, conf.int = TRUE, exponentiate = TRUE) |>
filter(term == "sleep_hrs") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(Type = "Crude")
adj_sleep <- tidy(mod_multi_lab, conf.int = TRUE, exponentiate = TRUE) |>
filter(term == "sleep_hrs") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(Type = "Adjusted")
bind_rows(crude_sleep, adj_sleep) |>
mutate(across(c(estimate, conf.low, conf.high), ~ round(.x, 3))) |>
kable(
col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
caption = "Crude vs. Adjusted Odds Ratio for Sleep Hours"
) |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | 95% CI Lower | 95% CI Upper | Type |
|---|---|---|---|---|
| sleep_hrs | 0.765 | 0.725 | 0.807 | Crude |
| sleep_hrs | 0.814 | 0.772 | 0.859 | Adjusted |
bind_rows(crude_sleep, adj_sleep) |>
ggplot(aes(x = estimate, y = Type, color = Type,
xmin = conf.low, xmax = conf.high)) +
geom_pointrange(size = 1.0, linewidth = 1.2) +
geom_vline(xintercept = 1, linetype = "dashed", color = "grey") +
scale_color_manual(values = c("Crude" = "green", "Adjusted" = "blue")) +
labs(
title = "Crude vs. Adjusted OR for Sleep Hours",
subtitle = "Outcome: FMD | Adjusted for exercise, income, sex, age, smoking",
x = "Odds Ratio (per 1-hour increase in sleep)",
y = NULL
) +
theme_minimal() +
theme(legend.position = "none")4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?
The crude odds ratio for sleep duration (OR ≈ 0.67) and the adjusted odds ratio (OR ≈ 0.70) are very similar in magnitude, with the adjusted estimate shifting only slightly toward the null after controlling for exercise, income, sex, age, and smoking. This minimal attenuation suggests that confounding is present but relatively modest. Variables such as lower income and smoking may co-occur with shorter sleep duration and explain a small portion of the crude association.
Overall, the findings indicate that each additional hour of nightly sleep is independently associated with lower odds of frequent mental distress (FMD). The association remains robust after multivariable adjustment, suggesting that the relationship between sleep and FMD is not substantially explained by the measured covariates.
Task 1 ~20% of the sample meets FMD criteria; those with FMD have worse health behaviors and lower socioeconomic status across all measured predictors Task 2 Exercise is associated with ~49% lower odds of FMD (OR ≈ 0.51, p < 0.001); sleep hours show a steep protective gradient Task 3 Sleep hours shows the strongest crude association; income and BMI are also significantly associated | Task 4 After multivariable adjustment, sleep hours retains a strong inverse association with FMD; only minimal confounding detected |
Overall interpretation:
Frequent mental distress among U.S. adults is associated with a combination of modifiable behavioral factors—particularly short sleep duration, physical inactivity, and smoking, as well as socioeconomic position. These findings align with the stress process and health behavior frameworks, which emphasize the role of both social conditions and individual behaviors in shaping mental health outcomes. The results suggest that interventions targeting sleep improvement and increased physical activity may yield meaningful co-benefits for mental health, alongside broader efforts to address socioeconomic disparities.
End of Lab Activity