Introduction

Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.

Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:

  1. Understand why linear regression fails for binary outcomes
  2. Learn the logistic model and the logit transformation
  3. Connect logistic regression to the odds ratio, the primary measure of effect in epidemiology
  4. Fit and interpret a simple logistic regression model in R
  5. Extend to multiple logistic regression (introduced today, continued next lecture)

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)


Why/When/Watch-out

Why use logistic regression?

  • The outcome variable is binary (0/1, yes/no, disease/no disease)
  • We want to estimate the probability of an event occurring
  • We need adjusted odds ratios that control for confounders
  • We want to identify risk factors for a dichotomous health outcome

When to use it?

  • Cross-sectional studies: prevalence of a condition
  • Case-control studies: odds of exposure given disease status
  • Cohort studies: incidence of disease (when outcome is rare, OR approximates RR)

Watch out for:

  • Separation: when a predictor perfectly predicts the outcome, ML estimation fails
  • Small sample sizes: logistic regression needs roughly 10-20 events per predictor
  • Multicollinearity: same issues as in linear regression
  • Linearity in the logit: the relationship between continuous predictors and the log-odds must be linear

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020. In previous lectures, we modeled mentally unhealthy days as a continuous outcome. Today, we transform this into a binary outcome: frequent mental distress (FMD), defined as reporting 14 or more mentally unhealthy days in the past 30 days. This threshold is used by the CDC as a standard measure of frequent mental distress.

Research question:

What individual, behavioral, and socioeconomic factors are associated with frequent mental distress among U.S. adults?

brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()
brfss_logistic <- brfss_full |>
  mutate(
    # Binary outcome: frequent mental distress (>= 14 days)
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    fmd = factor(
      ifelse(menthlth_days >= 14, 1, 0),
      levels = c(0, 1),
      labels = c("No", "Yes")
    ),
    # Predictors
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    smoker = factor(case_when(
      smokday2 %in% c(1, 2) ~ "Current",
      smokday2 == 3          ~ "Former/Never",
      TRUE                   ~ NA_character_
    ), levels = c("Former/Never", "Current"))
  ) |>
  filter(
    !is.na(fmd), !is.na(physhlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
    !is.na(exercise), !is.na(income_cat), !is.na(smoker)
  )

set.seed(1220)
brfss_logistic <- brfss_logistic |>
  dplyr::select(fmd, menthlth_days, physhlth_days, sleep_hrs, age, sex,
                bmi, exercise, income_cat, smoker) |>
  slice_sample(n = 5000)

# Save for lab use
saveRDS(brfss_logistic,
  "brfss_logistic_2020.rds")

tibble(Metric = c("Observations", "Variables", "FMD Cases", "FMD Prevalence"),
       Value  = c(nrow(brfss_logistic), ncol(brfss_logistic),
                  sum(brfss_logistic$fmd == "Yes"),
                  paste0(round(100 * mean(brfss_logistic$fmd == "Yes"), 1), "%"))) |>
  kable(caption = "Analytic Dataset Overview") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Overview
Metric Value
Observations 5000
Variables 10
FMD Cases 757
FMD Prevalence 15.1%

Descriptive Overview

brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(physhlth_days, sleep_hrs, age, sex, bmi, exercise,
                income_cat, smoker),
    type = list(
      c(physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs     ~ "Sleep hours",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      bmi           ~ "BMI",
      exercise      ~ "Exercise in past 30 days",
      income_cat    ~ "Income category (1-8)",
      smoker        ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels()
Characteristic Overall
N = 5,000
1
No
N = 4,243
1
Yes
N = 757
1
p-value2
Physical unhealthy days 4 (9) 3 (8) 10 (13) <0.001
Sleep hours 7.00 (1.48) 7.09 (1.40) 6.51 (1.83) <0.001
Age (years) 56 (16) 57 (16) 50 (16) <0.001
Sex


<0.001
    Male 2,701 (54%) 2,378 (56%) 323 (43%)
    Female 2,299 (46%) 1,865 (44%) 434 (57%)
BMI 28.5 (6.3) 28.4 (6.2) 29.3 (7.0) 0.001
Exercise in past 30 days 3,673 (73%) 3,192 (75%) 481 (64%) <0.001
Income category (1-8) 5.85 (2.11) 6.00 (2.04) 5.05 (2.29) <0.001
Smoking status


<0.001
    Former/Never 3,280 (66%) 2,886 (68%) 394 (52%)
    Current 1,720 (34%) 1,357 (32%) 363 (48%)
1 Mean (SD); n (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test

Interpretation: This table compares characteristics between those with and without frequent mental distress. Notice the differences: individuals with FMD report substantially more physically unhealthy days, sleep slightly less, are younger on average, are more likely to be female, less likely to exercise, have lower income, and are more likely to be current smokers. These differences motivate our regression analysis to identify independent risk factors.


Part 1: Why Linear Regression Fails for Binary Outcomes

The Problem

What happens if we try to use linear regression for a binary outcome? Let us fit a linear probability model and see.

# Create numeric version of outcome
brfss_logistic <- brfss_logistic |>
  mutate(fmd_num = as.numeric(fmd == "Yes"))

# Fit linear regression on binary outcome
lpm <- lm(fmd_num ~ age, data = brfss_logistic)

ggplot(brfss_logistic, aes(x = age, y = fmd_num)) +
  geom_jitter(alpha = 0.1, height = 0.05, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE, color = "darkorange", linewidth = 1.2) +
  labs(title = "Linear vs. Logistic Fit for a Binary Outcome",
       subtitle = "Red = Linear regression, Orange = Logistic regression",
       x = "Age (years)", y = "Frequent Mental Distress (0/1)") +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) +
  theme_minimal()

Three problems with using linear regression for binary outcomes:

  1. Predicted values outside [0, 1]: The linear model can predict probabilities below 0 or above 1, which is nonsensical for a probability.

  2. Non-constant variance: For a binary outcome, the variance of Y at a given X value is \(p(1-p)\), which depends on the predicted probability. This violates the homoscedasticity assumption.

  3. Non-normal residuals: The residuals can only take two values at any given X, so they cannot be normally distributed.

The logistic model solves all three problems by modeling the log-odds of the outcome rather than the outcome itself.


Part 2: The Logistic Model

The Logistic Function

The logistic model describes the probability of the outcome (\(Y = 1\)) as a function of predictors using the logistic function:

\[\Pr(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}\]

This function has a characteristic S-shape (sigmoid curve) that is bounded between 0 and 1, making it ideal for modeling probabilities.

# Illustrate the logistic function
tibble(z = seq(-6, 6, by = 0.1)) |>
  mutate(probability = 1 / (1 + exp(-z))) |>
  ggplot(aes(x = z, y = probability)) +
  geom_line(color = "steelblue", linewidth = 1.5) +
  geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(title = "The Logistic Function",
       subtitle = expression(f(z) == frac(1, 1 + e^{-z})),
       x = expression(z == beta[0] + beta[1]*X[1] + cdots + beta[k]*X[k]),
       y = "Pr(Y = 1)") +
  annotate("text", x = 3, y = 0.3, label = "Risk increases\nwith z",
           size = 4, color = "grey40") +
  theme_minimal()

Key properties of the logistic function:

  • As \(z \to -\infty\), \(\Pr(Y=1) \to 0\)
  • As \(z \to +\infty\), \(\Pr(Y=1) \to 1\)
  • When \(z = 0\), \(\Pr(Y=1) = 0.5\)
  • The function is steepest around \(z = 0\)

The Logit Transformation

The logit is the natural log of the odds:

\[\text{logit}[\Pr(Y = 1)] = \ln\left(\frac{\Pr(Y=1)}{1 - \Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]

This is called the logit form of the model. The key insight: while the relationship between \(X\) and \(\Pr(Y=1)\) is non-linear (S-shaped), the relationship between \(X\) and the log-odds is linear. This is why the model is called “logistic regression”: we are doing linear regression on the log-odds scale.

Scale Formula Range Interpretation
Probability \(\Pr(Y=1)\) [0, 1] Chance of the event
Odds \(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\) [0, \(\infty\)) How much more likely than not
Log-odds (logit) \(\ln\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right)\) (\(-\infty\), \(\infty\)) Linear in predictors

Part 3: Odds and Odds Ratios

What Are Odds?

The odds of an event is the ratio of the probability that the event occurs to the probability that it does not:

\[\text{Odds}(D) = \frac{\Pr(D)}{1 - \Pr(D)}\]

For example, if \(\Pr(D) = 0.20\) (20% chance of disease):

\[\text{Odds}(D) = \frac{0.20}{0.80} = 0.25\]

This means the event is one-quarter as likely to happen as not happen, or equivalently, the odds are “4 to 1 against.”

tibble(Probability = seq(0.01, 0.99, by = 0.01)) |>
  mutate(Odds = Probability / (1 - Probability),
         `Log-Odds` = log(Odds)) |>
  pivot_longer(-Probability, names_to = "Scale", values_to = "Value") |>
  ggplot(aes(x = Probability, y = Value, color = Scale)) +
  geom_line(linewidth = 1.2) +
  facet_wrap(~Scale, scales = "free_y") +
  labs(title = "Probability, Odds, and Log-Odds",
       x = "Probability of Event") +
  theme_minimal() +
  theme(legend.position = "none")

The Odds Ratio

The odds ratio (OR) compares the odds of an event between two groups:

\[\text{OR} = \frac{\text{Odds in Group A}}{\text{Odds in Group B}}\]

Interpreting the odds ratio:

OR Value Interpretation
OR = 1 No association
OR > 1 Higher odds in group A (risk factor)
OR < 1 Lower odds in group A (protective factor)

Connecting OR to Logistic Regression

For a binary predictor \(X\) (coded 0/1), the logistic model gives:

  • Log-odds when \(X = 1\): \(\beta_0 + \beta_1\)
  • Log-odds when \(X = 0\): \(\beta_0\)
  • Difference in log-odds: \(\beta_1\)

Therefore:

\[\text{OR} = e^{\beta_1}\]

This is the fundamental result: exponentiating the logistic regression coefficient gives the odds ratio. This works because the \(\beta_0\) terms cancel when we take the ratio of odds.

For a continuous predictor, \(e^{\beta_1}\) gives the OR for a one-unit increase in \(X\). For a \(c\)-unit increase:

\[\text{OR}_{c\text{-unit}} = e^{c \cdot \beta_1}\]


Part 4: Simple Logistic Regression in R

Fitting the Model with glm()

In R, logistic regression is fit using glm() with family = binomial(link = "logit").

Example 1: Binary predictor (exercise)

mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
                    family = binomial(link = "logit"))

tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -1.337 0.068 -19.769 0 -1.471 -1.206
exerciseYes -0.555 0.083 -6.655 0 -0.718 -0.391

Reading the output:

  • The intercept (\(\hat{\beta}_0\)) is the log-odds of FMD when exercise = “No” (the reference level)
  • The slope (\(\hat{\beta}_1\)) is the difference in log-odds between exercisers and non-exercisers
  • The p-value tests \(H_0: \beta_1 = 0\) (i.e., OR = 1, no association)

Converting to Odds Ratios

To get odds ratios, we exponentiate the coefficients:

tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.263 0.068 -19.769 0 0.230 0.299
exerciseYes 0.574 0.083 -6.655 0 0.488 0.676

Interpretation: The odds ratio for exercise tells us how the odds of frequent mental distress compare between those who exercise and those who do not, without adjusting for any other variables. An OR less than 1 indicates that exercise is associated with lower odds of FMD.

Using gtsummary for Publication-Ready Tables

mod_exercise |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(exercise ~ "Exercise in past 30 days")
  ) |>
  bold_labels() |>
  bold_p()
Characteristic OR 95% CI p-value
Exercise in past 30 days


    No
    Yes 0.57 0.49, 0.68 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Example 2: Continuous predictor (age)

mod_age <- glm(fmd ~ age, data = brfss_logistic,
               family = binomial(link = "logit"))

tidy(mod_age, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Age (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Simple Logistic Regression: FMD ~ Age (Odds Ratio Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.725 0.131 -2.451 0.014 0.559 0.936
age 0.974 0.002 -10.703 0.000 0.970 0.979

Interpretation: The OR for age represents the change in odds of FMD for each one-year increase in age. A one-year change is rarely meaningful. Let us compute the OR for a 10-year increase:

beta_age <- coef(mod_age)["age"]
or_10yr <- exp(10 * beta_age)
ci_10yr <- exp(10 * confint(mod_age)["age", ])

tibble(
  `Age Difference` = "10 years",
  `OR` = round(or_10yr, 3),
  `95% CI Lower` = round(ci_10yr[1], 3),
  `95% CI Upper` = round(ci_10yr[2], 3)
) |>
  kable(caption = "Odds Ratio for 10-Year Increase in Age") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Odds Ratio for 10-Year Increase in Age
Age Difference OR 95% CI Lower 95% CI Upper
10 years 0.771 0.736 0.809

Interpretation: The OR for a 10-year increase in age tells us how the odds of frequent mental distress change when comparing two individuals who differ by 10 years of age.

Visualizing the Predicted Probabilities

ggpredict(mod_age, terms = "age [18:80]") |>
  plot() +
  labs(title = "Predicted Probability of Frequent Mental Distress by Age",
       x = "Age (years)", y = "Predicted Probability of FMD") +
  theme_minimal()

Example 3: Binary predictor (smoking)

mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
                  family = binomial(link = "logit"))

tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Smoking (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Simple Logistic Regression: FMD ~ Smoking (Odds Ratio Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.137 0.054 -37.076 0 0.123 0.151
smokerCurrent 1.959 0.080 8.424 0 1.675 2.291

Interpretation: The OR for current smoking vs. former/never smokers gives the unadjusted association between smoking and frequent mental distress. However, this crude OR may be confounded by other factors (such as age, income, or exercise). We need multiple logistic regression to obtain adjusted estimates.


Part 5: Multiple Logistic Regression (Introduction)

The Adjusted Odds Ratio

Just as in multiple linear regression, we can include several predictors in a logistic model to obtain adjusted estimates:

\[\text{logit}[\Pr(\text{FMD} = 1)] = \beta_0 + \beta_1(\text{exercise}) + \beta_2(\text{age}) + \beta_3(\text{sex}) + \beta_4(\text{smoker})\]

In this model, \(e^{\beta_1}\) gives the adjusted odds ratio for exercise, controlling for age, sex, and smoking status. This is the logistic regression analog of the adjusted coefficient in multiple linear regression.

mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))

mod_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise   ~ "Exercise (past 30 days)",
      age        ~ "Age (per year)",
      sex        ~ "Sex",
      smoker     ~ "Smoking status",
      sleep_hrs  ~ "Sleep hours (per hour)",
      income_cat ~ "Income category (per unit)"
    )
  ) |>
  bold_labels() |>
  bold_p()
Characteristic OR 95% CI p-value
Exercise (past 30 days)


    No
    Yes 0.61 0.51, 0.73 <0.001
Age (per year) 0.97 0.97, 0.98 <0.001
Sex


    Male
    Female 1.67 1.42, 1.96 <0.001
Smoking status


    Former/Never
    Current 1.25 1.05, 1.48 0.012
Sleep hours (per hour) 0.81 0.77, 0.86 <0.001
Income category (per unit) 0.85 0.82, 0.89 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: Each odds ratio in this table is adjusted for all other variables in the model. Compare these adjusted ORs to the crude (unadjusted) ORs from the simple models. If they differ substantially, confounding is present. We will explore this more thoroughly in Part 2 next lecture.

Crude vs. Adjusted Comparison

# Crude ORs from simple models
crude_exercise <- tidy(mod_exercise, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

crude_smoker <- tidy(mod_smoker, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

# Adjusted ORs from multiple model
adj_exercise <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted")

adj_smoker <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted")

bind_rows(crude_exercise, adj_exercise, crude_smoker, adj_smoker) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
        caption = "Crude vs. Adjusted Odds Ratios") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Crude vs. Adjusted Odds Ratios
Predictor OR 95% CI Lower 95% CI Upper Type
exerciseYes 0.574 0.488 0.676 Crude
exerciseYes 0.613 0.514 0.731 Adjusted
smokerCurrent 1.959 1.675 2.291 Crude
smokerCurrent 1.247 1.050 1.480 Adjusted

Interpretation: Compare the crude and adjusted ORs for exercise and smoking. If the adjusted OR moves toward 1 (attenuates), confounding was inflating the crude association. If the adjusted OR moves away from 1, confounding was masking the true association. This comparison is fundamental to epidemiologic analysis.


Part 6: Model Assessment (Preview)

We will cover model assessment in detail next lecture, but here is a preview of the tools available:

Overall Model Significance

The likelihood ratio test compares the fitted model to the null (intercept-only) model:

null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)

# Likelihood ratio test
lr_test <- anova(null_mod, mod_multi, test = "Chisq")

lr_test |>
  kable(digits = 3, caption = "Likelihood Ratio Test: Full Model vs. Null") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Likelihood Ratio Test: Full Model vs. Null
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
4999 4251.299 NA NA NA
4993 3870.197 6 381.101 0

Model Fit Statistics

glance(mod_multi) |>
  dplyr::select(null.deviance, deviance, AIC, BIC, nobs) |>
  kable(digits = 1, caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Fit Statistics
null.deviance deviance AIC BIC nobs
4251.3 3870.2 3884.2 3929.8 5000

Key metrics (covered in detail next lecture):

  • Deviance: analogous to RSS in linear regression; lower is better
  • AIC/BIC: for model comparison; lower is better
  • Hosmer-Lemeshow test: goodness-of-fit test (next lecture)
  • ROC curve and AUC: discrimination ability (next lecture)

Summary

Concept Key Point
Why logistic regression Binary outcomes need a model that constrains predictions to [0, 1]
The logistic model \(\Pr(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots)}}\)
The logit \(\text{logit}[\Pr(Y=1)] = \beta_0 + \beta_1 X_1 + \cdots\) (linear in predictors)
Odds ratio \(\text{OR} = e^{\beta}\) for one-unit change; \(e^{c \cdot \beta}\) for \(c\)-unit change
Binary predictor OR compares the two categories directly
Continuous predictor OR is per one-unit increase; scale appropriately
Adjusted OR Include multiple predictors to control for confounding
In R glm(y ~ x, family = binomial)
Exponentiate tidy(model, exponentiate = TRUE) or exp(coef(model))

Next Lecture (April 14)

  • Multiple logistic regression in depth
  • Wald test vs. likelihood ratio test
  • Confidence intervals for adjusted ORs
  • Hosmer-Lemeshow goodness-of-fit test
  • ROC curves and AUC
  • Interaction terms in logistic regression

References

  • Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapter 22.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • CDC. Frequent Mental Distress definition: 14+ mentally unhealthy days in past 30 days.

Part 7: In-Class Lab Activity

EPI 553 — Logistic Regression Part 1 Lab Due: End of class, April 9, 2026

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable Description Type
fmd Frequent mental distress (No/Yes) Factor (outcome)
menthlth_days Mentally unhealthy days (0-30) Numeric
physhlth_days Physically unhealthy days (0-30) Numeric
sleep_hrs Hours of sleep per night Numeric
age Age in years Numeric
sex Male / Female Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income category (1-8) Numeric
smoker Former/Never vs. Current Factor
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS("brfss_logistic_2020.rds")

Task 1: Explore the Binary Outcome (15 points)

1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.

brfss_logistic %>%
  count(fmd) %>%
  mutate(`%` = round(n / sum(n) * 100, 1)) %>%
  kable(
    caption   = "Table 1. Frequency of Frequent Mental Distress (FMD)",
    col.names = c("FMD Status", "N", "%")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 1. Frequency of Frequent Mental Distress (FMD)
FMD Status N %
No 4243 84.9
Yes 757 15.1

1b. (5 pts) Create a descriptive summary table of at least 4 predictors, stratified by FMD status. Use tbl_summary().

# Continuous variables stratified by FMD
brfss_logistic %>%
  group_by(fmd) %>%
  summarise(
    `Mean physhlth_days` = round(mean(physhlth_days, na.rm = TRUE), 2),
    `Mean sleep_hrs`     = round(mean(sleep_hrs, na.rm = TRUE), 2),
    `Mean age`           = round(mean(age, na.rm = TRUE), 2),
    `Mean bmi`           = round(mean(bmi, na.rm = TRUE), 2),
    `Mean income_cat`    = round(mean(income_cat, na.rm = TRUE), 2),
    N                    = n()
  ) %>%
  kable(
    caption = "Table 2a. Continuous Predictors by FMD Status"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2a. Continuous Predictors by FMD Status
fmd Mean physhlth_days Mean sleep_hrs Mean age Mean bmi Mean income_cat N
No 3.20 7.09 57.44 28.40 6.00 4243
Yes 10.39 6.51 50.49 29.33 5.05 757
# Categorical variables stratified by FMD
brfss_logistic %>%
  group_by(fmd, sex) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(fmd) %>%
  mutate(`%` = round(n / sum(n) * 100, 1)) %>%
  kable(
    caption   = "Table 2b. Sex Distribution by FMD Status",
    col.names = c("FMD", "Sex", "N", "%")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2b. Sex Distribution by FMD Status
FMD Sex N %
No Male 2378 56.0
No Female 1865 44.0
Yes Male 323 42.7
Yes Female 434 57.3
brfss_logistic %>%
  group_by(fmd, exercise) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(fmd) %>%
  mutate(`%` = round(n / sum(n) * 100, 1)) %>%
  kable(
    caption   = "Table 2c. Exercise Distribution by FMD Status",
    col.names = c("FMD", "Exercise", "N", "%")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2c. Exercise Distribution by FMD Status
FMD Exercise N %
No No 1051 24.8
No Yes 3192 75.2
Yes No 276 36.5
Yes Yes 481 63.5

1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.

brfss_logistic %>%
  group_by(smoker, fmd) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(smoker) %>%
  mutate(prop = n / sum(n)) %>%
  filter(fmd == "Yes") %>%
  ggplot(aes(x = smoker, y = prop, fill = smoker)) +
  geom_col(alpha = 0.85, width = 0.5) +
  geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("steelblue", "coral")) +
  labs(
    title    = "Figure 1. Proportion with Frequent Mental Distress by Smoking Status",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Smoking Status",
    y        = "Proportion with FMD"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
Figure 1. Proportion of Frequent Mental Distress by Smoking Status

Figure 1. Proportion of Frequent Mental Distress by Smoking Status

Task 2: Simple Logistic Regression (20 points)

2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.

# Model 1: FMD ~ exercise (unadjusted)
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
                    family = binomial(link = "logit"))

tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 3. Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3. Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) -1.3371 0.0676 -19.7689 0 -1.4714 -1.2062
exerciseYes -0.5554 0.0835 -6.6545 0 -0.7184 -0.3911

2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.

tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 4. Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)",
    col.names = c("Term", "OR", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4. Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)
Term OR Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.2626 0.0676 -19.7689 0 0.2296 0.2993
exerciseYes 0.5738 0.0835 -6.6545 0 0.4876 0.6763
or_ex <- round(exp(coef(mod_exercise)["exerciseYes"]), 3)
ci_ex <- round(exp(confint(mod_exercise)["exerciseYes", ]), 3)
p_ex  <- round(tidy(mod_exercise)$p.value[2], 4)

2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.

The odds ratio for exercise is 0.574 (95% CI: 0.488 to 0.676; p < 0.0001). Individuals who exercised in the past 30 days had 42.6% lower odds of frequent mental distress compared to non-exercisers, without adjusting for any other variables. Since the entire confidence interval falls below 1 and p < 0.0001, this is a statistically significant protective association.

2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).

mod_age <- glm(fmd ~ age, data = brfss_logistic,
               family = binomial(link = "logit"))

# Generate predicted probabilities using predict()
pred_age <- data.frame(age = seq(18, 80, by = 1))
pred_age$prob <- predict(mod_age, newdata = pred_age, type = "response")

ggplot(pred_age, aes(x = age, y = prob)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  labs(
    title    = "Figure 2. Predicted Probability of FMD by Age",
    subtitle = "From simple logistic regression: fmd ~ age",
    x        = "Age (years)",
    y        = "Predicted Probability of FMD"
  ) +
  theme_minimal(base_size = 13)
Figure 2. Predicted Probability of FMD by Age

Figure 2. Predicted Probability of FMD by Age

Task 3: Comparing Predictors (20 points)

3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.

# Model 2: FMD ~ smoker
mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
                  family = binomial(link = "logit"))

# Model 3: FMD ~ sleep_hrs
mod_sleep <- glm(fmd ~ sleep_hrs, data = brfss_logistic,
                 family = binomial(link = "logit"))

# Model 4: FMD ~ sex
mod_sex <- glm(fmd ~ sex, data = brfss_logistic,
               family = binomial(link = "logit"))

tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(caption = "Table 5a. FMD ~ Smoking (OR Scale)",
        col.names = c("Term", "OR", "SE", "t", "p-value", "CI Lower", "CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 5a. FMD ~ Smoking (OR Scale)
Term OR SE t p-value CI Lower CI Upper
(Intercept) 0.1365 0.0537 -37.0761 0 0.1227 0.1515
smokerCurrent 1.9594 0.0799 8.4237 0 1.6753 2.2913
tidy(mod_sleep, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(caption = "Table 5b. FMD ~ Sleep Hours (OR Scale)",
        col.names = c("Term", "OR", "SE", "t", "p-value", "CI Lower", "CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 5b. FMD ~ Sleep Hours (OR Scale)
Term OR SE t p-value CI Lower CI Upper
(Intercept) 1.1013 0.1843 0.5234 0.6007 0.7668 1.5798
sleep_hrs 0.7652 0.0272 -9.8354 0.0000 0.7253 0.8070
tidy(mod_sex, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(caption = "Table 5c. FMD ~ Sex (OR Scale)",
        col.names = c("Term", "OR", "SE", "t", "p-value", "CI Lower", "CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 5c. FMD ~ Sex (OR Scale)
Term OR SE t p-value CI Lower CI Upper
(Intercept) 0.1358 0.0593 -33.6657 0 0.1207 0.1523
sexFemale 1.7132 0.0797 6.7527 0 1.4660 2.0040

3b. (10 pts) Create a table comparing the odds ratios from all three models.

bind_rows(
  tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) %>%
    filter(term == "exerciseYes") %>%
    mutate(Predictor = "Exercise (Yes vs. No)"),
  tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) %>%
    filter(term == "smokerCurrent") %>%
    mutate(Predictor = "Smoking (Current vs. Former/Never)"),
  tidy(mod_sleep, conf.int = TRUE, exponentiate = TRUE) %>%
    filter(term == "sleep_hrs") %>%
    mutate(Predictor = "Sleep Hours (per 1 hour)"),
  tidy(mod_sex, conf.int = TRUE, exponentiate = TRUE) %>%
    filter(term == "sexFemale") %>%
    mutate(Predictor = "Sex (Female vs. Male)")
) %>%
  select(Predictor, estimate, conf.low, conf.high, p.value) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 6. Crude Odds Ratios from Simple Logistic Regression Models",
    col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Crude Odds Ratios from Simple Logistic Regression Models
Predictor OR 95% CI Lower 95% CI Upper p-value
Exercise (Yes vs. No) 0.5738 0.4876 0.6763 0
Smoking (Current vs. Former/Never) 1.9594 1.6753 2.2913 0
Sleep Hours (per 1 hour) 0.7652 0.7253 0.8070 0
Sex (Female vs. Male) 1.7132 1.4660 2.0040 0
or_smoker <- round(exp(coef(mod_smoker)["smokerCurrent"]), 3)
or_sleep  <- round(exp(coef(mod_sleep)["sleep_hrs"]), 3)

3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.

Current smoking has the strongest crude association with FMD. Given an odds ratio of 1.959 current smokers have almost twice the odds of frequent mental distress compared to former/never smokers (the largest deviation from OR = 1 among the examined predictors ). Sleep hours also shows a meaningful protective association (OR = 0.765 per additional hour). Sex (OR = 1.713) and exercise (OR = 0.574) are also strongly associated.

Task 4: Introduction to Multiple Logistic Regression (20 points)

4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.

# Multiple logistic regression: FMD ~ exercise + age + sex + smoker + sleep_hrs + income_cat
mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))

summary(mod_multi)
## 
## Call:
## glm(formula = fmd ~ exercise + age + sex + smoker + sleep_hrs + 
##     income_cat, family = binomial(link = "logit"), data = brfss_logistic)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.926270   0.268602   7.171 7.42e-13 ***
## exerciseYes   -0.489579   0.089845  -5.449 5.06e-08 ***
## age           -0.025488   0.002653  -9.609  < 2e-16 ***
## sexFemale      0.511035   0.083382   6.129 8.85e-10 ***
## smokerCurrent  0.220855   0.087536   2.523   0.0116 *  
## sleep_hrs     -0.205438   0.027133  -7.572 3.69e-14 ***
## income_cat    -0.158812   0.019149  -8.294  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4251.3  on 4999  degrees of freedom
## Residual deviance: 3870.2  on 4993  degrees of freedom
## AIC: 3884.2
## 
## Number of Fisher Scoring iterations: 5

4b. (5 pts) Report the adjusted odds ratios using tbl_regression().

tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 7. Multiple Logistic Regression: Adjusted Odds Ratios",
    col.names = c("Term", "OR", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 7. Multiple Logistic Regression: Adjusted Odds Ratios
Term OR Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 6.8639 0.2686 7.1715 0.0000 4.0571 11.6318
exerciseYes 0.6129 0.0898 -5.4492 0.0000 0.5142 0.7314
age 0.9748 0.0027 -9.6090 0.0000 0.9698 0.9799
sexFemale 1.6670 0.0834 6.1289 0.0000 1.4162 1.9639
smokerCurrent 1.2471 0.0875 2.5230 0.0116 1.0501 1.4800
sleep_hrs 0.8143 0.0271 -7.5716 0.0000 0.7719 0.8586
income_cat 0.8532 0.0191 -8.2937 0.0000 0.8217 0.8858

4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.

or_smoker_crude <- round(exp(coef(mod_smoker)["smokerCurrent"]), 3)
or_smoker_adj   <- round(exp(coef(mod_multi)["smokerCurrent"]), 3)
ci_crude        <- round(exp(confint(mod_smoker)["smokerCurrent", ]), 3)
ci_adj          <- round(exp(confint(mod_multi)["smokerCurrent", ]), 3)

tibble(
  Model    = c("Crude (unadjusted)", "Adjusted (+ age, sex, exercise, sleep, income)"),
  OR       = c(or_smoker_crude, or_smoker_adj),
  `CI Low` = c(ci_crude[1], ci_adj[1]),
  `CI High`= c(ci_crude[2], ci_adj[2])
) %>%
  kable(
    caption   = "Table 8. Crude vs. Adjusted OR for Smoking (Current vs. Former/Never)",
    col.names = c("Model", "OR", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 8. Crude vs. Adjusted OR for Smoking (Current vs. Former/Never)
Model OR 95% CI Lower 95% CI Upper
Crude (unadjusted) 1.959 1.675 2.291
Adjusted (+ age, sex, exercise, sleep, income) 1.247 1.050 1.480

4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?

The crude OR for current smoking was 1.959 and the adjusted OR was 1.247 after controlling for age, sex, exercise, sleep, and income. This indicates positive confounding. Current smokers tend to exercise less, sleep fewer hours, and have lower incomes compared to former/never smokers. All these are independently associated with higher odds of FMD, and adjusting for these removes their contribution from the smoking estimate.The adjusted OR remains statistically significant (95% CI: 1.050 to 1.480). It confirms that smoking has an independent association with frequent mental distress besides what is explained by other covariates.

Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.

End of Lab Activity