Introduction

Throughout this course, we have modeled continuous outcomes: mentally unhealthy days, blood pressure, BMI. But many outcomes in epidemiology are binary: disease or no disease, survived or died, exposed or unexposed. Linear regression is not appropriate for binary outcomes because it can produce predicted probabilities outside the [0, 1] range and violates the assumptions of normally distributed residuals with constant variance.

Logistic regression is the standard method for modeling binary outcomes. It is arguably the most widely used regression technique in epidemiologic research. In this lecture, we will:

Understand why linear regression fails for binary outcomes
Learn the logistic model and the logit transformation
Connect logistic regression to the odds ratio, the primary measure of effect in epidemiology
Fit and interpret a simple logistic regression model in R
Extend to multiple logistic regression (introduced today, continued next lecture)

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.1 through 22.3)

Why/When/Watch-out

Why use logistic regression?

The outcome variable is binary (0/1, yes/no, disease/no disease)
We want to estimate the probability of an event occurring
We need adjusted odds ratios that control for confounders
We want to identify risk factors for a dichotomous health outcome

When to use it?

Cross-sectional studies: prevalence of a condition
Case-control studies: odds of exposure given disease status
Cohort studies: incidence of disease (when outcome is rare, OR approximates RR)

Watch out for:

Separation: when a predictor perfectly predicts the outcome, ML estimation fails
Small sample sizes: logistic regression needs roughly 10-20 events per predictor
Multicollinearity: same issues as in linear regression
Linearity in the logit: the relationship between continuous predictors and the log-odds must be linear

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020. In previous lectures, we modeled mentally unhealthy days as a continuous outcome. Today, we transform this into a binary outcome: frequent mental distress (FMD), defined as reporting 14 or more mentally unhealthy days in the past 30 days. This threshold is used by the CDC as a standard measure of frequent mental distress.

Research question:

What individual, behavioral, and socioeconomic factors are associated with frequent mental distress among U.S. adults?

brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()

brfss_logistic <- brfss_full |>
  mutate(
    # Binary outcome: frequent mental distress (>= 14 days)
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    fmd = factor(
      ifelse(menthlth_days >= 14, 1, 0),
      levels = c(0, 1),
      labels = c("No", "Yes")
    ),
    # Predictors
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    smoker = factor(case_when(
      smokday2 %in% c(1, 2) ~ "Current",
      smokday2 == 3          ~ "Former/Never",
      TRUE                   ~ NA_character_
    ), levels = c("Former/Never", "Current"))
  ) |>
  filter(
    !is.na(fmd), !is.na(physhlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
    !is.na(exercise), !is.na(income_cat), !is.na(smoker)
  )

set.seed(1220)
brfss_logistic <- brfss_logistic |>
  dplyr::select(fmd, menthlth_days, physhlth_days, sleep_hrs, age, sex,
                bmi, exercise, income_cat, smoker) |>
  slice_sample(n = 5000)

# Save for lab use
saveRDS(brfss_logistic,
  "brfss_logistic_2020.rds")

tibble(Metric = c("Observations", "Variables", "FMD Cases", "FMD Prevalence"),
       Value  = c(nrow(brfss_logistic), ncol(brfss_logistic),
                  sum(brfss_logistic$fmd == "Yes"),
                  paste0(round(100 * mean(brfss_logistic$fmd == "Yes"), 1), "%"))) |>
  kable(caption = "Analytic Dataset Overview") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Overview
Metric	Value
Observations	5000
Variables	10
FMD Cases	757
FMD Prevalence	15.1%

Descriptive Overview

brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(physhlth_days, sleep_hrs, age, sex, bmi, exercise,
                income_cat, smoker),
    type = list(
      c(physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs     ~ "Sleep hours",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      bmi           ~ "BMI",
      exercise      ~ "Exercise in past 30 days",
      income_cat    ~ "Income category (1-8)",
      smoker        ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels()

Characteristic	Overall N = 5,000¹	No N = 4,243¹	Yes N = 757¹	p-value²
Physical unhealthy days	4 (9)	3 (8)	10 (13)	<0.001
Sleep hours	7.00 (1.48)	7.09 (1.40)	6.51 (1.83)	<0.001
Age (years)	56 (16)	57 (16)	50 (16)	<0.001
Sex				<0.001
Male	2,701 (54%)	2,378 (56%)	323 (43%)
Female	2,299 (46%)	1,865 (44%)	434 (57%)
BMI	28.5 (6.3)	28.4 (6.2)	29.3 (7.0)	0.001
Exercise in past 30 days	3,673 (73%)	3,192 (75%)	481 (64%)	<0.001
Income category (1-8)	5.85 (2.11)	6.00 (2.04)	5.05 (2.29)	<0.001
Smoking status				<0.001
Former/Never	3,280 (66%)	2,886 (68%)	394 (52%)
Current	1,720 (34%)	1,357 (32%)	363 (48%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Interpretation: This table compares characteristics between those with and without frequent mental distress. Notice the differences: individuals with FMD report substantially more physically unhealthy days, sleep slightly less, are younger on average, are more likely to be female, less likely to exercise, have lower income, and are more likely to be current smokers. These differences motivate our regression analysis to identify independent risk factors.

Part 1: Why Linear Regression Fails for Binary Outcomes

The Problem

What happens if we try to use linear regression for a binary outcome? Let us fit a linear probability model and see.

Real-Life Examples

Before seeing this in our BRFSS data, consider two familiar clinical scenarios where the breakdown of linear regression is obvious.

Example 1: Predicting Diabetes Diagnosis from Fasting Blood Glucose

A clinician wants to predict whether a patient has diabetes (1 = yes, 0 = no) based on fasting blood glucose (mg/dL). After fitting a linear regression on a sample of patients, the model produces:

A 25-year-old with a fasting glucose of 60 mg/dL: \(\hat{P}(\text{diabetes}) = -0.08\) — a negative probability.
An elderly patient with a glucose of 380 mg/dL: \(\hat{P}(\text{diabetes}) = 1.21\) — a probability above 1.

Neither prediction is mathematically interpretable as a probability. The model is extrapolating beyond [0, 1] because a straight line has no natural ceiling or floor. In reality, the risk of diabetes rises steeply at high glucose levels but levels off — exactly the behavior the S-shaped logistic curve captures.

Example 2: Predicting 30-Day Hospital Readmission

A hospital administrator fits a linear regression to predict 30-day readmission (1 = readmitted, 0 = not) using age and number of comorbidities. For extreme cases:

A healthy 22-year-old with no comorbidities: \(\hat{P}(\text{readmission}) = -0.15\) (impossible)
An 85-year-old with eight comorbidities: \(\hat{P}(\text{readmission}) = 1.09\) (impossible)

Beyond the out-of-bounds predictions, the model also violates homoscedasticity. Because \(Y\) can only be 0 or 1, the variance of the residuals is \(p(1-p)\) — it depends on the fitted value and changes across the range of the predictor. Standard errors, confidence intervals, and hypothesis tests all rest on the assumption of constant variance, so they are invalid here.

These are not edge cases — they arise routinely whenever you apply linear regression to a binary outcome. The solution is to model the log-odds of the event, which is what logistic regression does.

# Create numeric version of outcome
brfss_logistic <- brfss_logistic |>
  mutate(fmd_num = as.numeric(fmd == "Yes"))

# Fit linear regression on binary outcome
lpm <- lm(fmd_num ~ age, data = brfss_logistic)

ggplot(brfss_logistic, aes(x = age, y = fmd_num)) +
  geom_jitter(alpha = 0.1, height = 0.05, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE, color = "darkorange", linewidth = 1.2) +
  labs(title = "Linear vs. Logistic Fit for a Binary Outcome",
       subtitle = "Red = Linear regression, Orange = Logistic regression",
       x = "Age (years)", y = "Frequent Mental Distress (0/1)") +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) +
  theme_minimal()

Three problems with using linear regression for binary outcomes:

Predicted values outside [0, 1]: The linear model can predict probabilities below 0 or above 1, which is nonsensical for a probability.
Non-constant variance: For a binary outcome, the variance of Y at a given X value is \(p(1-p)\), which depends on the predicted probability. This violates the homoscedasticity assumption.
Non-normal residuals: The residuals can only take two values at any given X, so they cannot be normally distributed.

The logistic model solves all three problems by modeling the log-odds of the outcome rather than the outcome itself.

Part 2: The Logistic Model

The Logistic Function

The logistic model describes the probability of the outcome (\(Y = 1\)) as a function of predictors using the logistic function:

\[\Pr(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}\]

This function has a characteristic S-shape (sigmoid curve) that is bounded between 0 and 1, making it ideal for modeling probabilities.

# Fit a simple logistic model using sleep hours to anchor the real-life annotations
mod_sleep_demo <- glm(fmd_num ~ sleep_hrs, data = brfss_logistic,
                      family = binomial(link = "logit"))
b0 <- coef(mod_sleep_demo)[1]
b1 <- coef(mod_sleep_demo)[2]

# Real-life anchor points: translate specific sleep-hour values into z and P(FMD)
anchors <- tibble(
  sleep  = c(3, 5, 7, 8, 10),
  label  = paste0(c(3, 5, 7, 8, 10), " hrs sleep")
) |>
  mutate(
    z    = b0 + b1 * sleep,
    prob = plogis(z),
    hover_text = paste0(
      "<b>", label, "</b><br>",
      "z (log-odds): ", round(z, 2), "<br>",
      "P(FMD): ", round(prob * 100, 1), "%"
    )
  )

# Logistic curve data
curve_df <- tibble(z = seq(-6, 6, by = 0.05)) |>
  mutate(prob = plogis(z))

plot_ly() |>
  # S-shaped logistic curve
  add_trace(
    data       = curve_df,
    x          = ~z, y = ~prob,
    type       = "scatter", mode = "lines",
    line       = list(color = "steelblue", width = 3),
    name       = "Logistic function",
    hoverinfo  = "skip"
  ) |>
  # Reference lines at y = 0, 0.5, 1 and x = 0
  add_segments(x = -6, xend = 6, y = 0.5, yend = 0.5,
               line = list(dash = "dash", color = "grey60", width = 1),
               showlegend = FALSE, hoverinfo = "skip") |>
  add_segments(x = 0, xend = 0, y = 0, yend = 1,
               line = list(dash = "dash", color = "grey60", width = 1),
               showlegend = FALSE, hoverinfo = "skip") |>
  # Anchor points from the real BRFSS sleep-hours model
  add_trace(
    data       = anchors,
    x          = ~z, y = ~prob,
    type       = "scatter", mode = "markers",
    marker     = list(size = 14, color = "darkorange",
                      line = list(color = "white", width = 1.5)),
    text       = ~hover_text,
    hoverinfo  = "text",
    name       = "BRFSS: sleep hrs → P(FMD)"
  ) |>
  layout(
    title = list(
      text = "<b>The Logistic Function</b><br><sup>f(z) = 1 / (1 + e<sup>−z</sup>)   |   Real-life example: sleep hours predicting P(frequent mental distress)</sup>",
      x    = 0.5
    ),
    xaxis = list(
      title      = "z  =  β₀ + β₁·X₁ + ··· + βₖ·Xₖ  (linear predictor / log-odds)",
      zeroline   = FALSE,
      range      = c(-6, 6)
    ),
    yaxis = list(
      title      = "Pr(FMD = 1)",
      tickformat = ".0%",
      range      = c(0, 1)
    ),
    legend = list(orientation = "h", y = -0.18),
    hovermode  = "closest",
    annotations = list(
      list(x = 3.5, y = 0.22,
           text = "⟵  More sleep<br>(lower risk)",
           showarrow = FALSE,
           font = list(size = 12, color = "grey40")),
      list(x = -3.5, y = 0.78,
           text = "Less sleep<br>(higher risk)  ⟶",
           showarrow = FALSE,
           font = list(size = 12, color = "grey40"))
    )
  )

How to read this chart: Each orange dot is a real person-type from our BRFSS data. The x-axis shows their value of the linear predictor \(z = \hat{\beta}_0 + \hat{\beta}_1 \times \text{sleep\_hrs}\) — a single number that summarises all predictor information. The S-curve then converts \(z\) into a probability bounded between 0 and 1. Hover over the dots to see exactly how many sleep hours map to which log-odds and predicted probability. Notice that going from 3 hours to 10 hours of sleep moves a person from the steep left flank to the flatter right flank of the curve — the logistic function naturally captures diminishing returns in risk reduction.

Key properties of the logistic function:

As \(z \to -\infty\), \(\Pr(Y=1) \to 0\)
As \(z \to +\infty\), \(\Pr(Y=1) \to 1\)
When \(z = 0\), \(\Pr(Y=1) = 0.5\)
The function is steepest around \(z = 0\)

The Logit Transformation

The logit is the natural log of the odds:

\[\text{logit}[\Pr(Y = 1)] = \ln\left(\frac{\Pr(Y=1)}{1 - \Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]

This is called the logit form of the model. The key insight: while the relationship between \(X\) and \(\Pr(Y=1)\) is non-linear (S-shaped), the relationship between \(X\) and the log-odds is linear. This is why the model is called “logistic regression”: we are doing linear regression on the log-odds scale.

Scale	Formula	Range	Interpretation
Probability	\(\Pr(Y=1)\)	[0, 1]	Chance of the event
Odds	\(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\)	[0, \(\infty\))	How much more likely than not
Log-odds (logit)	\(\ln\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right)\)	(\(-\infty\), \(\infty\))	Linear in predictors

Part 3: Odds and Odds Ratios

What Are Odds?

The odds of an event is the ratio of the probability that the event occurs to the probability that it does not:

\[\text{Odds}(D) = \frac{\Pr(D)}{1 - \Pr(D)}\]

For example, if \(\Pr(D) = 0.20\) (20% chance of disease):

\[\text{Odds}(D) = \frac{0.20}{0.80} = 0.25\]

This means the event is one-quarter as likely to happen as not happen, or equivalently, the odds are “4 to 1 against.”

tibble(Probability = seq(0.01, 0.99, by = 0.01)) |>
  mutate(Odds = Probability / (1 - Probability),
         `Log-Odds` = log(Odds)) |>
  pivot_longer(-Probability, names_to = "Scale", values_to = "Value") |>
  ggplot(aes(x = Probability, y = Value, color = Scale)) +
  geom_line(linewidth = 1.2) +
  facet_wrap(~Scale, scales = "free_y") +
  labs(title = "Probability, Odds, and Log-Odds",
       x = "Probability of Event") +
  theme_minimal() +
  theme(legend.position = "none")

The Odds Ratio

The odds ratio (OR) compares the odds of an event between two groups:

\[\text{OR} = \frac{\text{Odds in Group A}}{\text{Odds in Group B}}\]

Interpreting the odds ratio:

OR Value	Interpretation
OR = 1	No association
OR > 1	Higher odds in group A (risk factor)
OR < 1	Lower odds in group A (protective factor)

Connecting OR to Logistic Regression

For a binary predictor \(X\) (coded 0/1), the logistic model gives:

Log-odds when \(X = 1\): \(\beta_0 + \beta_1\)
Log-odds when \(X = 0\): \(\beta_0\)
Difference in log-odds: \(\beta_1\)

Therefore:

\[\text{OR} = e^{\beta_1}\]

This is the fundamental result: exponentiating the logistic regression coefficient gives the odds ratio. This works because the \(\beta_0\) terms cancel when we take the ratio of odds.

For a continuous predictor, \(e^{\beta_1}\) gives the OR for a one-unit increase in \(X\). For a \(c\)-unit increase:

\[\text{OR}_{c\text{-unit}} = e^{c \cdot \beta_1}\]

Part 4: Simple Logistic Regression in R

Fitting the Model with `glm()`

In R, logistic regression is fit using glm() with family = binomial(link = "logit").

Example 1: Binary predictor (exercise)

mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
                    family = binomial(link = "logit"))

tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1.337	0.068	-19.769	0	-1.471	-1.206
exerciseYes	-0.555	0.083	-6.655	0	-0.718	-0.391

Reading the output:

The intercept (\(\hat{\beta}_0\)) is the log-odds of FMD when exercise = “No” (the reference level)
The slope (\(\hat{\beta}_1\)) is the difference in log-odds between exercisers and non-exercisers
The p-value tests \(H_0: \beta_1 = 0\) (i.e., OR = 1, no association)

Converting to Odds Ratios

To get odds ratios, we exponentiate the coefficients:

tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.263	0.068	-19.769	0	0.230	0.299
exerciseYes	0.574	0.083	-6.655	0	0.488	0.676

Interpretation: The odds ratio for exercise tells us how the odds of frequent mental distress compare between those who exercise and those who do not, without adjusting for any other variables. An OR less than 1 indicates that exercise is associated with lower odds of FMD.

Using `gtsummary` for Publication-Ready Tables

mod_exercise |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(exercise ~ "Exercise in past 30 days")
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Exercise in past 30 days
No	—	—
Yes	0.57	0.49, 0.68	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Example 2: Continuous predictor (age)

mod_age <- glm(fmd ~ age, data = brfss_logistic,
               family = binomial(link = "logit"))

tidy(mod_age, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Age (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Age (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.725	0.131	-2.451	0.014	0.559	0.936
age	0.974	0.002	-10.703	0.000	0.970	0.979

Interpretation: The OR for age represents the change in odds of FMD for each one-year increase in age. A one-year change is rarely meaningful. Let us compute the OR for a 10-year increase:

beta_age <- coef(mod_age)["age"]
or_10yr <- exp(10 * beta_age)
ci_10yr <- exp(10 * confint(mod_age)["age", ])

tibble(
  `Age Difference` = "10 years",
  `OR` = round(or_10yr, 3),
  `95% CI Lower` = round(ci_10yr[1], 3),
  `95% CI Upper` = round(ci_10yr[2], 3)
) |>
  kable(caption = "Odds Ratio for 10-Year Increase in Age") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Odds Ratio for 10-Year Increase in Age
Age Difference	OR	95% CI Lower	95% CI Upper
10 years	0.771	0.736	0.809

Interpretation: The OR for a 10-year increase in age tells us how the odds of frequent mental distress change when comparing two individuals who differ by 10 years of age.

Visualizing the Predicted Probabilities

ggpredict(mod_age, terms = "age [18:80]") |>
  plot() +
  labs(title = "Predicted Probability of Frequent Mental Distress by Age",
       x = "Age (years)", y = "Predicted Probability of FMD") +
  theme_minimal()

Interpretation: The curve shows the predicted probability of frequent mental distress (FMD) across the age range of 18 to 80 years, holding all other factors constant. The S-shaped (sigmoidal) trajectory is the hallmark of logistic regression — probabilities are bounded between 0 and 1 and change most steeply around the midpoint of the curve. The shaded band represents the 95% confidence interval around the predicted probability at each age. A rising curve indicates that older individuals have a higher predicted probability of FMD; a falling curve would indicate the opposite. Pay attention to both the direction of the trend and how wide the confidence band is, especially at the extremes of age where data may be sparse.

Example 3: Binary predictor (smoking)

mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
                  family = binomial(link = "logit"))

tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Smoking (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Smoking (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.137	0.054	-37.076	0	0.123	0.151
smokerCurrent	1.959	0.080	8.424	0	1.675	2.291

Interpretation: The OR for current smoking vs. former/never smokers gives the unadjusted association between smoking and frequent mental distress. However, this crude OR may be confounded by other factors (such as age, income, or exercise). We need multiple logistic regression to obtain adjusted estimates.

Part 5: Multiple Logistic Regression (Introduction)

The Adjusted Odds Ratio

Just as in multiple linear regression, we can include several predictors in a logistic model to obtain adjusted estimates:

\[\text{logit}[\Pr(\text{FMD} = 1)] = \beta_0 + \beta_1(\text{exercise}) + \beta_2(\text{age}) + \beta_3(\text{sex}) + \beta_4(\text{smoker})\]

In this model, \(e^{\beta_1}\) gives the adjusted odds ratio for exercise, controlling for age, sex, and smoking status. This is the logistic regression analog of the adjusted coefficient in multiple linear regression.

mod_multi <- glm(fmd ~ exercise + age + sex + smoker + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))

mod_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise   ~ "Exercise (past 30 days)",
      age        ~ "Age (per year)",
      sex        ~ "Sex",
      smoker     ~ "Smoking status",
      sleep_hrs  ~ "Sleep hours (per hour)",
      income_cat ~ "Income category (per unit)"
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Exercise (past 30 days)
No	—	—
Yes	0.61	0.51, 0.73	<0.001
Age (per year)	0.97	0.97, 0.98	<0.001
Sex
Male	—	—
Female	1.67	1.42, 1.96	<0.001
Smoking status
Former/Never	—	—
Current	1.25	1.05, 1.48	0.012
Sleep hours (per hour)	0.81	0.77, 0.86	<0.001
Income category (per unit)	0.85	0.82, 0.89	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: Each odds ratio in this table is adjusted for all other variables in the model. Compare these adjusted ORs to the crude (unadjusted) ORs from the simple models. If they differ substantially, confounding is present. We will explore this more thoroughly in Part 2 next lecture.

Crude vs. Adjusted Comparison

# Crude ORs from simple models
crude_exercise <- tidy(mod_exercise, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

crude_smoker <- tidy(mod_smoker, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

# Adjusted ORs from multiple model
adj_exercise <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted")

adj_smoker <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted")

bind_rows(crude_exercise, adj_exercise, crude_smoker, adj_smoker) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
        caption = "Crude vs. Adjusted Odds Ratios") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Crude vs. Adjusted Odds Ratios
Predictor	OR	95% CI Lower	95% CI Upper	Type
exerciseYes	0.574	0.488	0.676	Crude
exerciseYes	0.613	0.514	0.731	Adjusted
smokerCurrent	1.959	1.675	2.291	Crude
smokerCurrent	1.247	1.050	1.480	Adjusted

Interpretation: Compare the crude and adjusted ORs for exercise and smoking. If the adjusted OR moves toward 1 (attenuates), confounding was inflating the crude association. If the adjusted OR moves away from 1, confounding was masking the true association. This comparison is fundamental to epidemiologic analysis.

Part 6: Model Assessment (Preview)

We will cover model assessment in detail next lecture, but here is a preview of the tools available:

Overall Model Significance

The likelihood ratio test compares the fitted model to the null (intercept-only) model:

null_mod <- glm(fmd ~ 1, data = brfss_logistic, family = binomial)

# Likelihood ratio test
lr_test <- anova(null_mod, mod_multi, test = "Chisq")

lr_test |>
  kable(digits = 3, caption = "Likelihood Ratio Test: Full Model vs. Null") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Likelihood Ratio Test: Full Model vs. Null
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4999	4251.299	NA	NA	NA
4993	3870.197	6	381.101	0

Model Fit Statistics

glance(mod_multi) |>
  dplyr::select(null.deviance, deviance, AIC, BIC, nobs) |>
  kable(digits = 1, caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Fit Statistics
null.deviance	deviance	AIC	BIC	nobs
4251.3	3870.2	3884.2	3929.8	5000

Key metrics (covered in detail next lecture):

Deviance: analogous to RSS in linear regression; lower is better
AIC/BIC: for model comparison; lower is better
Hosmer-Lemeshow test: goodness-of-fit test (next lecture)
ROC curve and AUC: discrimination ability (next lecture)

Summary

Concept	Key Point
Why logistic regression	Binary outcomes need a model that constrains predictions to [0, 1]
The logistic model	\(\Pr(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots)}}\)
The logit	\(\text{logit}[\Pr(Y=1)] = \beta_0 + \beta_1 X_1 + \cdots\) (linear in predictors)
Odds ratio	\(\text{OR} = e^{\beta}\) for one-unit change; \(e^{c \cdot \beta}\) for \(c\)-unit change
Binary predictor	OR compares the two categories directly
Continuous predictor	OR is per one-unit increase; scale appropriately
Adjusted OR	Include multiple predictors to control for confounding
In R	`glm(y ~ x, family = binomial)`
Exponentiate	`tidy(model, exponentiate = TRUE)` or `exp(coef(model))`

Next Lecture (April 14)

Multiple logistic regression in depth
Wald test vs. likelihood ratio test
Confidence intervals for adjusted ORs
Hosmer-Lemeshow goodness-of-fit test
ROC curves and AUC
Interaction terms in logistic regression

References

Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapter 22.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
CDC. Frequent Mental Distress definition: 14+ mentally unhealthy days in past 30 days.

Part 7: In-Class Lab Activity

EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0-30)	Numeric
`physhlth_days`	Physically unhealthy days (0-30)	Numeric
`sleep_hrs`	Hours of sleep per night	Numeric
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
library(dplyr)
library(scales)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS("brfss_logistic_2020.rds")
brfss_logistic$fmd <- ifelse(brfss_logistic$menthlth_days >= 14, 1, 0)# ================================
# Task 1a: Final Frequency Table
# ================================

freq_table <- table(brfss_logistic$fmd)
percent_table <- prop.table(freq_table) * 100

task1a_table <- data.frame(
  Category = c("No Frequent Mental Distress", "Frequent Mental Distress"),
  Frequency = as.vector(freq_table),
  Percentage = round(as.vector(percent_table), 2)
)

task1a_table

##                      Category Frequency Percentage
## 1 No Frequent Mental Distress      4243      84.86
## 2    Frequent Mental Distress       757      15.14

# ================================
# Task 1b: Descriptive Table
# ================================

# Recode variables for clean output
brfss_logistic <- brfss_logistic %>%
  mutate(
    fmd = factor(
      fmd,
      levels = c(0, 1),
      labels = c("No Frequent Mental Distress", "Frequent Mental Distress")
    ),
    sex = as.factor(sex),
    exercise = as.factor(exercise),
    smoker = as.factor(smoker),
    income_cat = as.factor(income_cat)
  )

# Create descriptive summary table stratified by FMD
task1b_table <- brfss_logistic %>%
  select(fmd, age, sex, bmi, exercise, smoker) %>%
  tbl_summary(
    by = fmd,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = list(
      age ~ "Age (years)",
      sex ~ "Sex",
      bmi ~ "Body Mass Index (BMI)",
      exercise ~ "Exercise",
      smoker ~ "Smoking Status"
    ),
    missing = "no"
  ) %>%
  add_p() %>%
  bold_labels()

# Display table
task1b_table

Characteristic	No Frequent Mental Distress N = 4,243¹	Frequent Mental Distress N = 757¹	p-value²
Age (years)	57.44 (16.06)	50.49 (15.69)	<0.001
Sex			<0.001
Male	2,378 (56%)	323 (43%)
Female	1,865 (44%)	434 (57%)
Body Mass Index (BMI)	28.40 (6.17)	29.33 (6.96)	0.001
Exercise	3,192 (75%)	481 (64%)	<0.001
Smoking Status			<0.001
Former/Never	2,886 (68%)	394 (52%)
Current	1,357 (32%)	363 (48%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

# ================================
# Lab Setup
# ================================

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

# Load data
brfss_logistic <- readRDS("brfss_logistic_2020.rds")

# Create binary outcome as numeric 0/1
brfss_logistic <- brfss_logistic %>%
  mutate(
    fmd = ifelse(menthlth_days >= 14, 1, 0)
  )

# ================================
# Task 1a: Final Frequency Table
# ================================

freq_table <- table(brfss_logistic$fmd)
percent_table <- prop.table(freq_table) * 100

task1a_table <- data.frame(
  Category = c("No Frequent Mental Distress", "Frequent Mental Distress"),
  Frequency = as.vector(freq_table),
  Percentage = round(as.vector(percent_table), 2)
)

task1a_table

##                      Category Frequency Percentage
## 1 No Frequent Mental Distress      4243      84.86
## 2    Frequent Mental Distress       757      15.14

# ================================
# Task 1b: Descriptive Table
# ================================

# Create a labeled copy ONLY for tbl_summary
brfss_summary <- brfss_logistic %>%
  mutate(
    fmd_label = factor(
      fmd,
      levels = c(0, 1),
      labels = c("No Frequent Mental Distress", "Frequent Mental Distress")
    ),
    sex = as.factor(sex),
    exercise = as.factor(exercise),
    smoker = as.factor(smoker),
    income_cat = as.factor(income_cat)
  )

# Create descriptive summary table stratified by FMD
task1b_table <- brfss_summary %>%
  select(fmd_label, age, sex, bmi, exercise, smoker) %>%
  tbl_summary(
    by = fmd_label,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = list(
      age ~ "Age (years)",
      sex ~ "Sex",
      bmi ~ "Body Mass Index (BMI)",
      exercise ~ "Exercise",
      smoker ~ "Smoking Status"
    ),
    missing = "no"
  ) %>%
  add_p() %>%
  bold_labels()

task1b_table

Characteristic	No Frequent Mental Distress N = 4,243¹	Frequent Mental Distress N = 757¹	p-value²
Age (years)	57.44 (16.06)	50.49 (15.69)	<0.001
Sex			<0.001
Male	2,378 (56%)	323 (43%)
Female	1,865 (44%)	434 (57%)
Body Mass Index (BMI)	28.40 (6.17)	29.33 (6.96)	0.001
Exercise	3,192 (75%)	481 (64%)	<0.001
Smoking Status			<0.001
Former/Never	2,886 (68%)	394 (52%)
Current	1,357 (32%)	363 (48%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

# ================================
# Task 1c: Bar Chart (FMD by Exercise)
# ================================

# Keep fmd numeric here
plot_data <- brfss_logistic %>%
  group_by(exercise) %>%
  summarise(
    prop_fmd = mean(fmd, na.rm = TRUE),
    .groups = "drop"
  )

plot_data

## # A tibble: 2 × 2
##   exercise prop_fmd
##   <fct>       <dbl>
## 1 No          0.208
## 2 Yes         0.131

ggplot(plot_data, aes(x = exercise, y = prop_fmd)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Proportion of Frequent Mental Distress by Exercise Status",
    x = "Exercise Status",
    y = "Proportion with Frequent Mental Distress"
  ) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 1)
  ) +
  theme_minimal()

# ================================
# Task 2a: Simple Logistic Regression (FMD ~ Exercise)
# ================================

# Fit logistic regression model
model_exercise <- glm(fmd ~ exercise, 
                      data = brfss_logistic, 
                      family = binomial)

# View model summary (log-odds scale)
summary(model_exercise)

## 
## Call:
## glm(formula = fmd ~ exercise, family = binomial, data = brfss_logistic)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33710    0.06764 -19.769  < 2e-16 ***
## exerciseYes -0.55544    0.08347  -6.655 2.84e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4251.3  on 4999  degrees of freedom
## Residual deviance: 4208.6  on 4998  degrees of freedom
## AIC: 4212.6
## 
## Number of Fisher Scoring iterations: 4

# Extract coefficients (log-odds)
model_results <- tidy(model_exercise)

model_results

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.34     0.0676    -19.8  5.51e-87
## 2 exerciseYes   -0.555    0.0835     -6.65 2.84e-11

# ================================
# Task 2b: Odds Ratios + 95% CI
# ================================

# Get odds ratios with 95% CI
or_results <- tidy(model_exercise, exponentiate = TRUE, conf.int = TRUE)

or_results

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.263    0.0676    -19.8  5.51e-87    0.230     0.299
## 2 exerciseYes    0.574    0.0835     -6.65 2.84e-11    0.488     0.676

# ================================
# Task 2d: Predicted Probability Plot
# ================================

# Fit logistic model with a continuous predictor (age)
model_age <- glm(fmd ~ age, 
                 data = brfss_logistic, 
                 family = binomial)

# Generate predicted probabilities across age
pred_age <- ggpredict(model_age, terms = "age")

# Plot
plot(pred_age) +
  labs(
    title = "Predicted Probability of Frequent Mental Distress by Age",
    x = "Age (years)",
    y = "Predicted Probability of FMD"
  ) +
  theme_minimal()

# ================================
# Task 3a: Three Simple Logistic Models
# ================================

# Model 1: FMD ~ Age
model_age <- glm(fmd ~ age, 
                 data = brfss_logistic, 
                 family = binomial)

# Model 2: FMD ~ Smoking Status
model_smoker <- glm(fmd ~ smoker, 
                    data = brfss_logistic, 
                    family = binomial)

# Model 3: FMD ~ BMI
model_bmi <- glm(fmd ~ bmi, 
                 data = brfss_logistic, 
                 family = binomial)

# Extract results (log-odds scale)
results_age <- tidy(model_age)
results_smoker <- tidy(model_smoker)
results_bmi <- tidy(model_bmi)

# Display results
results_age

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -0.322    0.131       -2.45 1.42e- 2
## 2 age          -0.0259   0.00242    -10.7  9.80e-27

results_smoker

## # A tibble: 2 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     -1.99     0.0537    -37.1  6.82e-301
## 2 smokerCurrent    0.673    0.0799      8.42 3.65e- 17

results_bmi

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -2.37     0.178      -13.3  2.34e-40
## 2 bmi           0.0223   0.00595      3.74 1.81e- 4

# ================================
# Task 3b: Compare Odds Ratios
# ================================

# Get ORs for each model
or_age <- tidy(model_age, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(model = "Age")

or_smoker <- tidy(model_smoker, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(model = "Smoking Status")

or_bmi <- tidy(model_bmi, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(model = "BMI")

# Combine into one table
or_comparison <- bind_rows(or_age, or_smoker, or_bmi) %>%
  filter(term != "(Intercept)") %>%
  select(model, term, estimate, conf.low, conf.high, p.value)

or_comparison

## # A tibble: 3 × 6
##   model          term          estimate conf.low conf.high  p.value
##   <chr>          <chr>            <dbl>    <dbl>     <dbl>    <dbl>
## 1 Age            age              0.974    0.970     0.979 9.80e-27
## 2 Smoking Status smokerCurrent    1.96     1.68      2.29  3.65e-17
## 3 BMI            bmi              1.02     1.01      1.03  1.81e- 4

# ================================
# Task 4a: Multiple Logistic Regression
# ================================

library(broom)

model_multi <- glm(
  fmd ~ age + exercise + smoker,
  data = brfss_logistic,
  family = binomial
)

summary(model_multi)

## 
## Call:
## glm(formula = fmd ~ age + exercise + smoker, family = binomial, 
##     data = brfss_logistic)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.07508    0.17125  -0.438    0.661    
## age           -0.02543    0.00256  -9.933  < 2e-16 ***
## exerciseYes   -0.63754    0.08675  -7.349 2.00e-13 ***
## smokerCurrent  0.42264    0.08363   5.054 4.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4251.3  on 4999  degrees of freedom
## Residual deviance: 4047.6  on 4996  degrees of freedom
## AIC: 4055.6
## 
## Number of Fisher Scoring iterations: 4

tidy(model_multi)

## # A tibble: 4 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    -0.0751   0.171      -0.438 6.61e- 1
## 2 age            -0.0254   0.00256    -9.93  2.98e-23
## 3 exerciseYes    -0.638    0.0868     -7.35  2.00e-13
## 4 smokerCurrent   0.423    0.0836      5.05  4.33e- 7

# ================================
# Task 4b: Adjusted Odds Ratios
# ================================

library(gtsummary)

# Create regression table with adjusted odds ratios
table_4b <- tbl_regression(
  model_multi,
  exponentiate = TRUE
)

table_4b

Characteristic	OR	95% CI	p-value
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.97	0.97, 0.98	<0.001
exercise
No	—	—
Yes	0.53	0.45, 0.63	<0.001
smoker
Former/Never	—	—
Current	1.53	1.29, 1.80	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Task 1: Explore the Binary Outcome (15 points)

1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.

A frequency table was created to summarize the distribution of frequent mental distress (FMD). Among the sample (n = 5000), 15.14% of individuals reported frequent mental distress, while 84.86% did not. This indicates that frequent mental distress is relatively uncommon in this population.

1b. (5 pts) Create a descriptive summary table of at least 4 predictors, stratified by FMD status. Use tbl_summary().

Table 1 presents the distribution of selected demographic and behavioral characteristics stratified by frequent mental distress (FMD) status. Continuous variables are summarized using means and standard deviations, and categorical variables are presented as frequencies and percentages.

Individuals with FMD were younger on average, more likely to be female, less likely to engage in exercise, and more likely to be current smokers compared to those without FMD. Several of these differences were statistically significant, suggesting potential associations between these characteristics and frequent mental distress.

1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.

Figure 1 displays the proportion of individuals with frequent mental distress (FMD) by exercise status. Individuals who did not engage in exercise had a higher proportion of FMD compared to those who exercised (approximately 21% vs. 13%). This suggests that lack of physical activity may be associated with an increased likelihood of frequent mental distress.

Task 2: Simple Logistic Regression (20 points)

2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.

Model Results (log-odds scale)

Intercept: −1.337

Exercise (Yes vs No): −0.555

A simple logistic regression model was fit to examine the association between exercise and frequent mental distress (FMD). On the log-odds scale, the intercept was −1.337, representing the log-odds of FMD among individuals who do not exercise. The coefficient for exercise (Yes vs. No) was −0.555, indicating that individuals who exercise have lower log-odds of FMD compared to those who do not exercise. This association was statistically significant (p < 0.001).

2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.

Odds Ratios

Intercept OR: 0.263

Exercise (Yes vs No) OR: 0.574

95% Confidence Interval

Exercise OR 95% CI: (0.49, 0.67) (rounded from your output)

The exponentiated coefficients from the logistic regression model represent odds ratios. The odds ratio for exercise (Yes vs. No) was 0.57 (95% CI: 0.49–0.67), indicating that individuals who exercise have 43% lower odds of frequent mental distress compared to those who do not exercise. The confidence interval does not include 1, indicating statistical significance.

2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.

The odds ratio of 0.57 suggests that individuals who exercise have substantially lower odds of experiencing frequent mental distress compared to those who do not exercise. This indicates a strong inverse association between exercise and frequent mental distress.

2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).

Figure 2 shows the predicted probability of frequent mental distress (FMD) across age. The predicted probability decreases as age increases, indicating that younger individuals have a higher likelihood of experiencing frequent mental distress. For example, the predicted probability is highest among younger individuals and declines steadily with increasing age, suggesting a strong inverse relationship between age and FMD.

Task 3: Comparing Predictors (20 points)

3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.

Three separate simple logistic regression models were fit to evaluate the association between age, smoking status, and body mass index (BMI) with frequent mental distress (FMD). Each model included only one predictor, allowing for assessment of the unadjusted association between each predictor and FMD.

3b. (10 pts) Create a table comparing the odds ratios from all three models.

Table 2 compares the odds ratios from the three models. Age was inversely associated with FMD (OR = 0.97, 95% CI: 0.97–0.98), indicating that older individuals have lower odds of frequent mental distress. Smoking status showed a strong positive association, with current smokers having nearly twice the odds of FMD compared to non-smokers (OR = 1.96, 95% CI: 1.68–2.29). BMI was also positively associated with FMD (OR = 1.02, 95% CI: 1.01–1.03), although the magnitude of this effect was relatively small.

Overall, smoking status exhibited the strongest association with frequent mental distress among the predictors examined.

3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.

Table 2 compares the odds ratios from three separate simple logistic regression models examining the association between age, smoking status, and BMI with frequent mental distress (FMD).

Age was inversely associated with FMD (OR = 0.97, 95% CI: 0.97–0.98), indicating that each additional year of age is associated with a decrease in the odds of frequent mental distress. Smoking status showed a strong positive association, with current smokers having nearly twice the odds of FMD compared to non-smokers (OR = 1.96, 95% CI: 1.68–2.29). BMI was also positively associated with FMD (OR = 1.02, 95% CI: 1.01–1.03), although the magnitude of this association was relatively small.

Overall, smoking status appears to have the strongest association with frequent mental distress among the predictors examined.

Task 4: Introduction to Multiple Logistic Regression (20 points)

4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.

A multiple logistic regression model was fit to examine the association between age, exercise, and smoking status with frequent mental distress (FMD). This model estimates the adjusted association of each predictor with FMD while controlling for the other variables.

4b. (5 pts) Report the adjusted odds ratios using tbl_regression().

Age (OR = 0.97): protective

Exercise (OR = 0.53): strong protective effect

Smoking (OR = 1.53): risk factor

Table 3 presents the adjusted odds ratios (AORs) from the multiple logistic regression model.

Age was inversely associated with FMD (OR = 0.97), indicating that older individuals have lower odds of experiencing frequent mental distress. Exercise was also inversely associated with FMD (OR = 0.53), suggesting a strong protective effect. In contrast, current smoking was positively associated with FMD (OR = 1.53), indicating higher odds of FMD among smokers.

4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.

For smoking status, the crude odds ratio was 1.96, while the adjusted odds ratio was 1.53. The reduction in the odds ratio after adjustment suggests that part of the association between smoking and frequent mental distress is explained by other variables in the model.

4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?

The decrease in the odds ratio for smoking (from 1.96 to 1.53) indicates that confounding is present. The crude estimate overestimated the strength of the association between smoking and frequent mental distress, and adjustment for age and exercise partially accounted for this relationship.

Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.

End of Lab Activity

Logistic Regression — Part 1

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Jessica OKoroji

April 9, 2026