Introduction

In Part 1, we introduced the logistic model, the logit transformation, and the connection between logistic regression coefficients and odds ratios. We fit simple logistic regression models with single predictors and previewed multiple logistic regression.

In Part 2, we go deeper:

Multiple logistic regression with adjusted odds ratios for confounder control
Maximum likelihood estimation and the Wald and likelihood ratio tests
Confidence intervals for adjusted odds ratios
Interaction terms in logistic regression and effect modification
Goodness-of-fit: deviance, Hosmer-Lemeshow test, pseudo-R²
Model discrimination: ROC curves and the area under the curve (AUC)
Diagnostics: residuals, influential observations, and linearity in the logit

Textbook reference: Kleinbaum et al., Chapter 22 (Sections 22.4 and 22.5)

Setup and Data

library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(ggeffects)
library(ResourceSelection)  # for Hosmer-Lemeshow
library(pROC)               # for ROC/AUC
library(performance)        # for model performance metrics
library(sjPlot)
library(modelsummary)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS(
  "C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Logistic Regression/brfss_logistic_2020.rds"
)

dim(brfss_logistic)

## [1] 5000   10

Outcome: fmd — Frequent Mental Distress (1 = 14+ mentally unhealthy days in past 30, 0 = otherwise).

Part 1: Multiple Logistic Regression

The Adjusted Model

We extend the simple model to include several predictors:

\[\text{logit}[\Pr(Y = 1)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k\]

Each coefficient \(\beta_j\) represents the change in log-odds for a one-unit increase in \(X_j\), holding all other predictors constant. Exponentiating gives the adjusted odds ratio:

\[\text{aOR}_j = e^{\beta_j}\]

mod_full <- glm(
  fmd ~ exercise + smoker + age + sex + sleep_hrs + income_cat + bmi + physhlth_days,
  data = brfss_logistic,
  family = binomial(link = "logit")
)

mod_full |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise      ~ "Exercise (past 30 days)",
      smoker        ~ "Smoking status",
      age           ~ "Age (per year)",
      sex           ~ "Sex",
      sleep_hrs     ~ "Sleep hours",
      income_cat    ~ "Income category (per unit)",
      bmi           ~ "BMI",
      physhlth_days ~ "Physically unhealthy days"
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Exercise (past 30 days)
No	—	—
Yes	0.82	0.68, 0.99	0.041
Smoking status
Former/Never	—	—
Current	1.30	1.09, 1.56	0.004
Age (per year)	0.97	0.96, 0.97	<0.001
Sex
Male	—	—
Female	1.68	1.41, 1.98	<0.001
Sleep hours	0.86	0.81, 0.90	<0.001
Income category (per unit)	0.91	0.87, 0.95	<0.001
BMI	1.01	0.99, 1.02	0.4
Physically unhealthy days	1.06	1.06, 1.07	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: Each row gives the adjusted odds ratio (aOR) and 95% CI for one predictor, controlling for all others. For example, the aOR for current smoking compares the odds of frequent mental distress for current smokers vs. former/never smokers, after adjusting for age, sex, sleep, income, BMI, exercise, and physical health. An aOR > 1 indicates a risk factor; an aOR < 1 indicates a protective factor.

Scaling Continuous Predictors

A 1-year change in age or a 1-unit change in BMI is rarely the most clinically meaningful comparison. We can rescale to improve interpretation:

mod_scaled <- glm(
  fmd ~ exercise + smoker + I(age/10) + sex + sleep_hrs +
        income_cat + I(bmi/5) + physhlth_days,
  data = brfss_logistic,
  family = binomial
)

mod_scaled |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      "I(age/10)"  ~ "Age (per 10 years)",
      "I(bmi/5)"   ~ "BMI (per 5 units)"
    )
  ) |>
  bold_labels()

Characteristic	OR	95% CI	p-value
exercise
No	—	—
Yes	0.82	0.68, 0.99	0.041
smoker
Former/Never	—	—
Current	1.30	1.09, 1.56	0.004
Age (per 10 years)	0.73	0.69, 0.77	<0.001
sex
Male	—	—
Female	1.68	1.41, 1.98	<0.001
sleep_hrs	0.86	0.81, 0.90	<0.001
income_cat	0.91	0.87, 0.95	<0.001
BMI (per 5 units)	1.03	0.97, 1.10	0.4
physhlth_days	1.06	1.06, 1.07	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: Now the aOR for age compares two individuals 10 years apart, and the aOR for BMI compares two individuals 5 BMI units apart, both more clinically interpretable.

Part 2: Maximum Likelihood Estimation and Hypothesis Tests

Maximum Likelihood Estimation (ML)

Unlike linear regression, which uses ordinary least squares, logistic regression coefficients are estimated by maximum likelihood. The algorithm finds the values of \(\beta_0, \beta_1, \ldots, \beta_k\) that maximize the likelihood of observing the data.

The likelihood function for \(n\) independent binary observations is:

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}\]

where \(p_i = \Pr(Y_i = 1 \mid X_i)\) is the predicted probability for observation \(i\). Taking the log gives the log-likelihood:

\[\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[y_i \ln p_i + (1 - y_i) \ln(1 - p_i)\right]\]

The ML estimates \(\hat{\beta}\) are obtained iteratively (typically by Newton-Raphson). We never compute these by hand, but it is important to know that R’s glm() reports Deviance \(= -2 \ln \hat{L}\), which is the foundation for hypothesis testing.

Wald Test (z-test for individual coefficients)

The Wald test is the default test reported by summary() and tidy(). For each coefficient \(\beta_j\):

\[z = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim N(0, 1) \text{ under } H_0: \beta_j = 0\]

The p-value tests whether the coefficient is significantly different from zero, equivalently whether the OR is significantly different from 1.

tidy(mod_full, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
                \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(caption = "Wald Tests for Each Coefficient (Exponentiated)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Wald Tests for Each Coefficient (Exponentiated)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	2.198	0.356	2.215	0.02675	1.095	4.414
exerciseYes	0.820	0.097	-2.044	0.04095	0.679	0.993
smokerCurrent	1.301	0.091	2.890	0.00386	1.088	1.555
age	0.969	0.003	-11.225	< 2e-16	0.963	0.974
sexFemale	1.675	0.086	5.975	2.30e-09	1.415	1.985
sleep_hrs	0.857	0.027	-5.643	1.67e-08	0.812	0.904
income_cat	0.909	0.020	-4.703	2.56e-06	0.873	0.946
bmi	1.006	0.006	0.932	0.35113	0.993	1.018
physhlth_days	1.065	0.004	15.634	< 2e-16	1.057	1.073

Caveat: The Wald test can be unreliable when sample sizes are small or when coefficients are large. The likelihood ratio test is generally preferred for these situations.

Likelihood Ratio Test (LR test)

The likelihood ratio test compares two nested models: a “full” model and a “reduced” model that drops one or more predictors. The test statistic is:

\[\text{LR} = -2(\ln \hat{L}_{\text{reduced}} - \ln \hat{L}_{\text{full}}) = D_{\text{reduced}} - D_{\text{full}}\]

Under \(H_0\) that the dropped predictors have no effect, LR follows a \(\chi^2\) distribution with degrees of freedom equal to the number of parameters dropped.

Example: Test the joint contribution of smoking and exercise

mod_reduced <- glm(
  fmd ~ age + sex + sleep_hrs + income_cat + bmi + physhlth_days,
  data = brfss_logistic,
  family = binomial
)

anova(mod_reduced, mod_full, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test: Does adding exercise + smoker improve the model?") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does adding exercise + smoker improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4993	3641.913	NA	NA	NA
4991	3628.474	2	13.439	0.001

Interpretation: The LR test gives a \(\chi^2\) statistic on 3 degrees of freedom (1 for exercise, 2 for smoker — but smoker has 2 levels so 1 dummy variable is created, making df = 2 here actually). A small p-value means the dropped variables jointly contribute to model fit.

Wald vs. LR test in practice

Aspect	Wald test	LR test
What it tests	Single coefficient or vector	Nested model comparison
Computational cost	Very fast	Requires fitting two models
Reliability with small samples	Less reliable	Generally preferred
Reported by R	`summary(model)`	`anova(m1, m2, test = "LRT")`

In large samples (like ours with n = 5,000), the two tests usually agree. In smaller samples or with extreme estimates, prefer the LR test.

Part 3: Confidence Intervals for Adjusted Odds Ratios

For the OR of a single coefficient, the 95% CI is computed on the log-odds scale and then exponentiated:

\[95\% \text{ CI for } e^{\beta_j} = \exp\left(\hat{\beta}_j \pm 1.96 \cdot \text{SE}(\hat{\beta}_j)\right)\]

This is the default approach used by confint() and tidy(..., conf.int = TRUE).

ci_table <- tidy(mod_full, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)))

ci_table |>
  kable(col.names = c("Predictor", "aOR", "95% CI Lower", "95% CI Upper"),
        caption = "Adjusted Odds Ratios with 95% CIs") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Adjusted Odds Ratios with 95% CIs
Predictor	aOR	95% CI Lower	95% CI Upper
exerciseYes	0.820	0.679	0.993
smokerCurrent	1.301	1.088	1.555
age	0.969	0.963	0.974
sexFemale	1.675	1.415	1.985
sleep_hrs	0.857	0.812	0.904
income_cat	0.909	0.873	0.946
bmi	1.006	0.993	1.018
physhlth_days	1.065	1.057	1.073

Forest Plot of Adjusted ORs

A forest plot is the standard way to visualize multiple ORs and their CIs:

ci_table |>
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2,      # ✅ fixed
                color = "steelblue", orientation = "y") +                  # ✅ fixed
  scale_x_log10() +
  labs(title = "Forest Plot of Adjusted Odds Ratios for Frequent Mental Distress",
       subtitle = "Reference line at OR = 1; log-scale x-axis",
       x = "Adjusted Odds Ratio (95% CI)", y = NULL) +
  theme_minimal()

Interpretation: Predictors whose CIs do not cross the dashed line at OR = 1 are statistically significantly associated with FMD at the 0.05 level. The log-scale x-axis ensures that ORs of 0.5 and 2.0 (which represent equally strong associations in opposite directions) appear equidistant from 1.

Part 4: Interaction (Effect Modification)

What Is Interaction in Logistic Regression?

Interaction (effect modification) occurs when the effect of one predictor on the outcome depends on the value of another predictor. In logistic regression, interaction is modeled by including a product term:

\[\text{logit}[\Pr(Y=1)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2)\]

If \(\beta_3 \neq 0\), the OR for \(X_1\) depends on the value of \(X_2\).

Example: Does the effect of exercise differ by sex?

mod_interact <- glm(
  fmd ~ exercise * sex + age + smoker + sleep_hrs + income_cat,
  data = brfss_logistic,
  family = binomial
)

mod_interact |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
exercise
No	—	—
Yes	0.61	0.47, 0.79	<0.001
sex
Male	—	—
Female	1.66	1.26, 2.21	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.97	0.97, 0.98	<0.001
smoker
Former/Never	—	—
Current	1.25	1.05, 1.48	0.012
sleep_hrs	0.81	0.77, 0.86	<0.001
income_cat	0.85	0.82, 0.89	<0.001
exercise * sex
Yes * Female	1.00	0.71, 1.41	>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Testing the Interaction

mod_no_interact <- glm(
  fmd ~ exercise + sex + age + smoker + sleep_hrs + income_cat,
  data = brfss_logistic,
  family = binomial
)

anova(mod_no_interact, mod_interact, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test for Exercise × Sex Interaction") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test for Exercise × Sex Interaction
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4993	3870.197	NA	NA	NA
4992	3870.197	1	0	0.987

Interpretation: If the p-value for the LR test is small (< 0.05), the interaction is statistically significant: the effect of exercise differs by sex. If not, we can drop the interaction term and use the simpler main-effects model.

Visualizing the Interaction

ggpredict(mod_interact, terms = c("exercise", "sex")) |>
  plot() +
  labs(title = "Predicted Probability of FMD by Exercise and Sex",
       x = "Exercise", y = "Predicted Probability of FMD",
       color = "Sex") +
  theme_minimal()

Interpretation: If the lines are parallel, there is no interaction. If they cross or diverge, the effect of exercise differs across sex.

Stratified Odds Ratios

When an interaction is present, we report stratum-specific odds ratios rather than a single overall OR:

# Stratum-specific ORs from the interaction model
ggpredict(mod_interact, terms = c("exercise", "sex")) |>
  as_tibble() |>
  pivot_wider(id_cols = group, names_from = x, values_from = predicted) |>
  mutate(OR_yes_vs_no = (Yes / (1 - Yes)) / (No / (1 - No))) |>
  dplyr::select(Sex = group, OR_yes_vs_no) |>
  kable(digits = 3,
        col.names = c("Sex", "OR (Exercise: Yes vs. No)"),
        caption = "Stratum-Specific Odds Ratios for Exercise") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Stratum-Specific Odds Ratios for Exercise
Sex	OR (Exercise: Yes vs. No)
Male	0.612
Female	0.614

Part 5: Goodness-of-Fit

Deviance

The deviance of a logistic model is:

\[D = -2 \ln \hat{L}\]

It is analogous to the residual sum of squares in linear regression: smaller is better. By itself, the deviance is hard to interpret, but differences in deviance between nested models follow a \(\chi^2\) distribution and form the basis of the LR test.

glance(mod_full) |>
  dplyr::select(null.deviance, df.null, deviance, df.residual, AIC, BIC) |>
  kable(digits = 1, caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Fit Statistics
null.deviance	df.null	deviance	df.residual	AIC	BIC
4251.3	4999	3628.5	4991	3646.5	3705.1

Quick check: The difference between null.deviance and deviance represents the improvement from adding all predictors to an intercept-only model. We can test this with an LR test on df.null - df.residual degrees of freedom.

Pseudo-R²

There is no exact analog of \(R^2\) for logistic regression, but several “pseudo-R²” measures exist. The most common is McFadden’s R²:

\[R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}}\]

Values between 0.2 and 0.4 are considered excellent fit.

performance::r2_mcfadden(mod_full)

## # R2 for Generalized Linear Regression
##        R2: 0.147
##   adj. R2: 0.146

Interpretation: McFadden’s R² should not be interpreted on the same scale as linear regression R². Values are typically much smaller (e.g., 0.1 may indicate a reasonable fit).

Hosmer-Lemeshow Test

The Hosmer-Lemeshow test assesses the agreement between predicted and observed event rates within deciles of predicted probability. A non-significant p-value indicates adequate fit.

hl_test <- hoslem.test(
  x = as.numeric(brfss_logistic$fmd) - 1,
  y = fitted(mod_full),
  g = 10
)

hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_logistic$fmd) - 1, fitted(mod_full)
## X-squared = 8.9639, df = 8, p-value = 0.3453

Interpretation: A small p-value (< 0.05) suggests that the model does not fit well in some regions of predicted probability. With large samples (like ours), the Hosmer-Lemeshow test can be over-powered and detect trivial misfit. Always pair it with a calibration plot.

Calibration Plot

brfss_logistic |>
  mutate(pred_prob = fitted(mod_full),
         obs_outcome = as.numeric(fmd) - 1,
         decile = ntile(pred_prob, 10)) |>
  group_by(decile) |>
  summarise(
    mean_pred = mean(pred_prob),
    mean_obs  = mean(obs_outcome),
    .groups = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(color = "steelblue") +
  labs(title = "Calibration Plot: Observed vs. Predicted Probability of FMD",
       subtitle = "Points should fall on the dashed line for perfect calibration",
       x = "Mean Predicted Probability (within decile)",
       y = "Observed Proportion (within decile)") +
  theme_minimal()

Interpretation: A well-calibrated model has points lying close to the 45-degree line. Systematic departures suggest miscalibration: points above the line indicate the model under-predicts; points below indicate over-prediction.

Part 6: Discrimination — ROC Curve and AUC

While calibration assesses how well predicted probabilities match observed rates, discrimination assesses how well the model separates events from non-events.

The ROC curve plots sensitivity (true positive rate) against 1 − specificity (false positive rate) across all possible probability cutoffs.

The AUC (area under the ROC curve) summarizes discrimination:

AUC	Discrimination
0.5	No discrimination (chance)
0.6-0.7	Poor
0.7-0.8	Acceptable
0.8-0.9	Excellent
> 0.9	Outstanding

roc_obj <- roc(
  response = brfss_logistic$fmd,
  predictor = fitted(mod_full),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Frequent Mental Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()

Interpretation: An AUC of approximately 0.75-0.80 indicates acceptable to excellent discrimination, meaning the model can distinguish between individuals with and without FMD reasonably well. Note that calibration and discrimination are distinct concepts: a model can have good discrimination but poor calibration, or vice versa.

Part 7: Diagnostics for Logistic Regression

Linearity in the Logit (for continuous predictors)

For continuous predictors, logistic regression assumes a linear relationship between the predictor and the log-odds. We can check this with a smoothed plot of the logit against the predictor.

brfss_logistic |>
  mutate(logit_pred = predict(mod_full, type = "link")) |>
  ggplot(aes(x = age, y = logit_pred)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange") +
  labs(title = "Linearity in the Logit: Age",
       x = "Age (years)", y = "Predicted Log-Odds (logit)") +
  theme_minimal()

Interpretation: A roughly linear loess curve supports the linearity assumption. A clearly curved pattern suggests we should add a quadratic term or use a spline.

Influential Observations

Cook’s distance and standardized residuals from logistic regression can be examined the same way as in linear regression:

brfss_logistic |>
  mutate(cooks_d = cooks.distance(mod_full),
         row_id = row_number()) |>
  ggplot(aes(x = row_id, y = cooks_d)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_hline(yintercept = 4 / nrow(brfss_logistic),
             linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance for Logistic Regression Model",
       subtitle = "Red line: 4/n threshold",
       x = "Observation Index", y = "Cook's Distance") +
  theme_minimal()

Multicollinearity

VIFs work the same way as in linear regression:

vif(mod_full) |>
  as.data.frame() |>
  rownames_to_column("Predictor") |>
  kable(digits = 2, caption = "Variance Inflation Factors") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Variance Inflation Factors
Predictor	vif(mod_full)
exercise	1.13
smoker	1.12
age	1.17
sex	1.01
sleep_hrs	1.02
income_cat	1.14
bmi	1.03
physhlth_days	1.20

Rule of thumb: VIF > 5 (or 10) indicates problematic multicollinearity.

Part 8: Reporting Logistic Regression Results

A publication-quality logistic regression table should include:

Sample size and number of events
Adjusted odds ratios with 95% CIs
p-values for each coefficient
Model fit statistics (AIC or pseudo-R²) - (may declutter)
Discrimination metric (AUC) - (methodological purposes)

mod_full |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise      ~ "Exercise (past 30 days)",
      smoker        ~ "Smoking status",
      age           ~ "Age",
      sex           ~ "Sex",
      sleep_hrs     ~ "Sleep hours",
      income_cat    ~ "Income category",
      bmi           ~ "BMI",
      physhlth_days ~ "Physically unhealthy days"
    )
  ) |>
  add_glance_source_note(
    #include = c(nobs, AIC, BIC),
    include = everything(),
    label = list(nobs ~ "N", AIC ~ "AIC", BIC ~ "BIC")
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("**Adjusted Odds Ratios for Frequent Mental Distress, BRFSS 2020**")

**Adjusted Odds Ratios for Frequent Mental Distress, BRFSS 2020**
Characteristic	OR	95% CI	p-value
Exercise (past 30 days)
No	—	—
Yes	0.82	0.68, 0.99	0.041
Smoking status
Former/Never	—	—
Current	1.30	1.09, 1.56	0.004
Age	0.97	0.96, 0.97	<0.001
Sex
Male	—	—
Female	1.68	1.41, 1.98	<0.001
Sleep hours	0.86	0.81, 0.90	<0.001
Income category	0.91	0.87, 0.95	<0.001
BMI	1.01	0.99, 1.02	0.4
Physically unhealthy days	1.06	1.06, 1.07	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Null deviance = 4,251; Null df = 4,999; Log-likelihood = -1,814; AIC = 3,646; BIC = 3,705; Deviance = 3,628; Residual df = 4,991; N = 5,000

Summary

Concept	Tool / R function
Multiple logistic regression	`glm(y ~ x1 + x2 + ..., family = binomial)`
Adjusted odds ratios	`tidy(model, exponentiate = TRUE)`
Wald test	Default in `summary()` and `tidy()`
Likelihood ratio test	`anova(reduced, full, test = "LRT")`
Confidence intervals	`confint(model)` (profile CI) or `tidy(..., conf.int = TRUE)`
Interaction	`glm(y ~ x1 * x2, ...)`; test with LR test
Pseudo-R²	`performance::r2_mcfadden()`
Hosmer-Lemeshow	`ResourceSelection::hoslem.test()`
Calibration plot	Decile-based observed vs. predicted
ROC curve / AUC	`pROC::roc()` and `pROC::auc()`
Diagnostics	`cooks.distance()`, `vif()`, linearity plots
Publication table	`gtsummary::tbl_regression()`

Next Lecture (April 16)

Polytomous logistic regression: outcomes with > 2 nominal categories
Ordinal logistic regression: outcomes with > 2 ordered categories
Proportional odds assumption and how to test it

References

Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapter 22.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
Steyerberg, E. W. (2019). Clinical Prediction Models (2nd ed.). Springer.

Part 9: In-Class Lab Activity

EPI 553 — Logistic Regression Part 2 Lab Due: End of class, April 14, 2026

Instructions

In this lab, you will build a multiple logistic regression model, conduct hypothesis tests, examine an interaction, and assess goodness-of-fit and discrimination. Use the same BRFSS 2020 logistic dataset from Part 1. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0–30)	Numeric
`physhlth_days`	Physically unhealthy days (0–30)	Numeric
`sleep_hrs`	Sleep hours per night (1–14)	Numeric
`age`	Age in years (capped at 80)	Numeric
`sex`	Sex (Male/Female)	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1–8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(ResourceSelection)
library(pROC)
library(performance)

brfss_logistic <- readRDS(
  "C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Logistic Regression/brfss_logistic_2020.rds"
)

Task 1: Build a Multiple Logistic Regression Model (15 points)

1a. Fit a multiple logistic regression model (5 pts)

Fitting a multiple logistic regression model predicting fmd from six predictors: sleep_hrs, exercise, smoker, sex, income_cat, and physhlth_days.

mod_lab_full <- glm(
  fmd ~ sleep_hrs + exercise + smoker + sex + income_cat + physhlth_days,
  data   = brfss_logistic,
  family = binomial(link = "logit")
)

tidy(mod_lab_full, conf.int = TRUE, exponentiate = FALSE) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  kable(caption = "Multiple Logistic Regression: FMD ~ 6 Predictors (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Multiple Logistic Regression: FMD ~ 6 Predictors (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.632	0.244	-2.592	0.010	-1.112	-0.155
sleep_hrs	-0.194	0.027	-7.206	0.000	-0.248	-0.142
exerciseYes	-0.091	0.094	-0.967	0.334	-0.273	0.094
smokerCurrent	0.505	0.087	5.801	0.000	0.334	0.675
sexFemale	0.485	0.085	5.717	0.000	0.319	0.652
income_cat	-0.088	0.020	-4.386	0.000	-0.127	-0.049
physhlth_days	0.055	0.004	14.483	0.000	0.048	0.063

1b. Publication-quality table of adjusted ORs (5 pts)

mod_lab_full |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      sleep_hrs     ~ "Sleep hours (per 1 hr)",
      exercise      ~ "Exercise (past 30 days)",
      smoker        ~ "Smoking status",
      sex           ~ "Sex",
      income_cat    ~ "Income category (per unit)",
      physhlth_days ~ "Physically unhealthy days"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("**Table 1. Adjusted Odds Ratios for Frequent Mental Distress, BRFSS 2020**")

**Table 1. Adjusted Odds Ratios for Frequent Mental Distress, BRFSS 2020**
Characteristic	OR	95% CI	p-value
Sleep hours (per 1 hr)	0.82	0.78, 0.87	<0.001
Exercise (past 30 days)
No	—	—
Yes	0.91	0.76, 1.10	0.3
Smoking status
Former/Never	—	—
Current	1.66	1.40, 1.96	<0.001
Sex
Male	—	—
Female	1.62	1.38, 1.92	<0.001
Income category (per unit)	0.92	0.88, 0.95	<0.001
Physically unhealthy days	1.06	1.05, 1.07	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

1c. Interpretation of two adjusted ORs (5 pts)

Sleep hours (continuous predictor): The adjusted OR for sleep hours (aOR = 0.82, 95% CI: 0.78–0.87) represents the multiplicative change in the odds of frequent mental distress (FMD) for each additional hour of sleep per night, holding all other predictors constant. Each additional hour of nightly sleep is associated with approximately 18% lower odds of FMD, after adjusting for exercise, smoking, sex, income, and physical health — confirming sleep as a strong, independent protective factor against frequent mental distress.

Smoking status (categorical predictor): The adjusted OR for smoking (aOR = 1.66, 95% CI: 1.40–1.96) compares current smokers to the reference group (Former/Never smokers), holding all other predictors constant. Current smokers have 66% higher odds of frequent mental distress relative to former or never smokers, underscoring cigarette smoking as an independent risk factor for FMD even after accounting for sex, sleep, income, exercise, and physical health.

Task 2: Wald and Likelihood Ratio Tests (15 points)

2a. Wald p-values for each predictor (5 pts)

tidy(mod_lab_full, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~ round(.x, 3)),
    p.value = format.pval(p.value, digits = 3, eps = 0.001)
  ) |>
  kable(
    col.names = c("Term", "aOR", "SE", "z (Wald)", "95% CI Lower", "95% CI Upper", "Wald p-value"),
    caption   = "Wald Tests for Each Coefficient in the Full Lab Model"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Wald Tests for Each Coefficient in the Full Lab Model
Term	aOR	SE	z (Wald)	95% CI Lower	95% CI Upper	Wald p-value
sleep_hrs	0.823	0.027	-7.206	<0.001	0.781	0.868
exerciseYes	0.913	0.094	-0.967	0.334	0.761	1.099
smokerCurrent	1.656	0.087	5.801	<0.001	1.396	1.964
sexFemale	1.625	0.085	5.717	<0.001	1.376	1.920
income_cat	0.916	0.020	-4.386	<0.001	0.881	0.953
physhlth_days	1.057	0.004	14.483	<0.001	1.049	1.065

Interpretation: Five of the six predictors yield statistically significant Wald p-values (all < 0.05): sleep hours, smoking status, sex, income category, and physically unhealthy days. Exercise alone is not statistically significant (p = 0.334). Sleep hours and physically unhealthy days show the most extreme test statistics, reflecting the strongest evidence of association with FMD in this adjusted model.

2b. Likelihood ratio test — reduced model dropping `income_cat` (5 pts)

mod_lab_reduced <- glm(
  fmd ~ sleep_hrs + exercise + smoker + sex + physhlth_days,
  data   = brfss_logistic,
  family = binomial
)

anova(mod_lab_reduced, mod_lab_full, test = "LRT") |>
  kable(
    digits  = 3,
    caption = "LR Test: Does adding income_cat improve the model?"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does adding income_cat improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4994	3780.971	NA	NA	NA
4993	3762.018	1	18.953	0

2c. Comparing Wald and LR test conclusions (5 pts)

Both the Wald test (z = −4.386, p < 0.001) and the LR test (χ² = 18.953, df = 1, p < 0.001) reach the same conclusion: income_cat is a statistically significant predictor of FMD and its removal meaningfully worsens model fit. In large samples like ours (n = 5,000), the two tests are asymptotically equivalent and typically agree. However, the two tests can diverge in small samples or when a coefficient estimate is very large — a sign of complete or near-complete separation. The Wald test relies on the approximate normality of the sampling distribution of β̂, which breaks down under these conditions, whereas the LR test — based on the ratio of likelihoods — is more stable and is therefore the preferred approach in small-sample or extreme-coefficient situations.

Task 3: Test an Interaction (20 points)

3a. Fit an interaction model — Smoking × Sex (5 pts)

We test whether the effect of smoking status on FMD differs by sex, a substantively motivated hypothesis given documented sex differences in tobacco’s mental health impact.

mod_lab_interact <- glm(
  fmd ~ smoker * sex + sleep_hrs + exercise + income_cat + physhlth_days,
  data   = brfss_logistic,
  family = binomial
)

mod_lab_interact |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  bold_p() |>
  modify_caption("**Model with Smoking × Sex Interaction**")

**Model with Smoking × Sex Interaction**
Characteristic	OR	95% CI	p-value
smoker
Former/Never	—	—
Current	1.47	1.15, 1.89	0.002
sex
Male	—	—
Female	1.48	1.18, 1.85	<0.001
sleep_hrs	0.82	0.78, 0.87	<0.001
exercise
No	—	—
Yes	0.92	0.76, 1.10	0.4
income_cat	0.92	0.88, 0.95	<0.001
physhlth_days	1.06	1.05, 1.07	<0.001
smoker * sex
Current * Female	1.24	0.89, 1.73	0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

3b. LR test for the interaction (5 pts)

mod_lab_no_interact <- glm(
  fmd ~ smoker + sex + sleep_hrs + exercise + income_cat + physhlth_days,
  data   = brfss_logistic,
  family = binomial
)

anova(mod_lab_no_interact, mod_lab_interact, test = "LRT") |>
  kable(
    digits  = 3,
    caption = "LR Test for Smoking × Sex Interaction"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test for Smoking × Sex Interaction
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4993	3762.018	NA	NA	NA
4992	3760.428	1	1.59	0.207

3c. Visualization of the interaction (5 pts)

ggpredict(mod_lab_interact, terms = c("smoker", "sex")) |>
  plot() +
  labs(
    title    = "Predicted Probability of FMD by Smoking Status and Sex",
    subtitle = "Interaction model; ribbons = 95% confidence intervals",
    x        = "Smoking Status",
    y        = "Predicted Probability of FMD",
    color    = "Sex"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()

3d. Interpretation of the interaction (5 pts)

ggpredict(mod_lab_interact, terms = c("smoker", "sex")) |>
  as_tibble() |>
  pivot_wider(id_cols = group, names_from = x, values_from = predicted) |>
  mutate(OR_current_vs_former = (`Current` / (1 - `Current`)) /
                                (`Former/Never` / (1 - `Former/Never`))) |>
  dplyr::select(Sex = group, OR_current_vs_former) |>
  kable(
    digits    = 3,
    col.names = c("Sex", "OR (Current vs. Former/Never Smoker)"),
    caption   = "Stratum-Specific Odds Ratios for Smoking by Sex"
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Stratum-Specific Odds Ratios for Smoking by Sex
Sex	OR (Current vs. Former/Never Smoker)
Male	1.473
Female	1.826

The LR test for the Smoking × Sex interaction is not statistically significant (χ² = 1.590, df = 1, p = 0.207), meaning we do not have sufficient evidence to conclude that the effect of smoking on FMD differs between males and females in this sample. The interaction plot shows approximately parallel predicted probability lines for males and females across smoking categories, consistent with a lack of statistically meaningful effect modification. The stratum-specific ORs are numerically similar — 1.473 for males and 1.826 for females — suggesting a slightly stronger association among females, but the confidence intervals overlap substantially and this difference is not statistically distinguishable from chance. Given the non-significant interaction test, the simpler main-effects model is preferred for parsimony.

Task 4: Goodness-of-Fit and Discrimination (25 points)

4a. McFadden’s Pseudo-R² (5 pts)

performance::r2_mcfadden(mod_lab_full)

## # R2 for Generalized Linear Regression
##        R2: 0.115
##   adj. R2: 0.115

Interpretation: McFadden’s R² = 0.115 measures the proportional improvement in log-likelihood from the null (intercept-only) model to the fitted six-predictor model. This value should not be interpreted on the same scale as R² from linear regression, values are inherently much smaller. A McFadden R² of 0.10–0.20 indicates a reasonable fit, and 0.20–0.40 is considered excellent. Our value of 0.115 falls in the lower-reasonable range, suggesting the six predictors collectively capture a meaningful share of the variation in FMD probability, though substantial residual unexplained variation remains, as is expected in population-level behavioral health survey data.

4b. Hosmer-Lemeshow Goodness-of-Fit Test (5 pts)

hl_lab <- hoslem.test(
  x = as.numeric(brfss_logistic$fmd) - 1,
  y = fitted(mod_lab_full),
  g = 10
)

hl_lab

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_logistic$fmd) - 1, fitted(mod_lab_full)
## X-squared = 16.297, df = 8, p-value = 0.03832

tibble(
  Statistic = round(hl_lab$statistic, 3),
  df        = hl_lab$parameter,
  `p-value` = format.pval(hl_lab$p.value, digits = 3, eps = 0.001)
) |>
  kable(caption = "Hosmer-Lemeshow Goodness-of-Fit Test") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Hosmer-Lemeshow Goodness-of-Fit Test
Statistic	df	p-value
16.297	8	0.0383

Interpretation: The Hosmer-Lemeshow test evaluates whether observed event proportions within deciles of predicted probability agree with the model’s predicted probabilities. Our test statistic is χ² = 16.297 (df = 8, p = 0.038), which is statistically significant at α = 0.05, technically indicating some miscalibration. However, with n = 5,000, the HL test has substantial power and may detect statistically significant departures that are numerically trivial in practice. This result should therefore be interpreted alongside the calibration plot (Task 4c): if points cluster closely around the 45-degree diagonal, the misfit detected by the HL test is unlikely to be of practical concern.

4c. Calibration Plot (5 pts)

brfss_logistic |>
  mutate(
    pred_prob   = fitted(mod_lab_full),
    obs_outcome = as.numeric(fmd) - 1,
    decile      = ntile(pred_prob, 10)
  ) |>
  group_by(decile) |>
  summarise(
    mean_pred = mean(pred_prob),
    mean_obs  = mean(obs_outcome),
    n         = n(),
    .groups   = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed",
              linewidth = 0.8) +
  geom_point(aes(size = n), color = "steelblue", alpha = 0.85) +
  geom_line(color = "steelblue", linewidth = 0.7) +
  scale_size_continuous(name = "Decile n", range = c(3, 8)) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title    = "Calibration Plot: Observed vs. Predicted Probability of FMD",
    subtitle = "Each point = one predicted probability decile; dashed line = perfect calibration",
    x        = "Mean Predicted Probability (within decile)",
    y        = "Observed Proportion with FMD (within decile)"
  ) +
  theme_minimal()

Interpretation: The calibration plot shows mean predicted probability (x-axis) against observed FMD proportion (y-axis) across ten deciles of predicted risk. Points falling near the red 45-degree reference line indicate good calibration. Points clustering closely along the diagonal across the lower and middle deciles suggest reasonable calibration across the most common range of predicted probabilities. Overall, despite the statistically significant HL test result (Task 4b), the calibration plot suggests that predicted probabilities are reasonably aligned with observed rates across the risk spectrum, consistent with the expectation that the HL test is over-powered at n = 5,000.

4d. ROC Curve and AUC (10 pts)

roc_lab <- roc(
  response  = brfss_logistic$fmd,
  predictor = fitted(mod_lab_full),
  levels    = c("No", "Yes"),
  direction = "<"
)

auc_lab <- auc(roc_lab)

ggroc(roc_lab, color = "#2166ac", linewidth = 1.3) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed",
              color = "red", linewidth = 0.8) +
  annotate("text", x = 0.35, y = 0.15,
           label = paste0("AUC = ", round(auc_lab, 3)),
           size = 5, color = "#2166ac", fontface = "bold") +
  labs(
    title    = "ROC Curve for Frequent Mental Distress — Lab Model",
    subtitle = "Blue curve = model; red dashed = chance (AUC = 0.50)",
    x        = "Specificity (1 − False Positive Rate)",
    y        = "Sensitivity (True Positive Rate)"
  ) +
  theme_minimal()

ci_auc <- ci.auc(roc_lab)

tibble(
  AUC            = round(as.numeric(auc_lab), 3),
  `95% CI Lower` = round(ci_auc[1], 3),
  `95% CI Upper` = round(ci_auc[3], 3)
) |>
  kable(caption = "AUC with 95% Bootstrap Confidence Interval") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

AUC with 95% Bootstrap Confidence Interval
AUC	95% CI Lower	95% CI Upper
0.736	0.716	0.757

Interpretation: The model achieved an AUC of 0.736 (95% CI: 0.716–0.757). The AUC quantifies the probability that a randomly selected individual with FMD receives a higher predicted probability than a randomly selected individual without FMD. An AUC of 0.50 represents no discrimination (coin flip); 1.0 is perfect discrimination. Our AUC of 0.736 falls in the acceptable discrimination range (0.70–0.80), meaning the model performs meaningfully better than chance and the six behavioral and socioeconomic predictors capture a substantial portion of the variation in FMD risk. Importantly, AUC is a rank-order metric independent of calibration, a model with acceptable AUC can still be poorly calibrated, and vice versa. For public health applications, both discrimination (AUC) and calibration (HL test + calibration plot) should always be reported together to give a complete picture of model performance.

Lab Summary

Task	Key Finding
Task 1	Multiple logistic regression with six predictors. Significant adjusted ORs for sleep hours, smoking, sex, income, and physically unhealthy days; exercise was not significant (p = 0.334).
Task 2	Wald (p < 0.001) and LR (χ² = 18.953, p < 0.001) tests both confirm `income_cat` significantly improves model fit. Both tests agree; LR preferred in small samples or with extreme coefficients.
Task 3	Smoking × Sex interaction not statistically significant (LR p = 0.207). Stratum-specific ORs: Male = 1.47, Female = 1.83 — numerically different but not distinguishable from chance. Main-effects model preferred.
Task 4	McFadden R² = 0.115 (reasonable fit); HL p = 0.038 (likely over-powered at n = 5,000); calibration plot shows points tracking near the 45° line; AUC = 0.736 — acceptable discrimination.

Overall conclusion: After multivariable adjustment, shorter sleep duration, current smoking, female sex, lower income, and more physically unhealthy days all independently increase the odds of frequent mental distress. The model demonstrates acceptable discrimination (AUC = 0.736) and reasonable calibration, pointing to sleep, smoking cessation, and economic support as high-priority targets for population-level mental health promotion.

End of Lab Activity

Logistic Regression — Part 2

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Marina Frimpong

April 14, 2026