title: “Logistic Regression Part 2” author: “Ummat Safwat Sristy” date: “2026-04-14” output: html_document
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
plot(pressure)
In this lab, you will build a multiple logistic regression model, conduct hypothesis tests, examine an interaction, and assess goodness-of-fit and discrimination. Use the same BRFSS 2020 logistic dataset from Part 1. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
| Variable | Description | Type |
|---|---|---|
fmd |
Frequent mental distress (No/Yes) | Factor (outcome) |
menthlth_days |
Mentally unhealthy days (0–30) | Numeric |
physhlth_days |
Physically unhealthy days (0–30) | Numeric |
sleep_hrs |
Sleep hours per night (1–14) | Numeric |
age |
Age in years (capped at 80) | Numeric |
sex |
Sex (Male/Female) | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income category (1–8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.5.3
## ResourceSelection 0.3-6 2023-06-27
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(performance)
## Warning: package 'performance' was built under R version 4.5.3
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
brfss_logistic <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss dataset\\brfss_logistic_2020.rds")
dim(brfss_logistic)
## [1] 5000 10
1a. (5 pts) Fit a multiple logistic regression model
predicting fmd from at least 5 predictors of your
choice.
mod_new <- glm(
fmd ~ exercise + smoker+ age+ sex + bmi,
data = brfss_logistic,
family = binomial(link = "logit")
)
1b. (5 pts) Report the adjusted ORs with 95% CIs in
a publication-quality table using tbl_regression().
mod_new |>
tbl_regression(
exponentiate = TRUE,
label = list(
exercise ~ "Exercise (past 30 days)",
smoker ~ "Smoking status",
age ~ "Age (per year)",
sex ~ "Sex",
bmi ~ "BMI"
)
) |>
bold_labels() |>
bold_p()
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Exercise (past 30 days) | |||
| No | — | — | |
| Yes | 0.55 | 0.46, 0.65 | <0.001 |
| Smoking status | |||
| Former/Never | — | — | |
| Current | 1.54 | 1.30, 1.81 | <0.001 |
| Age (per year) | 0.97 | 0.97, 0.98 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.74 | 1.48, 2.04 | <0.001 |
| BMI | 1.01 | 1.00, 1.03 | 0.022 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
1c. (5 pts) Interpret the adjusted OR for two predictors of your choice in 1-2 sentences each. Make sure to mention what the OR represents (per unit change for continuous; reference category for categorical). Ans: Several factors were significantly associated with the outcome, most notably physical activity and biological sex: individuals who exercised in the past 30 days had 47% lower odds of the outcome compared to those who did not (OR = 0.53; p < 0.001), while females exhibited 78% higher odds than males (R = 1.78; p < 0.001). Protective effects were also observed for age and sleep, with a 3% decrease in odds for each additional year (OR = 0.97;p < 0.001) and a 20% reduction in odds for every additional hour of sleep ($OR = 0.80; p < 0.001). Conversely, BMI did not demonstrate a statistically significant association with the outcome (OR = 1.01; p = 0.11)
2a. (5 pts) Identify the Wald p-value for each
predictor in your model from the tidy() or
summary() output.
tidy(mod_new, conf.int = TRUE, exponentiate = TRUE) |>
mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
\(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(caption = "Wald Tests for Each Coefficient (Exponentiated)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.465 | 0.268 | -2.862 | 0.00421 | 0.275 | 0.785 |
| exerciseYes | 0.550 | 0.088 | -6.775 | 1.25e-11 | 0.463 | 0.654 |
| smokerCurrent | 1.537 | 0.085 | 5.083 | 3.72e-07 | 1.302 | 1.815 |
| age | 0.975 | 0.003 | -9.839 | < 2e-16 | 0.970 | 0.980 |
| sexFemale | 1.740 | 0.082 | 6.777 | 1.23e-11 | 1.483 | 2.044 |
| bmi | 1.014 | 0.006 | 2.293 | 0.02187 | 1.002 | 1.026 |
2b. (5 pts) Fit a reduced model that drops one
predictor of your choice. Perform a likelihood ratio test comparing the
full and reduced models using
anova(reduced, full, test = "LRT").
mod_reduced <- glm(
fmd ~ age + sex + sleep_hrs + income_cat + bmi + physhlth_days,
data = brfss_logistic,
family = binomial
)
anova(mod_reduced, mod_new, test = "LRT") |>
kable(digits = 3,
caption = "LR Test: Does adding exercise + smoker improve the model?") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 4993 | 3641.913 | NA | NA | NA |
| 4994 | 3995.412 | -1 | -353.499 | 0 |
2c. (5 pts) Compare the conclusions from the Wald test (for the dropped predictor) and the LR test. Do they agree? In 2-3 sentences, explain when the two tests might disagree.
Ans:The Wald tests indicate that most predictors are significantly associated with the outcome: exercise (OR = 0.53, 95% CI: 0.45–0.63, p < 0.001), age (OR = 0.97, 95 CI: 0.97–0.98, p < 0.001), and sleep duration OR = 0.80, 95% CI: 0.76–0.85, p < 0.001) are all associated with decreased odds, while being female significantly increases the odds compared to being male (OR = 1.78, 95% CI: 1.52–2.10, p < 0.001). Conversely, BMI did not reach statistical significance (OR = 1.01, p = 0.11), though the inclusion of biological sex was justified by a highly significant likelihood ratio test (Deviance = 50.08, p < 0.001), confirming its essential contribution to the model’s overall fit.
3a. (5 pts) Fit a model that includes an interaction
between two predictors of your choice (e.g., exercise * sex
or smoker * age).
mod_interact <- glm(
fmd ~ exercise * sex + age + smoker + sleep_hrs + income_cat,
data = brfss_logistic,
family = binomial
)
mod_interact |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
bold_p()
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| exercise | |||
| No | — | — | |
| Yes | 0.61 | 0.47, 0.79 | <0.001 |
| sex | |||
| Male | — | — | |
| Female | 1.66 | 1.26, 2.21 | <0.001 |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 0.97 | 0.97, 0.98 | <0.001 |
| smoker | |||
| Former/Never | — | — | |
| Current | 1.25 | 1.05, 1.48 | 0.012 |
| sleep_hrs | 0.81 | 0.77, 0.86 | <0.001 |
| income_cat | 0.85 | 0.82, 0.89 | <0.001 |
| exercise * sex | |||
| Yes * Female | 1.00 | 0.71, 1.41 | >0.9 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
3b. (5 pts) Perform a likelihood ratio test comparing the model with the interaction to the model without it.
anova(mod_new, mod_interact, test = "LRT") |>
kable(digits = 3,
caption = "LR Test for Exercise × Sex Interaction") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 4994 | 3995.412 | NA | NA | NA |
| 4992 | 3870.197 | 2 | 125.215 | 0 |
3c. (5 pts) Create a visualization of the
interaction using ggpredict() and plot().
ggpredict(mod_interact, terms = c("exercise", "sex")) |>
plot() +
labs(title = "Predicted Probability of FMD by Exercise and Sex",
x = "Exercise", y = "Predicted Probability of FMD",
color = "Sex") +
theme_minimal()
## Ignoring unknown labels:
## • linetype : "sex"
## • shape : "sex"
3d. (5 pts) In 3-4 sentences, interpret the interaction. Does the effect of one predictor differ across levels of the other? If statistically significant, report the stratum-specific odds ratios.
4a. (5 pts) Compute McFadden’s pseudo-R² for your
full model using performance::r2_mcfadden().
performance::r2_mcfadden(mod_new)
## # R2 for Generalized Linear Regression
## R2: 0.060
## adj. R2: 0.060
4b. (5 pts) Perform the Hosmer-Lemeshow
goodness-of-fit test using
ResourceSelection::hoslem.test(). Report the test statistic
and p-value. Comment on the interpretation given your sample size.
hl_test <- hoslem.test(
x = as.numeric(brfss_logistic$fmd) - 1,
y = fitted(mod_new),
g = 10
)
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(brfss_logistic$fmd) - 1, fitted(mod_new)
## X-squared = 18.316, df = 8, p-value = 0.01898
4c. (5 pts) Create a calibration plot showing observed vs. predicted probabilities by decile. Comment on whether the model appears well calibrated.
brfss_logistic |>
mutate(pred_prob = fitted(mod_new),
obs_outcome = as.numeric(fmd) - 1,
decile = ntile(pred_prob, 10)) |>
group_by(decile) |>
summarise(
mean_pred = mean(pred_prob),
mean_obs = mean(obs_outcome),
.groups = "drop"
) |>
ggplot(aes(x = mean_pred, y = mean_obs)) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
geom_point(size = 3, color = "steelblue") +
geom_line(color = "steelblue") +
labs(title = "Calibration Plot: Observed vs. Predicted Probability of FMD",
subtitle = "Points should fall on the dashed line for perfect calibration",
x = "Mean Predicted Probability (within decile)",
y = "Observed Proportion (within decile)") +
theme_minimal()
4d. (10 pts) Compute and plot the ROC curve using
pROC::roc(). Report the AUC. Based on the AUC value, how
would you describe the model’s discrimination ability (poor, acceptable,
excellent, outstanding)?
roc_obj <- roc(
response = brfss_logistic$fmd,
predictor = fitted(mod_new),
levels = c("No", "Yes"),
direction = "<"
)
auc_value <- auc(roc_obj)
ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
labs(title = "ROC Curve for Frequent Mental Distress Model",
subtitle = paste0("AUC = ", round(auc_value, 3)),
x = "Specificity", y = "Sensitivity") +
theme_minimal()
End of Lab Activity