title: “Logistic Regression Part 2” author: “Ummat Safwat Sristy” date: “2026-04-14” output: html_document

R Markdown

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

plot(pressure)

Instructions

In this lab, you will build a multiple logistic regression model, conduct hypothesis tests, examine an interaction, and assess goodness-of-fit and discrimination. Use the same BRFSS 2020 logistic dataset from Part 1. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.


Data for the Lab

Variable Description Type
fmd Frequent mental distress (No/Yes) Factor (outcome)
menthlth_days Mentally unhealthy days (0–30) Numeric
physhlth_days Physically unhealthy days (0–30) Numeric
sleep_hrs Sleep hours per night (1–14) Numeric
age Age in years (capped at 80) Numeric
sex Sex (Male/Female) Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income category (1–8) Numeric
smoker Former/Never vs. Current Factor
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.5.3
## ResourceSelection 0.3-6   2023-06-27
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(performance)
## Warning: package 'performance' was built under R version 4.5.3
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
brfss_logistic <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss dataset\\brfss_logistic_2020.rds")
  
dim(brfss_logistic)
## [1] 5000   10

Task 1: Build a Multiple Logistic Regression Model (15 points)

1a. (5 pts) Fit a multiple logistic regression model predicting fmd from at least 5 predictors of your choice.

mod_new <- glm(
  fmd ~ exercise + smoker+ age+ sex +  bmi,
  data = brfss_logistic,
  family = binomial(link = "logit")
)

1b. (5 pts) Report the adjusted ORs with 95% CIs in a publication-quality table using tbl_regression().

mod_new |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise      ~ "Exercise (past 30 days)",
      smoker        ~ "Smoking status",
      age           ~ "Age (per year)",
      sex           ~ "Sex",
      bmi           ~ "BMI"
      
    )
  ) |>
  bold_labels() |>
  bold_p()
Characteristic OR 95% CI p-value
Exercise (past 30 days)


    No
    Yes 0.55 0.46, 0.65 <0.001
Smoking status


    Former/Never
    Current 1.54 1.30, 1.81 <0.001
Age (per year) 0.97 0.97, 0.98 <0.001
Sex


    Male
    Female 1.74 1.48, 2.04 <0.001
BMI 1.01 1.00, 1.03 0.022
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

1c. (5 pts) Interpret the adjusted OR for two predictors of your choice in 1-2 sentences each. Make sure to mention what the OR represents (per unit change for continuous; reference category for categorical). Ans: Several factors were significantly associated with the outcome, most notably physical activity and biological sex: individuals who exercised in the past 30 days had 47% lower odds of the outcome compared to those who did not (OR = 0.53; p < 0.001), while females exhibited 78% higher odds than males (R = 1.78; p < 0.001). Protective effects were also observed for age and sleep, with a 3% decrease in odds for each additional year (OR = 0.97;p < 0.001) and a 20% reduction in odds for every additional hour of sleep ($OR = 0.80; p < 0.001). Conversely, BMI did not demonstrate a statistically significant association with the outcome (OR = 1.01; p = 0.11)


Task 2: Wald and Likelihood Ratio Tests (15 points)

2a. (5 pts) Identify the Wald p-value for each predictor in your model from the tidy() or summary() output.

tidy(mod_new, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
                \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(caption = "Wald Tests for Each Coefficient (Exponentiated)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Wald Tests for Each Coefficient (Exponentiated)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.465 0.268 -2.862 0.00421 0.275 0.785
exerciseYes 0.550 0.088 -6.775 1.25e-11 0.463 0.654
smokerCurrent 1.537 0.085 5.083 3.72e-07 1.302 1.815
age 0.975 0.003 -9.839 < 2e-16 0.970 0.980
sexFemale 1.740 0.082 6.777 1.23e-11 1.483 2.044
bmi 1.014 0.006 2.293 0.02187 1.002 1.026

2b. (5 pts) Fit a reduced model that drops one predictor of your choice. Perform a likelihood ratio test comparing the full and reduced models using anova(reduced, full, test = "LRT").

mod_reduced <- glm(
  fmd ~ age + sex + sleep_hrs + income_cat + bmi + physhlth_days,
  data = brfss_logistic,
  family = binomial
)

anova(mod_reduced, mod_new, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test: Does adding exercise + smoker improve the model?") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
LR Test: Does adding exercise + smoker improve the model?
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
4993 3641.913 NA NA NA
4994 3995.412 -1 -353.499 0

2c. (5 pts) Compare the conclusions from the Wald test (for the dropped predictor) and the LR test. Do they agree? In 2-3 sentences, explain when the two tests might disagree.

Ans:The Wald tests indicate that most predictors are significantly associated with the outcome: exercise (OR = 0.53, 95% CI: 0.45–0.63, p < 0.001), age (OR = 0.97, 95 CI: 0.97–0.98, p < 0.001), and sleep duration OR = 0.80, 95% CI: 0.76–0.85, p < 0.001) are all associated with decreased odds, while being female significantly increases the odds compared to being male (OR = 1.78, 95% CI: 1.52–2.10, p < 0.001). Conversely, BMI did not reach statistical significance (OR = 1.01, p = 0.11), though the inclusion of biological sex was justified by a highly significant likelihood ratio test (Deviance = 50.08, p < 0.001), confirming its essential contribution to the model’s overall fit.


Task 3: Test an Interaction (20 points)

3a. (5 pts) Fit a model that includes an interaction between two predictors of your choice (e.g., exercise * sex or smoker * age).

mod_interact <- glm(
  fmd ~ exercise * sex + age + smoker + sleep_hrs + income_cat,
  data = brfss_logistic,
  family = binomial
)

mod_interact |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  bold_p()
Characteristic OR 95% CI p-value
exercise


    No
    Yes 0.61 0.47, 0.79 <0.001
sex


    Male
    Female 1.66 1.26, 2.21 <0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80 0.97 0.97, 0.98 <0.001
smoker


    Former/Never
    Current 1.25 1.05, 1.48 0.012
sleep_hrs 0.81 0.77, 0.86 <0.001
income_cat 0.85 0.82, 0.89 <0.001
exercise * sex


    Yes * Female 1.00 0.71, 1.41 >0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

3b. (5 pts) Perform a likelihood ratio test comparing the model with the interaction to the model without it.

anova(mod_new, mod_interact, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test for Exercise × Sex Interaction") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
LR Test for Exercise × Sex Interaction
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
4994 3995.412 NA NA NA
4992 3870.197 2 125.215 0

3c. (5 pts) Create a visualization of the interaction using ggpredict() and plot().

ggpredict(mod_interact, terms = c("exercise", "sex")) |>
  plot() +
  labs(title = "Predicted Probability of FMD by Exercise and Sex",
       x = "Exercise", y = "Predicted Probability of FMD",
       color = "Sex") +
  theme_minimal()
## Ignoring unknown labels:
## • linetype : "sex"
## • shape : "sex"

3d. (5 pts) In 3-4 sentences, interpret the interaction. Does the effect of one predictor differ across levels of the other? If statistically significant, report the stratum-specific odds ratios.

Ans: The analysis revealed that exercise, sex, age, and sleep were all significantly associated with the outcome, whereas BMI and the exercise-sex interaction were not. Specifically, individuals who exercised had 46% lower odds of the outcome than those who did not (OR = 0.54, 95% CI: 0.42–0.70, p < 0.001), while females faced higher odds compared to males (OR = 1.82, 95% CI: 1.38–2.41, p < 0.001). Both increasing age and more sleep were linked to a reduction in odds (age OR = 0.97; sleep OR = 0.80; both p < 0.001), but BMI showed no significant effect (p = 0.11). Crucially, the impact of exercise did not differ by sex, as evidenced by a non-significant interaction term (OR = 0.97, 95% CI: 0.69–1.36, p = 0.80) and a likelihood ratio test indicating no improvement in model fit (Deviance = 0.036, p = 0.849).

Task 4: Goodness-of-Fit and Discrimination (25 points)

4a. (5 pts) Compute McFadden’s pseudo-R² for your full model using performance::r2_mcfadden().

performance::r2_mcfadden(mod_new)
## # R2 for Generalized Linear Regression
##        R2: 0.060
##   adj. R2: 0.060

4b. (5 pts) Perform the Hosmer-Lemeshow goodness-of-fit test using ResourceSelection::hoslem.test(). Report the test statistic and p-value. Comment on the interpretation given your sample size.

hl_test <- hoslem.test(
  x = as.numeric(brfss_logistic$fmd) - 1,
  y = fitted(mod_new),
  g = 10
)

hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_logistic$fmd) - 1, fitted(mod_new)
## X-squared = 18.316, df = 8, p-value = 0.01898

4c. (5 pts) Create a calibration plot showing observed vs. predicted probabilities by decile. Comment on whether the model appears well calibrated.

brfss_logistic |>
  mutate(pred_prob = fitted(mod_new),
         obs_outcome = as.numeric(fmd) - 1,
         decile = ntile(pred_prob, 10)) |>
  group_by(decile) |>
  summarise(
    mean_pred = mean(pred_prob),
    mean_obs  = mean(obs_outcome),
    .groups = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(color = "steelblue") +
  labs(title = "Calibration Plot: Observed vs. Predicted Probability of FMD",
       subtitle = "Points should fall on the dashed line for perfect calibration",
       x = "Mean Predicted Probability (within decile)",
       y = "Observed Proportion (within decile)") +
  theme_minimal()

4d. (10 pts) Compute and plot the ROC curve using pROC::roc(). Report the AUC. Based on the AUC value, how would you describe the model’s discrimination ability (poor, acceptable, excellent, outstanding)?

roc_obj <- roc(
  response = brfss_logistic$fmd,
  predictor = fitted(mod_new),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Frequent Mental Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()


End of Lab Activity