Lab12_LogisticsRegression_Part2

title: “Logistic Regression Part 2” author: “Ummat Safwat Sristy” date: “2026-04-14” output: html_document

R Markdown

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

plot(pressure)

Instructions

In this lab, you will build a multiple logistic regression model, conduct hypothesis tests, examine an interaction, and assess goodness-of-fit and discrimination. Use the same BRFSS 2020 logistic dataset from Part 1. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Data for the Lab

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0–30)	Numeric
`physhlth_days`	Physically unhealthy days (0–30)	Numeric
`sleep_hrs`	Sleep hours per night (1–14)	Numeric
`age`	Age in years (capped at 80)	Numeric
`sex`	Sex (Male/Female)	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1–8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(ResourceSelection)

## Warning: package 'ResourceSelection' was built under R version 4.5.3

## ResourceSelection 0.3-6   2023-06-27

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.3

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(performance)

## Warning: package 'performance' was built under R version 4.5.3

library(haven)

## Warning: package 'haven' was built under R version 4.5.2

brfss_logistic <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss dataset\\brfss_logistic_2020.rds")
  
dim(brfss_logistic)

## [1] 5000   10

Task 1: Build a Multiple Logistic Regression Model (15 points)

1a. (5 pts) Fit a multiple logistic regression model predicting fmd from at least 5 predictors of your choice.

mod_new <- glm(
  fmd ~ exercise + smoker+ age+ sex +  bmi,
  data = brfss_logistic,
  family = binomial(link = "logit")
)

1b. (5 pts) Report the adjusted ORs with 95% CIs in a publication-quality table using tbl_regression().

mod_new |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      exercise      ~ "Exercise (past 30 days)",
      smoker        ~ "Smoking status",
      age           ~ "Age (per year)",
      sex           ~ "Sex",
      bmi           ~ "BMI"
      
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Exercise (past 30 days)
No	—	—
Yes	0.55	0.46, 0.65	<0.001
Smoking status
Former/Never	—	—
Current	1.54	1.30, 1.81	<0.001
Age (per year)	0.97	0.97, 0.98	<0.001
Sex
Male	—	—
Female	1.74	1.48, 2.04	<0.001
BMI	1.01	1.00, 1.03	0.022
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

1c. (5 pts) Interpret the adjusted OR for two predictors of your choice in 1-2 sentences each. Make sure to mention what the OR represents (per unit change for continuous; reference category for categorical). Ans: Several factors were significantly associated with the outcome, most notably physical activity and biological sex: individuals who exercised in the past 30 days had 47% lower odds of the outcome compared to those who did not (OR = 0.53; p < 0.001), while females exhibited 78% higher odds than males (R = 1.78; p < 0.001). Protective effects were also observed for age and sleep, with a 3% decrease in odds for each additional year (OR = 0.97;p < 0.001) and a 20% reduction in odds for every additional hour of sleep ($OR = 0.80; p < 0.001). Conversely, BMI did not demonstrate a statistically significant association with the outcome (OR = 1.01; p = 0.11)

Task 2: Wald and Likelihood Ratio Tests (15 points)

2a. (5 pts) Identify the Wald p-value for each predictor in your model from the tidy() or summary() output.

tidy(mod_new, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
                \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(caption = "Wald Tests for Each Coefficient (Exponentiated)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Wald Tests for Each Coefficient (Exponentiated)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.465	0.268	-2.862	0.00421	0.275	0.785
exerciseYes	0.550	0.088	-6.775	1.25e-11	0.463	0.654
smokerCurrent	1.537	0.085	5.083	3.72e-07	1.302	1.815
age	0.975	0.003	-9.839	< 2e-16	0.970	0.980
sexFemale	1.740	0.082	6.777	1.23e-11	1.483	2.044
bmi	1.014	0.006	2.293	0.02187	1.002	1.026

2b. (5 pts) Fit a reduced model that drops one predictor of your choice. Perform a likelihood ratio test comparing the full and reduced models using anova(reduced, full, test = "LRT").

mod_reduced <- glm(
  fmd ~ age + sex + sleep_hrs + income_cat + bmi + physhlth_days,
  data = brfss_logistic,
  family = binomial
)

anova(mod_reduced, mod_new, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test: Does adding exercise + smoker improve the model?") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does adding exercise + smoker improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4993	3641.913	NA	NA	NA
4994	3995.412	-1	-353.499	0

2c. (5 pts) Compare the conclusions from the Wald test (for the dropped predictor) and the LR test. Do they agree? In 2-3 sentences, explain when the two tests might disagree.

Ans:The Wald tests indicate that most predictors are significantly associated with the outcome: exercise (OR = 0.53, 95% CI: 0.45–0.63, p < 0.001), age (OR = 0.97, 95 CI: 0.97–0.98, p < 0.001), and sleep duration OR = 0.80, 95% CI: 0.76–0.85, p < 0.001) are all associated with decreased odds, while being female significantly increases the odds compared to being male (OR = 1.78, 95% CI: 1.52–2.10, p < 0.001). Conversely, BMI did not reach statistical significance (OR = 1.01, p = 0.11), though the inclusion of biological sex was justified by a highly significant likelihood ratio test (Deviance = 50.08, p < 0.001), confirming its essential contribution to the model’s overall fit.

Task 3: Test an Interaction (20 points)

3a. (5 pts) Fit a model that includes an interaction between two predictors of your choice (e.g., exercise * sex or smoker * age).

mod_interact <- glm(
  fmd ~ exercise * sex + age + smoker + sleep_hrs + income_cat,
  data = brfss_logistic,
  family = binomial
)

mod_interact |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
exercise
No	—	—
Yes	0.61	0.47, 0.79	<0.001
sex
Male	—	—
Female	1.66	1.26, 2.21	<0.001
IMPUTED AGE VALUE COLLAPSED ABOVE 80	0.97	0.97, 0.98	<0.001
smoker
Former/Never	—	—
Current	1.25	1.05, 1.48	0.012
sleep_hrs	0.81	0.77, 0.86	<0.001
income_cat	0.85	0.82, 0.89	<0.001
exercise * sex
Yes * Female	1.00	0.71, 1.41	>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

3b. (5 pts) Perform a likelihood ratio test comparing the model with the interaction to the model without it.

anova(mod_new, mod_interact, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test for Exercise × Sex Interaction") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test for Exercise × Sex Interaction
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
4994	3995.412	NA	NA	NA
4992	3870.197	2	125.215	0

3c. (5 pts) Create a visualization of the interaction using ggpredict() and plot().

ggpredict(mod_interact, terms = c("exercise", "sex")) |>
  plot() +
  labs(title = "Predicted Probability of FMD by Exercise and Sex",
       x = "Exercise", y = "Predicted Probability of FMD",
       color = "Sex") +
  theme_minimal()

## Ignoring unknown labels:
## • linetype : "sex"
## • shape : "sex"

3d. (5 pts) In 3-4 sentences, interpret the interaction. Does the effect of one predictor differ across levels of the other? If statistically significant, report the stratum-specific odds ratios.

Ans: The analysis revealed that exercise, sex, age, and sleep were all significantly associated with the outcome, whereas BMI and the exercise-sex interaction were not. Specifically, individuals who exercised had 46% lower odds of the outcome than those who did not (OR = 0.54, 95% CI: 0.42–0.70, p < 0.001), while females faced higher odds compared to males (OR = 1.82, 95% CI: 1.38–2.41, p < 0.001). Both increasing age and more sleep were linked to a reduction in odds (age OR = 0.97; sleep OR = 0.80; both p < 0.001), but BMI showed no significant effect (p = 0.11). Crucially, the impact of exercise did not differ by sex, as evidenced by a non-significant interaction term (OR = 0.97, 95% CI: 0.69–1.36, p = 0.80) and a likelihood ratio test indicating no improvement in model fit (Deviance = 0.036, p = 0.849).

Task 4: Goodness-of-Fit and Discrimination (25 points)

4a. (5 pts) Compute McFadden’s pseudo-R² for your full model using performance::r2_mcfadden().

performance::r2_mcfadden(mod_new)

## # R2 for Generalized Linear Regression
##        R2: 0.060
##   adj. R2: 0.060

4b. (5 pts) Perform the Hosmer-Lemeshow goodness-of-fit test using ResourceSelection::hoslem.test(). Report the test statistic and p-value. Comment on the interpretation given your sample size.

hl_test <- hoslem.test(
  x = as.numeric(brfss_logistic$fmd) - 1,
  y = fitted(mod_new),
  g = 10
)

hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_logistic$fmd) - 1, fitted(mod_new)
## X-squared = 18.316, df = 8, p-value = 0.01898

4c. (5 pts) Create a calibration plot showing observed vs. predicted probabilities by decile. Comment on whether the model appears well calibrated.

brfss_logistic |>
  mutate(pred_prob = fitted(mod_new),
         obs_outcome = as.numeric(fmd) - 1,
         decile = ntile(pred_prob, 10)) |>
  group_by(decile) |>
  summarise(
    mean_pred = mean(pred_prob),
    mean_obs  = mean(obs_outcome),
    .groups = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(color = "steelblue") +
  labs(title = "Calibration Plot: Observed vs. Predicted Probability of FMD",
       subtitle = "Points should fall on the dashed line for perfect calibration",
       x = "Mean Predicted Probability (within decile)",
       y = "Observed Proportion (within decile)") +
  theme_minimal()

4d. (10 pts) Compute and plot the ROC curve using pROC::roc(). Report the AUC. Based on the AUC value, how would you describe the model’s discrimination ability (poor, acceptable, excellent, outstanding)?

roc_obj <- roc(
  response = brfss_logistic$fmd,
  predictor = fitted(mod_new),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Frequent Mental Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()

End of Lab Activity