EPI 553 — Logistic Regression Part 1 Lab Due: April 13, 2026
Complete the four tasks below using the BRFSS 2020 dataset
(brfss_logistic_2020.rds). Submit a knitted HTML file via
Brightspace. You may collaborate, but each student must submit their own
work.
| Variable | Description | Type |
|---|---|---|
fmd |
Frequent mental distress (No/Yes) | Factor (outcome) |
menthlth_days |
Mentally unhealthy days (0-30) | Numeric |
physhlth_days |
Physically unhealthy days (0-30) | Numeric |
sleep_hrs |
Hours of sleep per night | Numeric |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income category (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
brfss_logistic <- readRDS("C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_logistic.rds")
brfss_logistic |>
select(fmd) |>
tbl_summary(
type = list(fmd ~ "categorical"),
label = list(
fmd ~ "Frequent Mental Distress"
),
statistic = list(
all_categorical() ~ "{n} ({p}%)"
)) |>
add_n() |>
bold_labels() |>
modify_caption("**FMD Descriptive Statistics**")
| Characteristic | N | N = 5,0001 |
|---|---|---|
| Frequent Mental Distress | 5,000 | |
| No | 4,243 (85%) | |
| Yes | 757 (15%) | |
| 1 n (%) | ||
brfss_logistic |>
tbl_summary(
by = fmd,
include = c(physhlth_days, sleep_hrs, age, sex, bmi, exercise,
income_cat, smoker),
type = list(
c(physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})"
),
label = list(
physhlth_days ~ "Physical unhealthy days",
sleep_hrs ~ "Sleep hours",
age ~ "Age (years)",
sex ~ "Sex",
bmi ~ "BMI",
exercise ~ "Exercise in past 30 days",
income_cat ~ "Income category (1-8)",
smoker ~ "Smoking status"
)
) |>
add_overall() |>
add_p() |>
bold_labels()
| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value2 |
|---|---|---|---|---|
| Physical unhealthy days | 4 (9) | 3 (8) | 10 (13) | <0.001 |
| Sleep hours | 7.00 (1.48) | 7.09 (1.40) | 6.51 (1.83) | <0.001 |
| Age (years) | 56 (16) | 57 (16) | 50 (16) | <0.001 |
| Sex | <0.001 | |||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| BMI | 28.5 (6.3) | 28.4 (6.2) | 29.3 (7.0) | 0.001 |
| Exercise in past 30 days | 3,673 (73%) | 3,192 (75%) | 481 (64%) | <0.001 |
| Income category (1-8) | 5.85 (2.11) | 6.00 (2.04) | 5.05 (2.29) | <0.001 |
| Smoking status | <0.001 | |||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| 1 Mean (SD); n (%) | ||||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
brfss_logistic |>
ggplot(aes(x = fmd, fill= exercise)) +
geom_bar(color= "black", position= "dodge") +
geom_text(stat= "count", aes(label= ..count..), position= position_dodge(width=0.9))+
labs(
title = "Figure 1: FMD Distribution by Exercise Status",
x = "FMD",
y = "Count of each FMD Status (yes/no)",
caption = "Source: BRFSS"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.
1b. (5 pts) Create a descriptive summary table of at
least 4 predictors, stratified by FMD status. Use
tbl_summary().
1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.
# Fit a simple logistic model using exercise
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
family = binomial(link = "logit"))
# Report coefficients
tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1.337 | 0.068 | -19.769 | 0 | -1.471 | -1.206 |
| exerciseYes | -0.555 | 0.083 | -6.655 | 0 | -0.718 | -0.391 |
# Report coefficients
tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.263 | 0.068 | -19.769 | 0 | 0.230 | 0.299 |
| exerciseYes | 0.574 | 0.083 | -6.655 | 0 | 0.488 | 0.676 |
mod_age <- glm(fmd ~ age, data = brfss_logistic,
family = binomial(link = "logit"))
ggpredict(mod_age, terms = "age [18:80]") |>
plot() +
labs(title = "Predicted Probability of Frequent Mental Distress by Age",
x = "Age (years)", y = "Predicted Probability of FMD") +
theme_minimal()
2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.
2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.
2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question. Compared to those who do not exercise, those who do have 42.6% decreased odds of having frequent mental distress. 2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).
# Smoking
mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
family = binomial(link = "logit"))
# Income-cat
mod_income <- glm(fmd ~ income_cat, data = brfss_logistic,
family = binomial(link = "logit"))
#BMI
mod_bmi <- glm(fmd ~ bmi, data = brfss_logistic,
family = binomial(link = "logit"))
tidy(mod_bmi, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ BMI (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.094 | 0.178 | -13.299 | 0 | 0.066 | 0.133 |
| bmi | 1.023 | 0.006 | 3.745 | 0 | 1.011 | 1.034 |
# Compare the fmd OR across models
tribble(
~Model, ~`Odds Ratio`, ~`95% CI`,
"Smoker Model", round(exp(coef(mod_smoker))[2], 3),
paste0("(", round(exp(confint(mod_smoker))[2,1],3), ", ", round(exp(confint(mod_smoker))[2,2],3), ")"),
"Income Model", round(exp(coef(mod_income))[2], 3),
paste0("(", round(exp(confint(mod_income))[2,1],3), ", ", round(exp(confint(mod_income))[2,2],3), ")"),
"BMI Model", round(exp(coef(mod_bmi))[2], 3),
paste0("(", round(exp(confint(mod_bmi))[2,1],3), ", ", round(exp(confint(mod_bmi))[2,2],3), ")")
) %>%
kable(caption = "FMD OR Across Different Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
| Model | Odds Ratio | 95% CI |
|---|---|---|
| Smoker Model | 1.959 | (1.675, 2.291) |
| Income Model | 0.821 | (0.793, 0.85) |
| BMI Model | 1.023 | (1.011, 1.034) |
3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.
3b. (10 pts) Create a table comparing the odds ratios from all three models.
3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer. Smoking has the strongest crude association with FMD, because it has the largest odds ratio. The odds ratio of 1.959 means that those who smoke have almost 2 times the odds of having frequent mental distress than those who do not smoke. ## Task 4: Introduction to Multiple Logistic Regression (20 points)
mod_full <- glm(fmd ~ smoker + bmi + income_cat, data = brfss_logistic,
family = binomial(link = "logit"))
# Create Table
mod_full |>
tbl_regression(
exponentiate = TRUE,
label = list(
bmi ~ "BMI",
smoker ~ "Smoking status",
income_cat ~ "Income category (per unit)"
)
) |>
bold_labels() |>
bold_p()
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Smoking status | |||
| Former/Never | — | — | |
| Current | 1.69 | 1.44, 1.99 | <0.001 |
| BMI | 1.02 | 1.01, 1.03 | <0.001 |
| Income category (per unit) | 0.84 | 0.81, 0.88 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
# Crude ORs from simple models
crude_smoker <- tidy(mod_smoker, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "smokerCurrent") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Crude")
# Adjusted ORs from multiple model
adj_smoker <- tidy(mod_full, exponentiate = TRUE, conf.int = TRUE) |>
filter(term == "smokerCurrent") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(type = "Adjusted")
bind_rows(crude_smoker, adj_smoker) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
caption = "Crude vs. Adjusted Odds Ratios") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Predictor | OR | 95% CI Lower | 95% CI Upper | Type |
|---|---|---|---|---|
| smokerCurrent | 1.959 | 1.675 | 2.291 | Crude |
| smokerCurrent | 1.689 | 1.436 | 1.987 | Adjusted |
4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.
4b. (5 pts) Report the adjusted odds ratios using
tbl_regression().
4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.
4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean? Confounding is present for smoking. The OR decreased, meaning confounding inflates the relationship smoking has on frequent mental distress. Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.
End of Lab Activity