EPI 553 — Logistic Regression Part 1 Lab Due: End of class, April 9, 2026
Complete the four tasks below using the BRFSS 2020 dataset
(brfss_logistic_2020.rds). Submit a knitted HTML file via
Brightspace. You may collaborate, but each student must submit their own
work.
| Variable | Description | Type |
|---|---|---|
fmd |
Frequent mental distress (No/Yes) | Factor (outcome) |
menthlth_days |
Mentally unhealthy days (0-30) | Numeric |
physhlth_days |
Physically unhealthy days (0-30) | Numeric |
sleep_hrs |
Hours of sleep per night | Numeric |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income category (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.4.3
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.4.3
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
brfss_logistic <- readRDS(
"/Users/sarah/OneDrive/Documents/EPI 553/data/brfss_logistic_2020.rds"
)
1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.
brfss_logistic %>%
count(fmd) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 2 × 3
## fmd n percent
## <fct> <int> <dbl>
## 1 No 4243 84.9
## 2 Yes 757 15.1
1b. (5 pts) Create a descriptive summary table of at
least 4 predictors, stratified by FMD status. Use
tbl_summary().
tbl_summary(
brfss_logistic,
by = fmd,
include = c(age, sex, exercise, smoker),
statistic = all_continuous() ~ "{mean} ({sd})",
label = list(
age ~ "Age (years)",
sex ~ "Sex",
exercise ~ "Exercise in past 30 days",
smoker ~ "Smoking status"
))|>
add_overall() |>
add_p() |>
bold_labels()
| Characteristic | Overall N = 5,0001 |
No N = 4,2431 |
Yes N = 7571 |
p-value2 |
|---|---|---|---|---|
| Age (years) | 56 (16) | 57 (16) | 50 (16) | <0.001 |
| Sex | <0.001 | |||
| Male | 2,701 (54%) | 2,378 (56%) | 323 (43%) | |
| Female | 2,299 (46%) | 1,865 (44%) | 434 (57%) | |
| Exercise in past 30 days | 3,673 (73%) | 3,192 (75%) | 481 (64%) | <0.001 |
| Smoking status | <0.001 | |||
| Former/Never | 3,280 (66%) | 2,886 (68%) | 394 (52%) | |
| Current | 1,720 (34%) | 1,357 (32%) | 363 (48%) | |
| 1 Mean (SD); n (%) | ||||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.
brfss_logistic %>%
group_by(exercise) %>%
summarize(prop_FMD = mean(fmd == "Yes")) %>%
ggplot(aes(x = exercise, y = prop_FMD)) +
geom_col(fill = "steelblue") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = "Exercise Status",
y = "Proportion with FMD",
title = "Proportion of Frequent Mental Distress by Exercise Status"
)
2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.
mod_2a <- glm(fmd~exercise, data = brfss_logistic, family = binomial(link = "logit"))
tidy(mod_2a, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3, caption = "Simple Logistic Regession: FMD ~ Exercise") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1.337 | 0.068 | -19.769 | 0 | -1.471 | -1.206 |
| exerciseYes | -0.555 | 0.083 | -6.655 | 0 | -0.718 | -0.391 |
The log‑odds of frequent mental distress among adults who do not exercise is −1.337. The coefficient for exercise is −0.555, meaning that adults who exercise have 0.555 lower log‑odds of frequent mental distress compared with those who do not exercise.
2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.
tidy(mod_2a, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption ="Simple Logistic Regression: FMD ~ Exercise (Odds ratio)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.263 | 0.068 | -19.769 | 0 | 0.230 | 0.299 |
| exerciseYes | 0.574 | 0.083 | -6.655 | 0 | 0.488 | 0.676 |
2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.
Adults who exercised in the past 30 days had 43% lower odds of experiencing frequent mental distress compared to adults who did not exercise (OR =0.57, 95% CI: 0.488 - 0.676). This indicates that exercise is associated with reduced likelihood of frequent mental distress.
2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).
mod_sleep <- glm(fmd ~ sleep_hrs, data = brfss_logistic, family = binomial(link = "logit"))
ggpredict(mod_sleep, terms = "sleep_hrs") |>
plot() +
labs(title = "Predicted Probability of Frequent Mental Distress by Hours of Sleep",
x = "Hours of Sleep", y = "Predicted Probability of FMD") +
theme_minimal()
## Task 3: Comparing Predictors (20 points)
3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.
mod_age <- glm(fmd ~ age, data = brfss_logistic, family = binomial(link = "logit"))
mod_sex <- glm(fmd ~ sex, data = brfss_logistic, family = binomial(link = "logit"))
mod_smoker <- glm(fmd ~ smoker, data =brfss_logistic, family = binomial (link = "logit"))
3b. (10 pts) Create a table comparing the odds ratios from all three models.
table_3b <- bind_rows(
tidy(mod_age, conf.int = TRUE, exponentiate = TRUE),
tidy(mod_sex, conf.int = TRUE, exponentiate = TRUE),
tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE)
)|>
filter(term != "(Intercept)") |>
select(term, estimate, conf.low, conf.high, p.value)
table_3b |>
kable(digits = 3, caption = "Comparison of Odds Ratios Across Three Simple Logistic Models") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| age | 0.974 | 0.970 | 0.979 | 0 |
| sexFemale | 1.713 | 1.466 | 2.004 | 0 |
| smokerCurrent | 1.959 | 1.675 | 2.291 | 0 |
3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.
Smoking status shows the strongest crude association with FMD. The odds ratio of 1.959 indicates a larger difference in the odds of FMD between smokers and non-smokers.
4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.
mod_4a <- glm(fmd~ age + sex + smoker, data= brfss_logistic, family = binomial(link = "logit"))
4b. (5 pts) Report the adjusted odds ratios using
tbl_regression().
mod_4a |>
tbl_regression(
exponentiate = TRUE,
label = list(
age ~ "Age (per year)",
sex ~ "Sex",
smoker ~ "Smoking Status"
)
) |>
bold_labels ()|>
bold_p()
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Age (per year) | 0.98 | 0.97, 0.98 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | 1.76 | 1.50, 2.07 | <0.001 |
| Smoking Status | |||
| Former/Never | — | — | |
| Current | 1.62 | 1.38, 1.91 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.
The crude OR for smoker from task 3 was 1.959, compared to the adjusted OR of 1.62.
4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?
Since the adjusted OR moved toward 1, this indicates that part of the crude association between smoking and FMD was explained by differences in age and sex between smokers and non-smokers. This pattern suggests the presence of confounding.
Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.
End of Lab Activity