Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable Description Type
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
exercise Any physical activity (Yes/No) Factor
gen_health General health status (5 categories) Factor
income_cat Household income (1–8 ordinal) Numeric
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.4.3
brfss_ci <- readRDS(
"/Users/sarah/OneDrive/Documents/EPI 553/data/brfss_ci_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

ggplot(brfss_ci, aes(x= age, y= physhlth_days, color = exercise)) +
  geom_point(alpha= 0.4) +
  geom_smooth(method = "lm") +
  labs(
    x = "Age (years)",
    y = "Poor Physical Health Days (past 30 days)",
    color = "Exercise",
    title = "Poor Physical Health Days vs Age by Exercise Status"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows a positive association between age and poor physical health days.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

mean_table <- brfss_ci %>%
  group_by(sex, exercise) %>%
  summarise(mean_physhlth = mean(physhlth_days, na.rm = TRUE)) %>%
  arrange(sex, exercise)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
mean_table
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex    exercise mean_physhlth
##   <fct>  <fct>            <dbl>
## 1 Male   No                6.64
## 2 Male   Yes               2.32
## 3 Female No                7.43
## 4 Female Yes               2.45

For both men and women, individuals who exercise tend to report fewer poor physical health days. The mean values are near identical for both men and women exercising.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

ggplot(brfss_ci, aes(x= sleep_hrs, y= physhlth_days, color = education)) +
  geom_point(alphan= 0.4) +
  geom_smooth(method = "lm") +
  labs(
    x = "Hours of Sleep",
    y = "Poor Physical Health Days (past 30 days)",
    color = "Education level",
    title = "Poor Physical Health Days vs Hours of Sleep by Education Level"
  ) +
  theme_minimal()
## Warning in geom_point(alphan = 0.4): Ignoring unknown parameters: `alphan`
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot generally shows a negative association between hours of sleep and poor physical health days across all education groups. The slopes differ slightly some education levels show a steeper decline in poor physical health days with increasing sleep.


Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

exercise_model <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))

no_exercise_model <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))

bind_rows(
  tidy(exercise_model, conf.int = TRUE) |> mutate (Stratum = "Exercise"),
  tidy(no_exercise_model, conf.int = TRUE) |> mutate(Stratum = "No Exercise")
) |>
  filter(term == "age") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Stratified Analysis: Age -> Physical Health Days, by Exercise Status",
        col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stratified Analysis: Age -> Physical Health Days, by Exercise Status
Stratum Estimate SE CI Lower CI Upper p-value
Exercise 0.0192 0.0060 0.0075 0.0309 0.0013
No Exercise 0.0745 0.0212 0.0328 0.1161 0.0005

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title = "Association Between Age and Physical Health Days, Stratified by Exercise Status",
    x = "age (years)",
    y = "Physically Unhealthy Days (Past 30)",
    color = "Exercise"
  ) +
  theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'

Yes the regression lines are approximately parallel.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.

No we cannot formally test whether the two slopes are different using the stratified results. We are able to see that the slopes are different but can’t compute a pvalue for the difference.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

interaction_mod <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(interaction_mod,conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * Exercise -> Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: Age * Exercise -> Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7610 0.8798 3.1381 0.0017 1.0361 4.4858
age 0.0745 0.0145 5.1203 0.0000 0.0460 0.1030
exerciseYes -1.3858 0.9664 -1.4340 0.1516 -3.2804 0.5087
age:exerciseYes -0.0553 0.0162 -3.4072 0.0007 -0.0871 -0.0235

\[Physhlth_days = 2.7610 + 0.0745(age) + -1.3858(exerciseYes) + -0.0553(age X Exerciseyes)\]

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

b <- coef(interaction_mod)

tribble(
  ~Method, ~Stratum, ~Intercept, ~Slope,
  
  "Stratified", "Exercise",
    round(coef(exercise_model)[1], 3), 
    round(coef(exercise_model)[2], 3),
  
  "Stratified", "No Exercise",
    round(coef(no_exercise_model)[1], 3), 
    round(coef(no_exercise_model)[2], 3),
  
  "Interaction model", "Exercise",
    round(b[1] + b[3], 3), 
    round(b[2] + b[4], 3),
  
  "Interaction model", "No Exercise",
    round(b[1], 3), 
    round(b[2], 3)
  
) |>
  kable(caption = "Verification: Stratified Analysis = Interaction Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Verification: Stratified Analysis = Interaction Model
Method Stratum Intercept Slope
Stratified Exercise 1.375 0.019
Stratified No Exercise 2.761 0.074
Interaction model Exercise 1.375 0.019
Interaction model No Exercise 2.761 0.074

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

interaction_term <- tidy(interaction_mod) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes) : \n")
## Interaction term (age:exerciseYes) :
cat("estimate:", round(interaction_term$estimate, 4), "\n")
## estimate: -0.0553
cat("t-statistic:", round(interaction_term$statistic, 3), "\n")
## t-statistic: -3.407
cat("p-value:", round(interaction_term$p.value, 4), "\n")
## p-value: 7e-04

HO: _age:exerciseyes = 0 Ha: _age:exerciseyes != 0

Estimate = -0.0553 t-statistic = -3.407 p-vallue = .0007

There is strong evidence that the association between age and poor physical health days differs between exercisers and non-exercisers.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

mod_3d <- lm(physhlth_days ~ age * education, data = brfss_ci)

mod_3d_reduced <- lm(physhlth_days ~ age + education, data = brfss_ci)

tidy(mod_3d, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * Education -> Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: Age * Education -> Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7542 1.5422 1.7859 0.0742 -0.2692 5.7775
age 0.0706 0.0269 2.6277 0.0086 0.0179 0.1233
educationHS graduate -1.4016 1.6900 -0.8293 0.4069 -4.7146 1.9115
educationSome college -1.1261 1.6893 -0.6666 0.5051 -4.4380 2.1857
educationCollege graduate -2.7102 1.6580 -1.6346 0.1022 -5.9606 0.5402
age:educationHS graduate -0.0197 0.0296 -0.6661 0.5054 -0.0776 0.0382
age:educationSome college -0.0326 0.0295 -1.1039 0.2697 -0.0904 0.0253
age:educationCollege graduate -0.0277 0.0289 -0.9583 0.3380 -0.0844 0.0290
anova(mod_3d_reduced, mod_3d) |>
  tidy()|>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the Age x Education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Is the Age x Education Interaction Significant?
term df.residual rss df sumsq statistic p.value
physhlth_days ~ age + education 4995 306764.2 NA NA NA NA
physhlth_days ~ age * education 4992 306671.9 3 92.2713 0.5007 0.6818

There are three interaction terms. The p-value of 0.6818 indicates that we fail to reject the null hypothesis. There is no evidence that the association between age and physically unhealthy days differs across education levels.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

pred_3e <- ggpredict(mod_3d, terms = c("age", "education"))

ggplot(pred_3e, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title = "Predicted Physical Health Days by Age and Education",
    x = "Age (years)",
    y = "Predicted Physically Unhealthy Days",
    color = "Education",
    fill = "Education"
  ) +
  theme_minimal(base_size = 13)

Yes the lines appear to be roughly parallel.

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

mod_4a <- lm(physhlth_days ~ exercise, data = brfss_ci)
b_crude <- coef(mod_4a)
b_crude
## (Intercept) exerciseYes 
##    7.102708   -4.711514

The exercise coefficient in this model is -4.71.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

  • Fit the model physhlth_days ~ exercise + [covariate]
  • Report the adjusted exercise coefficient
  • Compute the percent change from the crude estimate
  • Apply the 10% rule to determine if the variable is a confounder

Present your results in a single summary table.

b_crude_val <- coef(mod_4a)["exerciseYes"]

confounders <- list(
  "Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
  "Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
  "Hours of Sleep" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
  "Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
  "Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)

conf_table <- map_dfr(names(confounders), \(name) {
  mod <- confounders[[name]]
  b_adjusted_val <- coef(mod)["exerciseYes"]
  tibble(
    Covariate = name,
    `Crude β  (exercise)` = round(b_crude_val, 4),
    `Adjusted β  (exercise)` = round(b_adjusted_val, 4),
    `% Change` = round(abs(b_crude_val - b_adjusted_val) / abs(b_crude_val) *100, 1),
    Confounder = ifelse(abs(b_crude_val - b_adjusted_val) / abs(b_crude_val) *100> 10, "Yes (>10%)", "No")
    
    
  )
})

conf_table |>
  kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE)
Systematic Confounding Assessment: One-at-a-Time Addition
Covariate Crude β (exercise) Adjusted β (exercise) % Change Confounder
Age -4.7115 -4.5504 3.4 No
Sex -4.7115 -4.6974 0.3 No
Hours of Sleep -4.7115 -4.6831 0.6 No
Education -4.7115 -4.3912 6.8 No
Income -4.7115 -3.9406 16.4 Yes (>10%)

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

mod_4c <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)

b_income <- coef(mod_4c) ["exerciseYes"]


  tribble(
  ~Model, ~`Exercise β`, ~`% Change from Crude`,
  "Crude", round(b_crude_val, 4), "—",
  
  "Adjusted (income only)", round(b_income, 4),
    paste0(round(abs(b_crude_val - b_income) / abs(b_crude_val) * 100, 1), "%"),
  "Final (all confounders)", round(b_income, 4),
    paste0(round(abs(b_crude_val - b_income) / abs(b_crude_val) * 100, 1), "%")
) |>
  kable(caption = "Exercise Coefficient: Crude vs. Adjusted Models") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Exercise Coefficient: Crude vs. Adjusted Models
Model Exercise β % Change from Crude
Crude -4.7115
Adjusted (income only) -3.9406 16.4%
Final (all confounders) -3.9406 16.4%

The fully adjusted model included exercise and the only confounder, income. The adjusted exercise coefficient was -3.94, compared to the crude estimate of -4.71. This represents a 16.4% change.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

gen_health is not a confounder, although it’s associated with both exercise and physical health days. Exercise can improve general health which reduces physically unhealthy days. Because confounders must not be downstream of the exposure, gen_health fails the third condition for confounding.


Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

  • Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
  • Which variables confounded the exercise-physical health association
  • The direction and approximate magnitude of the adjusted exercise effect
  • Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

People who exercise tend to report fewer physically unhealthy days, and this pattern was generally consistent across groups. Income was the only factor that meaningfully changed the association, suggesting that part of the difference in physical health between those who exercise and those who don’t reflects underlying socioeconomic differences.After accounting for income, people who exercise still reported about four fewer physically unhealthy days. Because the data is cross-sectional, we cannot determine whether exercise improves health.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

Including gen_health in the model would block the effect we are trying to measure. Exercise can improve general health which can in turn reduce physically unhealthy days. This means it lies in the causal pathway as a mediator rather than confounder.


End of Lab Activity