Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(ggplot2)
library(car)
library(ggeffects)
library(haven)
library(janitor)
library(gtsummary)
library(GGally)
library(plotly)
library(lmtest)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_ci <- readRDS(
 "C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/brfss_ci.rds")
  #clean_names()

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by Exercise status",
    subtitle = "BRFSS 2020",
    x        = "Age in years",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise status"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Age is positively associated with physically unhealthy days in both Exercise status groups. However, the slope is steeper for the No Exercise group, suggesting that Exercise modifies the Age-Physically Unhealthy relationship.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

group_means <- brfss_ci |>
  summarise(mean_days = mean(physhlth_days), .by = c(exercise, sex))

group_means %>%
  kable(caption = "Physical Health Days by Sex & Exercise",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Physical Health Days by Sex & Exercise
exercise	sex	mean_days
No	Female	7.430400
Yes	Male	2.323577
Yes	Female	2.451056
No	Male	6.643498

It appears the relationship between Physical health days and Exercise might slightly differ by sex – males and females who exercised had roughly the same number of poor physical health days, but among the non-exercisers, females had much more physically unhealthy days than males.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Sleep and Physical Health Days, Stratified by Education Level",
    subtitle = "BRFSS 2020",
    x        = "Sleep (hrs)",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Education level"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

The slopes are different across education groups – the slope becomes less steep with increasing education level.

Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

exercise_brfss <- brfss_ci %>%
  filter(
    exercise == "Yes")

no_exercise_brfss <- brfss_ci %>%
  filter(
    exercise == "No")


exer <- lm(physhlth_days ~ age, data = exercise_brfss)

tidy(exer, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Physically Unhealthy Days vs. Age, Exercisers",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Physically Unhealthy Days vs. Age, Exercisers
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.3752	0.3327	4.1327	0.0000	0.7228	2.0275
age	0.0192	0.0060	3.2079	0.0013	0.0075	0.0309

no_exer <- lm(physhlth_days ~ age, data = no_exercise_brfss)

tidy(no_exer, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Physically Unhealthy Days vs. Age, Non-Exercisers",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Physically Unhealthy Days vs. Age, Non-Exercisers
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	1.2837	2.1508	0.0317	0.2421	5.2798
age	0.0745	0.0212	3.5094	0.0005	0.0328	0.1161

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by Exercise status",
    subtitle = "BRFSS 2020",
    x        = "Age in years",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise status"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

The lines are not parallel. The non-exercise group has a steeper slope.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.

No, even though it is visually apparent that the slopes are different, we cannot determine if that difference is statistically significant through the stratified results alone.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

inter <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(inter, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Physically Unhealthy Days vs. Age * Exercise",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Physically Unhealthy Days vs. Age * Exercise
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

Physhlth_days = 2.7610 + (0.0745 * age) + (-1.3858 * ExerciseYes) + (-0.0553 * (ExerciseYes * Age))

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

No Exercise:

Physhlth_days = 2.7610 + (0.0745 * age)

Exercise:

Physhlth_days = (2.7610-1.3858) + (0.0745 * age) + (-0.0553 * Age))

Physhlth_days = 1.3752 + (0.0192 * age)

These match the stratified analysis from Task 2.

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

int_term <- tidy(inter) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")

## Interaction term (age:exerciseYes):

cat("  Estimate:", round(int_term$estimate, 4), "\n")

##   Estimate: -0.0553

cat("  t-statistic:", round(int_term$statistic, 3), "\n")

##   t-statistic: -3.407

cat("  p-value:", round(int_term$p.value, 4), "\n")

##   p-value: 7e-04

Null hypothesis: The slopes are equal, no interaction. Alternative hypothesis: The slopes are not equal, there is an interaction. T statistic is -3.407, p value is 7e-04. Since p<0.05, I reject the null hypothesis and include that interaction is present.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

inter2 <- lm(physhlth_days ~ age * education, data = brfss_ci)

tidy(inter2, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Physically Unhealthy Days vs. Age * Education",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Physically Unhealthy Days vs. Age * Education
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7542	1.5422	1.7859	0.0742	-0.2692	5.7775
age	0.0706	0.0269	2.6277	0.0086	0.0179	0.1233
educationHS graduate	-1.4016	1.6900	-0.8293	0.4069	-4.7146	1.9115
educationSome college	-1.1261	1.6893	-0.6666	0.5051	-4.4380	2.1857
educationCollege graduate	-2.7102	1.6580	-1.6346	0.1022	-5.9606	0.5402
age:educationHS graduate	-0.0197	0.0296	-0.6661	0.5054	-0.0776	0.0382
age:educationSome college	-0.0326	0.0295	-1.1039	0.2697	-0.0904	0.0253
age:educationCollege graduate	-0.0277	0.0289	-0.9583	0.3380	-0.0844	0.0290

#F-test

inter3 <- lm(physhlth_days ~ age + education, data = brfss_ci)

anova(inter3, inter2) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the Age x Education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Is the Age x Education Interaction Significant?
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ age + education	4995	306764.2	NA	NA	NA	NA
physhlth_days ~ age * education	4992	306671.9	3	92.2713	0.5007	0.6818

3 interaction terms are produced. p > 0.05 so the interaction is not statistically significant.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

pred_int <- ggpredict(inter2, terms = c("age", "education"))

ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
  labs(
    title    = "Predicted Physical Health Days by Age & Education",
    x        = "Age in years",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

The lines do not appear parallel.

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)

tidy(mod_crude, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Crude Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Crude Model
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	7.1027	0.2354	30.1692	0	6.6412	7.5643
exerciseYes	-4.7115	0.2656	-17.7401	0	-5.2322	-4.1908

The exercise coefficient is -4.7115.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

Fit the model physhlth_days ~ exercise + [covariate]
Report the adjusted exercise coefficient
Compute the percent change from the crude estimate
Apply the 10% rule to determine if the variable is a confounder

b_crude = coef(mod_crude)["exerciseYes"]
b_crude

## exerciseYes 
##   -4.711514

mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)
mod_adj_sex <- lm(physhlth_days ~ exercise + sex, data = brfss_ci)
mod_adj_sleep_hrs <- lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci)
mod_adj_education <- lm(physhlth_days ~ exercise + education, data = brfss_ci)
mod_adj_income_cat <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)

age_adj <- coef(mod_adj_age)["exerciseYes"]
age_adj

## exerciseYes 
##   -4.550392

sex_adj <- coef(mod_adj_sex)["exerciseYes"]
sex_adj

## exerciseYes 
##   -4.697375

sleep_adj <- coef(mod_adj_sleep_hrs)["exerciseYes"]
sleep_adj

## exerciseYes 
##   -4.683112

edu_adj <- coef(mod_adj_education)["exerciseYes"]
edu_adj

## exerciseYes 
##   -4.391222

inc_adj <- coef(mod_adj_income_cat)["exerciseYes"]
inc_adj

## exerciseYes 
##   -3.940615

age_change <- abs(b_crude - age_adj) / abs(b_crude) * 100
age_change

## exerciseYes 
##    3.419746

sex_change <- abs(b_crude - sex_adj) / abs(b_crude) * 100
sex_change

## exerciseYes 
##   0.3000968

sleep_change <- abs(b_crude - sleep_adj) / abs(b_crude) * 100
sleep_change

## exerciseYes 
##   0.6028321

edu_change <- abs(b_crude - edu_adj) / abs(b_crude) * 100
edu_change

## exerciseYes 
##    6.798071

inc_change <- abs(b_crude - inc_adj) / abs(b_crude) * 100
inc_change

## exerciseYes 
##    16.36203

confounders <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci))

conf_table <- map_dfr(names(confounders), \(name) {
  mod <- confounders[[name]]
  b_adj_val <- coef(mod)["exerciseYes"]
  tibble(
    Covariate = name,
    `Crude β (exercise)` = round(b_crude, 4),
    `Adjusted β (exercise)` = round(b_adj_val, 4),
    `% Change` = round(abs(b_crude - b_adj_val) / abs(b_crude) * 100, 1),
    Confounder = ifelse(abs(b_crude - b_adj_val) / abs(b_crude) * 100 > 10,
                        "Yes (>10%)", "No")
  )
})

conf_table

## # A tibble: 5 × 5
##   Covariate `Crude β (exercise)` `Adjusted β (exercise)` `% Change` Confounder
##   <chr>                    <dbl>                   <dbl>      <dbl> <chr>     
## 1 Age                      -4.71                   -4.55        3.4 No        
## 2 Sex                      -4.71                   -4.70        0.3 No        
## 3 Sleep                    -4.71                   -4.68        0.6 No        
## 4 Education                -4.71                   -4.39        6.8 No        
## 5 Income                   -4.71                   -3.94       16.4 Yes (>10%)

conf_table |>
  kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Systematic Confounding Assessment: One-at-a-Time Addition
Covariate	Crude β (exercise)	Adjusted β (exercise)	% Change	Confounder
Age	-4.7115	-4.5504	3.4	No
Sex	-4.7115	-4.6974	0.3	No
Sleep	-4.7115	-4.6831	0.6	No
Education	-4.7115	-4.3912	6.8	No
Income	-4.7115	-3.9406	16.4	Yes (>10%)

  #column_spec(5, bold = TRUE)

Income is a confounder of the exercise - physical health relationship as the percent change between the crude and adjusted coefficients is >10%. The other variables age, sex, sleep, and education are not confounders as the percent change between the crude and adjusted coefficients is <10% for each.

Present your results in a single summary table.

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

full_model <- lm(physhlth_days ~ exercise * age + sex + sleep_hrs + education + income_cat, data = brfss_ci)

tidy(full_model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Full Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Full Model
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5496	1.1017	8.6681	0.0000	7.3898	11.7094
exerciseYes	-0.4797	0.9525	-0.5037	0.6145	-2.3470	1.3876
age	0.0785	0.0143	5.4826	0.0000	0.0505	0.1066
sexFemale	0.0092	0.2169	0.0426	0.9660	-0.4159	0.4344
sleep_hrs	-0.4096	0.0795	-5.1521	0.0000	-0.5655	-0.2538
educationHS graduate	-1.0692	0.5083	-2.1034	0.0355	-2.0658	-0.0726
educationSome college	-0.7924	0.5145	-1.5401	0.1236	-1.8012	0.2163
educationCollege graduate	-1.0013	0.5222	-1.9175	0.0552	-2.0251	0.0224
income_cat	-0.6311	0.0589	-10.7067	0.0000	-0.7467	-0.5156
exerciseYes:age	-0.0567	0.0159	-3.5588	0.0004	-0.0880	-0.0255

b_crude_full = coef(full_model)["exerciseYes"]

change <- abs(b_crude - b_crude_full) / abs(b_crude) * 100
change

## exerciseYes 
##    89.81761

The exercise coefficient is -0.4797, compared to the crude estimate of -4.7115. The estimate decreased by almost 90%.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

Gen health is likely a confounder of the exercise-physical health relationship. It is associated with the exposure of exercise and the outcome of physical health. Individuals with good general health will be more likely to engage in exercise, and individuals with good general health are also more likely to have good physical health. It is not likely that exercise elicits an effect on physical health via general health – exercise likely has an effect on physical health directly. Thus it is unlikely that gen health lies on the causal pathway between exercise and physical health.

It cannot be both a confounder or a mediator because a mediator lies on the causal pathway, but a confounder cannot lie on the causal pathway.

Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

The association between exercise and physical health prior to adjusting for other variables was -4.7115, so exercising was associated with 4.7115 LESS physically unhealthy days per month. There was a significant interaction between age and exercise, so age affected the physical health-exercise relationship. Income was a confounder of the exercise-physical health relationship – it was associated with both exercise and physical health and therefore served to distort the true relationship between exercise and physical health. After adjusting for other variables, the association between exercise and physical health was -0.4797. Thus, a negative relationship was still observed, but with a much lower magnitude after adjusting for other variables. Because BRFSS is a cross-sectional study with data collected at a single point in time, we cannot infer a causal relationship between any variables, and there may be some additional confounders that remain unknown and unaccounted for.

Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
Which variables confounded the exercise-physical health association
The direction and approximate magnitude of the adjusted exercise effect
Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

Removing gen health may not be appropriate if it is a mediating variable. Because the research aim is to determine the total effect of exercise on physically unhealthy days, the study should account for the potential indirect pathway of exercise -> gen health -> physical health. By removing gen health, the relationship between exercise and physical health days might be underestimated.

End of Lab Activity

Confounding and Interactions in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Josh Macera

March 24, 2026