EPI 553 — Confounding and Interactions Lab
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
| physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) |
| sleep_hrs | Sleep hours per night | Continuous (1–14) |
| menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) |
| age | Age in years (capped at 80) | Continuous |
| sex | Sex (Male/Female) | Factor |
| education | Education level (4 categories) | Factor |
| exercise | Any physical activity (Yes/No) | Factor |
| gen_health | General health status (5 categories) | Factor |
| income_cat | Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
brfss_ci <- readRDS(
"~/Desktop/EPI553/data/brfss_ci_2020.rds"
)
1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe.
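There is a slight positive relationship: physically unhealthy days tend to increase with age. The fitted line for non-exercisers sits higher and rises more steeply than the line for exercisers. The points are widely scattered, so the relationship is weak in both groups.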
# Create scatterplot
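# Leave color unset inside the geoms so the exercise mapping from aes() applies
# to both the points and the per-group regression lines
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE) +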
labs(
title = "Age vs Physically Unhealthy Days",
subtitle = "BRFSS Data",
x = "Age (years)",
y = "Physically Unhealthy Days in the past 30"
) +
theme_minimal()
1b. (5 pts) Compute the mean
physhlth_days
for each combination of sex and exercise.
Present the results in a table. Does it appear that the association
between exercise and physical health days might differ by sex?
For both males and females, those who exercised had a lower mean number of physically unhealthy days than those who did not. Among non-exercisers, females had a higher mean (7.43) than males (6.64); among exercisers, the means were nearly identical (2.45 vs. 2.32). The gap between exercisers and non-exercisers is therefore somewhat larger for females, which suggests the association might differ modestly by sex.
brfss_ci %>%
group_by(sex, exercise) %>%
summarise(
mean_physhlth_days= mean(physhlth_days, na.rm= TRUE),
.groups= "drop"
)
## # A tibble: 4 × 3
## sex exercise mean_physhlth_days
## <fct> <fct> <dbl>
## 1 Male No 6.64
## 2 Male Yes 2.32
## 3 Female No 7.43
## 4 Female Yes 2.45
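One way to judge whether the exercise association differs by sex is to compare the exercise gap within each sex. The sketch below is only illustrative: it re-enters the rounded group means from the table above by hand (the object names are hypothetical) and takes the difference between the two gaps.

```r
# Group means copied from the summary table above (rounded to 2 decimals)
means <- data.frame(
  sex       = c("Male", "Male", "Female", "Female"),
  exercise  = c("No", "Yes", "No", "Yes"),
  mean_days = c(6.64, 2.32, 7.43, 2.45)
)

# Helper: look up one cell of the 2 x 2 table of means
cell_mean <- function(s, e) {
  means$mean_days[means$sex == s & means$exercise == e]
}

# Exercise gap within each sex: mean(Yes) - mean(No)
gap_male   <- cell_mean("Male", "Yes") - cell_mean("Male", "No")
gap_female <- cell_mean("Female", "Yes") - cell_mean("Female", "No")

# Difference between the gaps; a value near 0 would suggest no interaction
gap_difference <- gap_female - gap_male

print(round(c(male = gap_male, female = gap_female, difference = gap_difference), 2))
```

This is a descriptive comparison only. It does not account for sampling variability, so a formal test would still need an interaction term in a regression model.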
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups.
All four panels show a negative relationship between hours of sleep per night and physically unhealthy days. The slopes are not identical: the college graduate group has the flattest slope, while the less than high school group has the steepest.
# Create scatterplot
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
facet_wrap(~ education) +
labs(
title = "Sleep Hours per Night vs Physically Unhealthy Days",
subtitle = "BRFSS Data",
x = "Sleep (Hours per Night)",
y = "Physically Unhealthy Days in the past 30"
) +
theme_minimal()
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum.
Exercisers: slope = 0.0192, SE = 0.0060, 95% CI = [0.0075, 0.0309]
Non-exercisers: slope = 0.0745, SE = 0.0212, 95% CI = [0.0328, 0.1161]
# Fit separate models for exercisers and nonexercisers
mod_yes <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_no <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
# Compare coefficients
bind_rows(
tidy(mod_yes, conf.int = TRUE) |> mutate(Stratum = "Exerciser"),
tidy(mod_no, conf.int = TRUE) |> mutate(Stratum = "Nonexerciser")
) |>
filter(term == "age") |>
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stratified Analysis: Age → Physical Health Days, by Exercise",
col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Stratum | Estimate | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Exerciser | 0.0192 | 0.0060 | 0.0075 | 0.0309 | 0.0013 |
| Nonexerciser | 0.0745 | 0.0212 | 0.0328 | 0.1161 | 0.0005 |
2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?
The lines are not parallel. Both slope upward, confirming a positive relationship between age and physically unhealthy days in each group, but the line for non-exercisers is steeper (0.0745 vs. 0.0192 days per year of age).
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise",
subtitle = "Are the slopes parallel or different?",
x = "Age in Years",
y = "Physically Unhealthy Days (Past 30)",
color = "Exercise"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
2c. (5 pts) Can you formally test whether the two
slopes are different using only the stratified results? Explain why or
why not.
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation.
Fitted equation: physhlth_days = 2.7610 + 0.0745(age) - 1.3858(exerciseYes) - 0.0553(age × exerciseYes)
#Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ age + exercise, data = brfss_ci)
# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Exercise → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.
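Exercisers (exerciseYes = 1): physhlth_days = (2.7610 - 1.3858) + (0.0745 - 0.0553)(age) = 1.3752 + 0.0192(age)
Non-exercisers (exerciseYes = 0): physhlth_days = 2.7610 + 0.0745(age)
The slopes (0.0192 for exercisers, 0.0745 for non-exercisers) match the stratified estimates from Task 2.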
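The substitution can also be checked numerically. The following is a minimal sketch that re-enters the rounded coefficients from the interaction model table by hand; `stratum_line()` is a hypothetical helper written for this illustration, not part of any package.

```r
# Rounded coefficients copied from the interaction model table
b <- c(
  intercept         = 2.7610,
  age               = 0.0745,
  exerciseYes       = -1.3858,
  age_x_exerciseYes = -0.0553
)

# Plug exerciseYes = 0 or 1 into the fitted equation and collect
# the resulting intercept and age slope for that stratum
stratum_line <- function(b, exercise_yes) {
  c(
    intercept = unname(b["intercept"] + exercise_yes * b["exerciseYes"]),
    slope     = unname(b["age"] + exercise_yes * b["age_x_exerciseYes"])
  )
}

line_no  <- stratum_line(b, exercise_yes = 0)  # non-exercisers
line_yes <- stratum_line(b, exercise_yes = 1)  # exercisers

print(rbind(`Non-exercisers` = line_no, Exercisers = line_yes))
```

With the actual model object, the same quantities could be read from `coef(mod_int)` instead of typed values, which would avoid rounding error.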
3c. (5 pts) Conduct the t-test for the interaction
term (age:exerciseYes). State the null and alternative
hypotheses, report the test statistic and p-value, and state your
conclusion about whether interaction is present.
H0: β3 = 0, where β3 is the coefficient on age:exerciseYes (the age slopes are equal, the lines are parallel, no interaction)
H1: β3 ≠ 0 (the age slopes differ, interaction is present)
Test statistic: t = -3.407; p-value = 0.0007 (7e-04)
The interaction term (age:exerciseYes) has a coefficient of -0.0553, a t-statistic of -3.407, and a p-value of 0.0007. Since the p-value is less than 0.05, we reject the null hypothesis. There is statistically significant evidence that the association between age and physically unhealthy days differs by exercise status, indicating the presence of an interaction.
int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")
## Interaction term (age:exerciseYes):
## estimate: -0.0553
## t-statistic: -3.407
## p-value: 7e-04
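The reported p-value can be approximately reproduced from the t-statistic alone. This sketch assumes all 5,000 respondents contribute to the model, so the residual degrees of freedom are 5000 - 4 (four estimated coefficients); that count is an assumption here, and `df.residual(mod_int)` would give the exact value.

```r
# Values copied from the interaction model table
estimate <- -0.0553
t_stat   <- -3.4072

# ASSUMPTION: n = 5000 complete cases and 4 coefficients in the model
df_resid <- 5000 - 4

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# The standard error implied by estimate / t, for comparison with the table
se_implied <- estimate / t_stat

cat("two-sided p-value:", signif(p_value, 2), "\n")
cat("implied SE:", round(se_implied, 4), "\n")
```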
3d. (5 pts) Now fit a model with an interaction
between age and education: physhlth_days ~ age * education.
How many interaction terms are produced? Use a partial F-test to test
whether the age \(\times\) education
interaction as a whole is significant.
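Three interaction terms are produced, one for each non-reference education level (education has 4 categories). The partial F-test gives F = 0.5007 on 3 and 4992 degrees of freedom, p = 0.6818. Because the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no statistically significant evidence that the age slope differs by education.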
mod_add_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)
mod_int_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)
anova(mod_add_edu, mod_int_edu) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does the Age Slope Differ by Education (Age × Education Interaction)?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ age + education | 4995 | 306764.2 | NA | NA | NA | NA |
| physhlth_days ~ age * education | 4992 | 306671.9 | 3 | 92.2713 | 0.5007 | 0.6818 |
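The partial F statistic in this table can be rebuilt by hand from the two residual sums of squares, which shows what `anova()` is comparing. The sketch below uses the rounded values printed above, so the result agrees with the table only up to rounding.

```r
# Values copied from the ANOVA table above
rss_reduced <- 306764.2  # physhlth_days ~ age + education
rss_full    <- 306671.9  # physhlth_days ~ age * education
df_full     <- 4992      # residual df of the full (interaction) model
q           <- 3         # number of interaction terms being tested

# Partial F: drop in RSS per tested term, relative to the full model's
# residual mean square
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

cat("F =", round(f_stat, 3), " p =", round(p_value, 3), "\n")
```

A small F like this means the three interaction terms reduce the residual sum of squares by about as much as would be expected from noise alone.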
3e. (5 pts) Create a visualization using
ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel?
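The four lines are not exactly parallel. The lines for some college and college graduate are close to parallel with each other, as are the lines for HS graduate and less than high school. The differences in slope are small, which is consistent with the non-significant partial F-test in 3d (p = 0.6818).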
pred_int <- ggpredict(mod_int_edu, terms = c("age", "education"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Physical Health Days by Age and Education",
subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
x = "Age in years",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2") +
scale_fill_brewer(palette = "Set2")  # match ribbon fill to the line colors
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate.
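Exercise coefficient (exerciseYes) = -4.7115. Before any adjustment, exercisers report on average about 4.7 fewer physically unhealthy days in the past 30 than non-exercisers.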
# Crude model: exercise → physical health days
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
# Adjusted model: adding age
mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)
# Extract exercise coefficients
b_crude <- coef(mod_crude)["exerciseYes"]
b_adj <- coef(mod_adj_age)["exerciseYes"]
pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100
tribble(
~Model, ~`Exercise β`, ~SE, ~`95% CI`,
"Crude (exerciseYes)",
round(b_crude, 4),
round(tidy(mod_crude) |> filter(term == "exerciseYes") |> pull(std.error), 4),
paste0("(", round(confint(mod_crude)["exerciseYes", 1], 4), ", ",
round(confint(mod_crude)["exerciseYes", 2], 4), ")"),
"Adjusted (+ age)",
round(b_adj, 4),
round(tidy(mod_adj_age) |> filter(term == "exerciseYes") |> pull(std.error), 4),
paste0("(", round(confint(mod_adj_age)["exerciseYes", 1], 4), ", ",
round(confint(mod_adj_age)["exerciseYes", 2], 4), ")")
) |>
kable(caption = "Confounding Assessment: Does Age Confound the Exercise-Physical Health Association?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Model | Exercise β | SE | 95% CI |
|---|---|---|---|
| Crude (exerciseYes) | -4.7115 | 0.2656 | (-5.2322, -4.1908) |
| Adjusted (+ age) | -4.5504 | 0.2673 | (-5.0744, -4.0264) |
4b. (10 pts) Systematically assess whether each of
the following is a confounder of the exercise-physical health
association: age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]
Present your results in a single summary table.
# Crude model
b_crude_val <- coef(mod_crude)["exerciseYes"]
# One-at-a-time adjusted models
confounders <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep_hrs" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
tibble(
Covariate = name,
`Crude β (exerciseYes)` = round(b_crude_val, 4),
`Adjusted β (exerciseYes)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(5, bold = TRUE)
| Covariate | Crude β (exerciseYes) | Adjusted β (exerciseYes) | % Change | Confounder |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Sleep_hrs | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Income | -4.7115 | -3.9406 | 16.4 | Yes (>10%) |
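The 10% change-in-estimate rule behind this table is simple arithmetic, and it can be checked directly from the printed coefficients. The following sketch re-enters the rounded values from the table; the 10% cutoff is the convention used in this lab, not a universal threshold.

```r
# Coefficients copied from the confounding assessment table
b_crude <- -4.7115
b_adjusted <- c(
  Age       = -4.5504,
  Sex       = -4.6974,
  Sleep_hrs = -4.6831,
  Education = -4.3912,
  Income    = -3.9406
)

# Percent change in the exercise coefficient after adding each covariate
pct_change <- abs(b_crude - b_adjusted) / abs(b_crude) * 100

# Covariates that move the estimate by more than 10%
flagged <- names(pct_change)[pct_change > 10]

print(round(pct_change, 1))
print(flagged)
```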
4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?
After adjusting for age, sex, sleep hours, education, and income, the exercise coefficient is -3.7297, compared with the crude estimate of -4.7115. The estimate moved 20.84% toward the null after adjustment.
# fully adjusted model
mod_full <- lm(physhlth_days ~ exercise + age + sex + sleep_hrs + education + income_cat,
data = brfss_ci)
# coefficients
b_crude <- coef(mod_crude)["exerciseYes"]
b_full <- coef(mod_full)["exerciseYes"]
# Percent change in exercise coefficients
pct_change <- abs(b_crude - b_full) / abs(b_crude) * 100
# results
cat("Crude β:", round(b_crude, 4), "\n")
## Crude β: -4.7115
## Adjusted β: -3.7297
## Percent change: 20.84 %
4d. (5 pts) Is gen_health a confounder
or a mediator of the exercise-physical health relationship? Could it be
both? Explain your reasoning with reference to the three conditions for
confounding and the concept of the causal pathway.
gen_health could be a confounder, a mediator, or both. A confounder must satisfy three conditions: it is associated with the exposure, it is associated with the outcome independently of the exposure, and it is not on the causal pathway between them. A variable that lies on the causal pathway is a mediator instead. gen_health could act as a confounder if poor general health both limits a person's ability to exercise and leads to more physically unhealthy days. It could act as a mediator if a lack of exercise worsens general health, which in turn leads to more physically unhealthy days. Because BRFSS is cross-sectional, both directions are plausible and the data alone cannot distinguish them.
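The practical consequence of the mediator case can be illustrated with simulated data. The sketch below is purely hypothetical: the variables borrow names from the lab but are generated from made-up effect sizes, with general health placed on the pathway from exercise to unhealthy days. It is not an analysis of BRFSS.

```r
set.seed(553)
n <- 5000

# SIMULATED data (all effect sizes are invented for illustration)
exercise   <- rbinom(n, 1, 0.7)                # 1 = exercises
gen_health <- 3 + 1.0 * exercise + rnorm(n)    # mediator: higher = better health
physhlth   <- 10 - 1.0 * exercise - 2.0 * gen_health + rnorm(n, sd = 3)

# Total effect: exercise coefficient without the mediator in the model
total_effect <- unname(coef(lm(physhlth ~ exercise))["exercise"])

# Direct effect only: adjusting for the mediator removes the part of the
# effect that flows through general health
direct_effect <- unname(coef(lm(physhlth ~ exercise + gen_health))["exercise"])

cat("Total effect (unadjusted for mediator):", round(total_effect, 2), "\n")
cat("Direct effect (adjusted for mediator): ", round(direct_effect, 2), "\n")
```

In this setup the true total effect is -3 days (a direct effect of -1 plus an indirect effect of -2 through general health), so adjusting for the mediator recovers only the direct portion and understates the total effect.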
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
Do not use statistical jargon.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument.
End of Lab Activity