EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
library (ggplot2)
brfss_ci <- readRDS(
"C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_ci_2020.rds"
)
# Quick data check
dim(brfss_ci) ## [1] 5000 9
## # A tibble: 6 × 9
## physhlth_days sleep_hrs menthlth_days age sex education exercise
## <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct>
## 1 0 8 30 77 Female HS graduate No
## 2 0 8 0 65 Male College graduate Yes
## 3 0 8 4 41 Female College graduate Yes
## 4 0 8 0 71 Male Less than HS Yes
## 5 30 5 0 56 Male Some college No
## 6 0 6 4 58 Female HS graduate Yes
## # ℹ 2 more variables: gen_health <fct>, income_cat <dbl>
1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe.
# Create scatterplot
p1a <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
# Add jitter to reduce overplotting (data are discrete)
geom_jitter(alpha = 0.15, width = 0.5, height = 0.5, size = 1) +
# Fit separate regression lines for each exercise group
geom_smooth(method = "lm", se = TRUE, linewidth = 1.1) +
labs(
title = "Physical Health Days by Age, Stratified by Exercise Status",
subtitle = "Do the slopes differ between exercisers and non-exercisers?",
x = "Age (years)",
y = "Physically Unhealthy Days (past 30)",
color = "Physical Activity"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
print(p1a)1b. (5 pts) Compute the mean
physhlth_days for each combination of sex and
exercise. Present the results in a table. Does it appear
that the association between exercise and physical health days might
differ by sex?
# Calculate means and counts
mean_sex_exercise <- brfss_ci %>%
group_by(sex, exercise) %>%
summarize(
n = n(),
Mean_Physhlth = round(mean(physhlth_days), 2),
SD_Physhlth = round(sd(physhlth_days), 2)
)
# Display in table format
mean_sex_exercise %>%
kable(
caption = "Mean Physical Health Days by Sex and Exercise Status",
col.names = c("Sex", "Exercise", "N", "Mean Unhealthy Days", "SD")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Sex | Exercise | N | Mean Unhealthy Days | SD |
|---|---|---|---|---|
| Male | No | 446 | 6.64 | 10.98 |
| Male | Yes | 1845 | 2.32 | 6.49 |
| Female | No | 625 | 7.43 | 11.46 |
| Female | Yes | 2084 | 2.45 | 6.32 |
Interpretation Based on the table, we can see that exercisers of both sexes report far fewer unhealthy days than non-exercisers and the gap is slightly larger for men (difference of about 4 days) than for women (difference of about 5 days). The association might differ by sex.
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups.
p1b <- ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
geom_jitter(alpha = 0.15, width = 0.2, height = 0.3, size = 0.9) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
# FACETING: separate panels for each education level
facet_wrap(~education, nrow = 2) +
labs(
title = "Sleep vs. Physical Health Days, by Education Level",
subtitle = "Are the regression slopes similar or different across education groups?",
x = "Sleep Hours per Night",
y = "Physically Unhealthy Days (past 30)",
color = "Education"
) +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set2") +
theme(legend.position = "none")
print(p1b)Interpretation The relationship between sleep and physical health is consistently negative across all education levels, but the strength of the association differs slightly. People with lower education levels show a stronger drop in unhealthy days as sleep increases, while those with higher education levels show a modest decline.Highest education group, college graduate has the weaker association of sleep with physically unhealthy days.
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum.
Research question: Is the age-physical health relationship different for people who exercise vs. don’t?
# Model for EXERCISERS
mod_exerciser <- lm(
physhlth_days ~ age,
data = brfss_ci %>% filter(exercise == "Yes")
)
# Model for NON-EXERCISERS
mod_nonexerciser <- lm(
physhlth_days ~ age,
data = brfss_ci %>% filter(exercise == "No")
)
# Extract and compare results
stratified_model <- bind_rows(
tidy(mod_exerciser, conf.int = TRUE) %>%
filter(term == "age") %>%
mutate(Stratum = "Exercisers"),
tidy(mod_nonexerciser, conf.int = TRUE) %>%
filter(term == "age") %>%
mutate(Stratum = "Non-exercisers")
) %>%
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) %>%
mutate(
across(where(is.numeric), ~round(., 4))
)
# Display results
stratified_model %>%
kable(
caption = "Stratified Analysis: Age Effect on Physical Health Days, by Exercise Status",
col.names = c("Stratum", "Estimate (β)", "SE", "CI Lower", "CI Upper", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Stratum | Estimate (β) | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Exercisers | 0.0192 | 0.0060 | 0.0075 | 0.0309 | 0.0013 |
| Non-exercisers | 0.0745 | 0.0212 | 0.0328 | 0.1161 | 0.0005 |
Interpretation:
Exercisers - Slope (β): 0.0192 - Standard Error: 0.0060 - 95% CI: 0.0075 to 0.0309
Non‑Exercisers - Slope (β): 0.0745 - Standard Error: 0.0212 - 95% CI: 0.0328 to 0.1161
Age is associated with more physically unhealthy days in both groups, but the increase is much more among non‑exercisers. The confidence intervals do not overlap much, which could be meaningful difference in how age relates to physical health depending on exercise status.
2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?
p2b <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.1, width = 0.5, height = 0.3, size = 0.8) +
# Fit regression lines using the stratified models
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2, alpha = 0.2) +
labs(
title = "Stratified Regression Lines: Age vs. Physical Health Days",
x = "Age (years)",
y = "Physically Unhealthy Days (past 30)",
color = "Exercise Status"
) +
theme_minimal(base_size = 13) +
scale_color_manual(values = c("No" = "#e41a1c", "Yes" = "#377eb8")) +
theme(legend.position = "bottom")
print(p2b)Based on this,
Exercisers - Slope (β): 0.0192 - Standard Error: 0.0060 - 95% CI: 0.0075 to 0.0309
Non‑Exercisers - Slope (β): 0.0745 - Standard Error: 0.0212 - 95% CI: 0.0328 to 0.1161
The slope isn’t exactly parallel.
2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.
No, even though the stratified approach is intuitive and visually compelling, but it has important limitations:
Hence, the solution is to use a single regression model with an interaction term.
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation.
**Model specification: physhlth_days ~ age * exercise This expands to: physhlth_days ~ age + exercise + age:exercise
** WHY age * exercise (not just age:exercise)? The interaction term (age:exercise) only tests if slopes differ We also need the main effects to estimate intercepts.***
# Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ age + exercise, data = brfss_ci)
# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ age + exercise + age:exercise, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Exercise → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
The Fitted Equation is:
\[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{Exercise}) - 0.055 (\text{Age} \times \text{Exercise})\]
3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.
For exercisers (Nonexercisers = 0): \[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{Exercise}) - 0.055 (\text{Age} \times \text{1})\]
For nonexercisers (exercisers = 1): \[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{0}) - 0.055 (\text{Age} \times \text{0})\]
Stratum-specific equations
#Exercisers
\[\widehat{\text{Phys Days}} = 1.375 + 0.0192 (\text{age})\]
Hence, the Slope for age: 0.0192
#Nonexercisers
\[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age})\]
Hence, the Slope for age: 0.0745
These slopes—0.0745 for non-exercisers and 0.0192 for exercisers—match the stratified analysis from Task 2, confirming that the interaction model and the stratified models are consistent.
3c. (5 pts) Conduct the t-test for the interaction
term (age:exerciseYes). State the null and alternative
hypotheses, report the test statistic and p-value, and state your
conclusion about whether interaction is present.
The formal test for interaction is simply the t-test for the interaction coefficient:
\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]
int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")## Interaction term (age:exerciseYes):
## Estimate: -0.0553
## t-statistic: -3.407
## p-value: 7e-04
Interpretation: The interaction term (age:exerciseYes) has an estimated coefficient of –0.0553, with a t‑statistic of –3.407 and a p‑value of 0.0007. Because the p‑value is below the conventional α = 0.05 threshold, there is strong evidence that the age slopes differ for exercisers and non‑exercisers. The negative sign indicates that the increase in physically unhealthy days with age is substantially smaller for exercisers than for non‑exercisers. In other words, the age‑related rise in poor physical health is much steeper among people who do not exercise.
3d. (5 pts) Now fit a model with an interaction
between age and education: physhlth_days ~ age * education.
How many interaction terms are produced? Use a partial F-test to test
whether the age \(\times\) education
interaction as a whole is significant.
# Model without interaction (additive model)
mod_add_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)
# Model with interaction
mod_int_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ age + education + age:education, data = brfss_ci)
tidy(mod_int_edu, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Education → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7542 | 1.5422 | 1.7859 | 0.0742 | -0.2692 | 5.7775 |
| age | 0.0706 | 0.0269 | 2.6277 | 0.0086 | 0.0179 | 0.1233 |
| educationHS graduate | -1.4016 | 1.6900 | -0.8293 | 0.4069 | -4.7146 | 1.9115 |
| educationSome college | -1.1261 | 1.6893 | -0.6666 | 0.5051 | -4.4380 | 2.1857 |
| educationCollege graduate | -2.7102 | 1.6580 | -1.6346 | 0.1022 | -5.9606 | 0.5402 |
| age:educationHS graduate | -0.0197 | 0.0296 | -0.6661 | 0.5054 | -0.0776 | 0.0382 |
| age:educationSome college | -0.0326 | 0.0295 | -1.1039 | 0.2697 | -0.0904 | 0.0253 |
| age:educationCollege graduate | -0.0277 | 0.0289 | -0.9583 | 0.3380 | -0.0844 | 0.0290 |
The Fitted Equation is:
\[\widehat{\text{Phys Days}} = 2.754+ 0.0706 (\text{age}) - 1.40(\text{educationHS graduate}) - 1.13 (\text{educationSome college} - 2.71 (\text{educationcollege graduate}) - 0.02 (\text{age} \times \text{educationHS graduate}) - 0.033 (\text{age} \times \text{educationSome college}) - 0.028(\text{age} \times \text{educationCollege graduate}) + \varepsilon\] #Interpretation:
This model includes three interaction terms, each comparing an education group to the reference category (“Less than HS”). The reference-group slope for age is 0.0706, meaning that among adults with less than a high school education, each additional year of age is associated with about 0.07 more physically unhealthy days per month. The interaction terms for “HS graduate” (–0.0197), “Some college” (–0.0326), and “College graduate” (–0.0277) are all negative and have p‑values greater than 0.05, indicating that none of these slopes differ meaningfully from the reference group. Because none of the interaction terms are statistically distinguishable from zero, the age–health relationship appears similar across all education levels, with no evidence that aging affects physical health differently for more educated versus less educated adults. This pattern is consistent with the idea that age-related increases in physically unhealthy days occur broadly across the population, regardless of educational attainment.
Individual t-tests for each interaction term may be non-significant, yet the interaction as a whole could be meaningful. To test whether age \(\times\) education is significant overall, we use a partial F-test comparing the model with and without the interaction terms:
\[H_0: \beta_5 = \beta_6 = \beta_7 = 0 \quad \text{(the age slope is the same across all education levels)}\]
mod_no_int_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)
anova(mod_no_int_edu, mod_int_edu) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Is the age x Education Interaction Significant?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ age + education | 4995 | 306764.2 | NA | NA | NA | NA |
| physhlth_days ~ age * education | 4992 | 306671.9 | 3 | 92.2713 | 0.5007 | 0.6818 |
Interpretation: The partial F‑test evaluates all three age × education interaction terms simultaneously. The F‑statistic is 0.50 with a p‑value of 0.6818, indicating that the overall interaction does not improve model fit at the conventional p =0.05 threshold. In practical terms, this means there is no evidence that the age slope differs across education levels as a whole. Although the individual interaction coefficients were all negative, none were statistically distinguishable from zero, and the omnibus test confirms that these small differences do not collectively matter. This pattern suggests that the relationship between age and physically unhealthy days is fairly consistent across education groups, with no meaningful effect modification detected in this sample.
3e. (5 pts) Create a visualization using
ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel?
pred_int_edu <- ggpredict(mod_int_edu, terms = c("age [3:12]", "education"))
ggplot(pred_int_edu, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Physical Health Days by Age and Education",
subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Interaction: Age Effect by Education Level
#Interpretation The predicted lines plot shows four predicted regression lines, one for each education group, showing how physically unhealthy days change with age. All lines slope upward, indicating that older adults tend to report more physically unhealthy days, but the slopes are nearly identical across education levels. The “Less than HS” and “HS graduate” groups rise at almost the same rate, and the “Some college” and “College graduate” lines follow a very similar line as well.
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate.
# Crude model: Exercise → physical health days
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
# Display results
tidy(mod_crude, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
kable(
caption = "Crude Model: physhlth_days ~ exercise",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
print()## <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
## <caption>Crude Model: physhlth_days ~ exercise</caption>
## <thead>
## <tr>
## <th style="text-align:left;"> Term </th>
## <th style="text-align:right;"> Estimate </th>
## <th style="text-align:right;"> SE </th>
## <th style="text-align:right;"> t </th>
## <th style="text-align:right;"> p-value </th>
## <th style="text-align:right;"> CI Lower </th>
## <th style="text-align:right;"> CI Upper </th>
## </tr>
## </thead>
## <tbody>
## <tr>
## <td style="text-align:left;"> (Intercept) </td>
## <td style="text-align:right;"> 7.1027 </td>
## <td style="text-align:right;"> 0.2354 </td>
## <td style="text-align:right;"> 30.1692 </td>
## <td style="text-align:right;"> 0 </td>
## <td style="text-align:right;"> 6.6412 </td>
## <td style="text-align:right;"> 7.5643 </td>
## </tr>
## <tr>
## <td style="text-align:left;"> exerciseYes </td>
## <td style="text-align:right;"> -4.7115 </td>
## <td style="text-align:right;"> 0.2656 </td>
## <td style="text-align:right;"> -17.7401 </td>
## <td style="text-align:right;"> 0 </td>
## <td style="text-align:right;"> -5.2322 </td>
## <td style="text-align:right;"> -4.1908 </td>
## </tr>
## </tbody>
## </table>
# Extract the exercise coefficient (this is our CRUDE unadjusted estimate)
b_crude <- coef(mod_crude)["exerciseYes"]Interpretation - Non‑exercisers have an average of 7.10 physically unhealthy days per month. - Exercisers have 4.71 fewer unhealthy days than non‑exercisers (β = –4.7115, p < 0.001). - The confidence interval (–5.23 to –4.19) shows this is a large, precise, and highly statistically significant difference.
People who report exercising have substantially fewer physically unhealthy days than those who do not. This is a crude association, suggesting that exercise is linked to better physical health before considering any other factors.
4b. (10 pts) Systematically assess whether each of
the following is a confounder of the exercise-physical health
association: age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]Present your results in a single summary table.
b_crude <- coef(mod_crude)["exerciseYes"]
confounders_ex <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep hrs" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income cat" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)
conf_table_ex <- map_dfr(names(confounders_ex), function(name) {
mod <- confounders_ex[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
pct_chg <- abs(b_crude - b_adj_val) / abs(b_crude) * 100
tibble(
Covariate = name,
`Crude β (exercise)` = round(b_crude, 4),
`Adjusted β` = round(b_adj_val, 4),
`% Change` = round(pct_chg, 1),
`Confounder (>10%)?` = ifelse(pct_chg > 10, "Yes", "No")
)
})
conf_table_ex %>%
kable(caption = "Table 4b. Systematic Confounding Assessment: Exercise → Physical Health Days") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
column_spec(5, bold = TRUE,
color = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "white", "black"),
background = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "#C0392B", "#27AE60"))| Covariate | Crude β (exercise) | Adjusted β | % Change | Confounder (>10%)? |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Sleep hrs | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Income cat | -4.7115 | -3.9406 | 16.4 | Yes |
# Step 1: Crude model (already done in 4a)
# Step 2: Fit models with each potential confounder
mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)
mod_adj_sex <- lm(physhlth_days ~ exercise + sex, data = brfss_ci)
mod_adj_sleep <- lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci)
mod_adj_edu <- lm(physhlth_days ~ exercise + education, data = brfss_ci)
mod_adj_income <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
# Step 3: Extract exercise coefficient from EACH model
# (NOT the covariate coefficient - we want the exerciseYes coefficient)
b_adj_age <- coef(mod_adj_age)["exerciseYes"]
b_adj_sex <- coef(mod_adj_sex)["exerciseYes"]
b_adj_sleep <- coef(mod_adj_sleep)["exerciseYes"]
b_adj_edu <- coef(mod_adj_edu)["exerciseYes"]
b_adj_income <- coef(mod_adj_income)["exerciseYes"]
# Step 4: Calculate percent change for EACH covariate
pct_change_age <- abs(b_crude - b_adj_age) / abs(b_crude) * 100
pct_change_sex <- abs(b_crude - b_adj_sex) / abs(b_crude) * 100
pct_change_sleep <- abs(b_crude - b_adj_sleep) / abs(b_crude) * 100
pct_change_edu <- abs(b_crude - b_adj_edu) / abs(b_crude) * 100
pct_change_income <- abs(b_crude - b_adj_income) / abs(b_crude) * 100
# Step 5: Create confounding assessment table
confounding_table <- tribble(
~Covariate, ~"Crude β", ~"Adjusted β", ~"% Change", ~"Confounder (>10%)?",
# Age
"Age",
round(b_crude, 4),
round(b_adj_age, 4),
round(pct_change_age, 1),
ifelse(pct_change_age > 10, "YES", "NO"),
# Sex
"Sex",
round(b_crude, 4),
round(b_adj_sex, 4),
round(pct_change_sex, 1),
ifelse(pct_change_sex > 10, "YES", "NO"),
# Sleep
"Sleep Hours",
round(b_crude, 4),
round(b_adj_sleep, 4),
round(pct_change_sleep, 1),
ifelse(pct_change_sleep > 10, "YES", "NO"),
# Education
"Education",
round(b_crude, 4),
round(b_adj_edu, 4),
round(pct_change_edu, 1),
ifelse(pct_change_edu > 10, "YES", "NO"),
# Income
"Income",
round(b_crude, 4),
round(b_adj_income, 4),
round(pct_change_income, 1),
ifelse(pct_change_income > 10, "YES", "NO")
)
# Step 6: Display the table
confounding_table %>%
kable(
caption = "Confounding Assessment: Do Covariates confound the exercise- physical health relationship?",
col.names = c("Covariate", "Crude β (Exercise)", "Adjusted β (Exercise)",
"% Change from Crude", "Meets 10% Threshold?")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Covariate | Crude β (Exercise) | Adjusted β (Exercise) | % Change from Crude | Meets 10% Threshold? |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | NO |
| Sex | -4.7115 | -4.6974 | 0.3 | NO |
| Sleep Hours | -4.7115 | -4.6831 | 0.6 | NO |
| Education | -4.7115 | -4.3912 | 6.8 | NO |
| Income | -4.7115 | -3.9406 | 16.4 | YES |
Most covariates (age, sex, sleep, education) change the exercise coefficient by less than 10%, meaning they do not confound the exercise–health relationship, except Income. Adjusting for income reduces the exercise coefficient from –4.71 to –3.94, a 16.4% shift, which exceeds the 10% threshold.
Income appears to explain part of the crude association between exercise and physical health. - People with higher incomes may be more likely to exercise, and - They may also have better baseline physical health, independent of exercise. Because income is related to both the exposure (exercise) and the outcome (physical health days), it acts as a confounder.
#Interpretation
Exercise is strongly associated with fewer physically unhealthy days in the crude model. After evaluating potential confounders, income emerges as the only variable that meaningfully alters this association.
4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?
# Identify confounders from the table
var_map <- c("Age"="age","Sex"="sex","Sleep hrs"="sleep_hrs",
"Education"="education","Income cat"="income_cat")
identified <- conf_table_ex %>%
filter(`Confounder (>10%)?` == "Yes") %>%
pull(Covariate)
conf_vars_final <- var_map[identified]
cat("Confounders identified (>10%):", paste(identified, collapse=", "), "\n")## Confounders identified (>10%): Income cat
final_formula <- as.formula(
paste("physhlth_days ~ exercise +", paste(conf_vars_final, collapse=" + "))
)
cat("Final model formula:", deparse(final_formula), "\n\n")## Final model formula: physhlth_days ~ exercise + income_cat
mod_final_ex <- lm(final_formula, data = brfss_ci)
tidy(mod_final_ex, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 4c. Fully Adjusted Model: exercise → physhlth_days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6363 | 0.3619 | 29.3924 | 0 | 9.9269 | 11.3457 |
| exerciseYes | -3.9406 | 0.2684 | -14.6842 | 0 | -4.4667 | -3.4145 |
| income_cat | -0.6750 | 0.0531 | -12.7135 | 0 | -0.7790 | -0.5709 |
b_final_ex <- coef(mod_final_ex)["exerciseYes"]
pct_overall <- abs(b_crude - b_final_ex) / abs(b_crude) * 100
tribble(
~Model, ~`Exercise β`, ~`% Change from Crude`,
"Crude", round(b_crude, 4), "---",
"Fully adjusted", round(b_final_ex, 4), paste0(round(pct_overall, 1), "%")
) %>%
kable(caption = "Table 4c. Crude vs. Adjusted Exercise Coefficient") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Model | Exercise β | % Change from Crude |
|---|---|---|
| Crude | -4.7115 | — |
| Fully adjusted | -3.9406 | 16.4% |
Interpretation: After full adjustment, exercisers reported 3.94 fewer physically unhealthy days per month than non‑exercisers, reflecting a 16.4% change from the crude estimate of –4.71 days. This decrease indicates that part of the crude association was inflated by confounding, primarily income, yet the adjusted coefficient remains large, negative, and highly statistically significant.
4d. (5 pts) Is gen_health a confounder
or a mediator of the exercise-physical health relationship? Could it be
both? Explain your reasoning with reference to the three conditions for
confounding and the concept of the causal pathway.
mod_genhlth <- lm(physhlth_days ~ exercise + gen_health, data = brfss_ci)
b_genhlth <- coef(mod_genhlth)["exerciseYes"]
pct_genhlth <- abs(b_crude - b_genhlth) / abs(b_crude) * 100
cat("Exercise coefficient adjusted for gen_health:", round(b_genhlth, 4), "\n")## Exercise coefficient adjusted for gen_health: -1.6596
## % change from crude: 64.8 %
Adjusting for general health changes the exercise coefficient from –4.71 to –1.66, a 64.8% change. That’s a large change, so statistically, gen_health behaves like a strong confounder or mediator (or both) of the exercise–physical health association.
However, a variable is a confounder if it: 1. Is associated with the exposure (exercise) People in better general health are more likely to exercise regularly. 2. Is associated with the outcome (physical unhealthy days), independent of the exposure General health is strongly related to physically unhealthy days even after accounting for exercise. 3. Is not on the causal pathway between exposure and outcome It must not be a consequence of exercise. If we think of gen_health as a baseline health status measured before or independent of exercise, then it satisfies all three and acts as a confounder, meaning healthier people both exercise more and have fewer unhealthy days.
A variable is a mediator if exercise improves general health, which in turn reduces unhealthy days. In that causal story, gen_health lies on the pathway from exercise to physical health.So it can be part both. Because the data is cross‑sectional, we cannot definitively distinguish confounding from mediation, but adjusting for general health likely remove baseline differences.
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
Do not use statistical jargon.
Adults who reported exercising had about 3.9 fewer physically unhealthy days per month than those who did not, even after accounting for age, sex, sleep, education, and income. This benefit was consistent across groups, and none of the subgroup analyses showed meaningful differences in the exercise–health relationship. Income was the only factor that changed the exercise effect, suggesting that part of crude difference was due to socioeconomic conditions rather than exercise alone. Even so, the adjusted difference remained large, indicating that regular physical activity is strongly linked to better physical health. Because these data are cross‑sectional, we cannot be sure whether exercise improves health or whether healthier people are more likely to exercise, and unmeasured factors such as chronic illness or access to safe places to be active (for females) may also influence the results.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument.
General health should not be included in the final model because it is likely part of the pathway through which exercise affects physical health. Regular activity can improve a person’s overall health, which then leads to fewer unhealthy days; adjusting for general health would remove this pathway and make the exercise effect look smaller than it truly is. The large change in the coefficient after adjusting for general health reflects this mediation, not confounding. If the goal is to estimate the total effect of exercise, general health must be omitted so that the model does not control away part of the exercise benefit.
End of Lab Activity