EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
brfss_full <- read_xpt(
"C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/LLCP2020XPT/LLCP2020.XPT"
) |>
clean_names()
brfss_ci <- brfss_full |>
mutate(
# Outcome: physically unhealthy days in past 30
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Primary exposure: sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Mentally unhealthy days (covariate)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (4-level)
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# Exercise in past 30 days
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# General health status
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Income category (ordinal 1-8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
)
) |>
filter(
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(menthlth_days),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(exercise),
!is.na(gen_health),
!is.na(income_cat)
)
# Reproducible random sample
set.seed(1220)
brfss_ci <- brfss_ci |>
select(physhlth_days, sleep_hrs, menthlth_days, age, sex,
education, exercise, gen_health, income_cat) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_ci,
"C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_ci.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_ci), ncol(brfss_ci))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_ci <- readRDS(
"C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_ci.rds"
)
# Create scatter plot with regression line
p1 <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, fill = exercise)) +
geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
labs(
title = "Relationship Between Age and Physical Health Days, by exercise",
x = "Age (years)",
y = "Poor Physical Health Days"
) +
theme_minimal(base_size = 12)
ggplotly(p1)
## `geom_smooth()` using formula = 'y ~ x'
# Summary statistics by group
summary_stats <- brfss_ci %>%
group_by(sex, exercise) %>%
summarise(
n = n(),
Mean = mean(physhlth_days, na.rm = TRUE),
SD = sd(physhlth_days, na.rm = TRUE),
Median = median(physhlth_days, na.rm = TRUE),
Min = min(physhlth_days, na.rm = TRUE),
Max = max(physhlth_days, na.rm = TRUE)
)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by sex and exercise.
## ℹ Output is grouped by sex.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(sex, exercise))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
summary_stats %>%
kable(digits = 3,
caption = "Descriptive statistics for Physical Health Days by sex and exercise levels")
| sex | exercise | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|---|
| Male | No | 446 | 6.643 | 10.983 | 0 | 0 | 30 |
| Male | Yes | 1845 | 2.324 | 6.488 | 0 | 0 | 30 |
| Female | No | 625 | 7.430 | 11.460 | 0 | 0 | 30 |
| Female | Yes | 2084 | 2.451 | 6.322 | 0 | 0 | 30 |
# Create scatter plot with regression line
p2 <- ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, fill = education)) +
geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
labs(
title = "Relationship Between Hours of Sleep and Physical Health Days, by education",
x = "Sleep (hours)",
y = "Poor Physical Health Days"
) +
theme_minimal(base_size = 12)
ggplotly(p2) %>%
layout(hovermode = "closest")
## `geom_smooth()` using formula = 'y ~ x'
1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe. The relationship does
not appear very linear. The points are randomly scattered. There seems
to be a slight positive relationship between age, physical health days,
and exercise. 1b. (5 pts) Compute the mean
physhlth_days for each combination of sex and
exercise. Present the results in a table. Does it appear
that the association between exercise and physical health days might
differ by sex? It appears that it might differ slightly by sex.
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups. The slopes appear slightly
different across education groups. Less than high school has the
steepest slope of them all. The rest are all fairly similar. —
# Fit separate models for males and females
mod_yes <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_no <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
# Compare coefficients
bind_rows(
tidy(mod_yes, conf.int = TRUE) |> mutate(Stratum = "Yes"),
tidy(mod_no, conf.int = TRUE) |> mutate(Stratum = "No")
) |>
filter(term == "age") |>
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stratified Analysis: Age → Physical Health Days, by Exercise",
col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Stratum | Estimate | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Yes | 0.0192 | 0.0060 | 0.0075 | 0.0309 | 0.0013 |
| No | 0.0745 | 0.0212 | 0.0328 | 0.1161 | 0.0005 |
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise",
subtitle = "Are the slopes parallel or different?",
x = "Age (years)",
y = "Physically Unhealthy Days (Past 30)",
color = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
## `geom_smooth()` using formula = 'y ~ x'
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum. The slope for “yes” is 0.0192, the SE is 0.0060, and the 95% CI
is [0.0075,0.0309]. The slope for “no” is 0.0745, the SE is 0.0212, and
the 95% CI is [0.0328,0.1161]. 2b. (5 pts) Create a
single plot showing the two fitted regression lines (one per exercise
group) overlaid on the data. Are the lines approximately parallel? Yes,
the lines are approximately parallel. 2c. (5 pts) Can
you formally test whether the two slopes are different using only the
stratified results? Explain why or why not. You cannot formally test
whether the two slopes are different using only the stratified results,
as it doesn’t show the p-value to determine whether the difference is
significant or not. —
# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Exercise → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")
## Interaction term (age:exerciseYes):
cat(" Estimate:", round(int_term$estimate, 4), "\n")
## Estimate: -0.0553
cat(" t-statistic:", round(int_term$statistic, 3), "\n")
## t-statistic: -3.407
cat(" p-value:", round(int_term$p.value, 4), "\n")
## p-value: 7e-04
# Fit Model
mod_new <- lm(physhlth_days ~ age * education, data = brfss_ci)
tidy(mod_new, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Education → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7542 | 1.5422 | 1.7859 | 0.0742 | -0.2692 | 5.7775 |
| age | 0.0706 | 0.0269 | 2.6277 | 0.0086 | 0.0179 | 0.1233 |
| educationHS graduate | -1.4016 | 1.6900 | -0.8293 | 0.4069 | -4.7146 | 1.9115 |
| educationSome college | -1.1261 | 1.6893 | -0.6666 | 0.5051 | -4.4380 | 2.1857 |
| educationCollege graduate | -2.7102 | 1.6580 | -1.6346 | 0.1022 | -5.9606 | 0.5402 |
| age:educationHS graduate | -0.0197 | 0.0296 | -0.6661 | 0.5054 | -0.0776 | 0.0382 |
| age:educationSome college | -0.0326 | 0.0295 | -1.1039 | 0.2697 | -0.0904 | 0.0253 |
| age:educationCollege graduate | -0.0277 | 0.0289 | -0.9583 | 0.3380 | -0.0844 | 0.0290 |
# F-test
mod_no_int_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)
anova(mod_no_int_edu, mod_new) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Is the Age x Education Interaction Significant?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ age + education | 4995 | 306764.2 | NA | NA | NA | NA |
| physhlth_days ~ age * education | 4992 | 306671.9 | 3 | 92.2713 | 0.5007 | 0.6818 |
pred_new <- ggpredict(mod_new, terms = c("age", "education"))
ggplot(pred_new, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Age and Education",
subtitle = "From interaction model: age * education",
x = "Age (years)",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation. physhlthdays= 2.7610 + 0.0745(age) - 1.3858(exerciseYes) -
0.0553(age:exerciseYes) 3b. (5 pts) Using the fitted
equation, derive the stratum-specific equations for exercisers and
non-exercisers. Verify that these match the stratified analysis from
Task 2. Exercisers: physhlth_days= (2.7610-1.3858) +
(0.0745-0.0553)(age)= 1.3752 + 0.0192(age) Non-exercisers:
physhlth_days= 2.7610 + 0.0745(age) Both of these match the stratified
analysis from task 2. 3c. (5 pts) Conduct the t-test
for the interaction term (age:exerciseYes). State the null
and alternative hypotheses, report the test statistic and p-value, and
state your conclusion about whether interaction is present. The null
hypothesis is that the slopes are equal, and the alternative hypothesis
is that the slopes are not equal. The test statistics if -3.407, and the
p-value is 7e-04. We reject the null hypothesis. There is statistically
significant evidence that the association between age and physically
unhealthy days varies by whether a person exercises or not. 3d.
(5 pts) Now fit a model with an interaction between age and
education: physhlth_days ~ age * education. How many
interaction terms are produced? Use a partial F-test to test whether the
age \(\times\) education interaction as
a whole is significant. 3 interaction terms are produced. The age *
education as a whole is not significant, as the p-value is 0.6818, which
is greater than 0.05. 3e. (5 pts) Create a
visualization using ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel? Most of the lines are parallel, but the HS graduate and
Some college lines cross. —
# Fit crude model
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
tidy(mod_crude, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Exercise → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.1027 | 0.2354 | 30.1692 | 0 | 6.6412 | 7.5643 |
| exerciseYes | -4.7115 | 0.2656 | -17.7401 | 0 | -5.2322 | -4.1908 |
# Crude model
b_crude_val <- coef(mod_crude)["exerciseYes"]
# One-at-a-time adjusted models
confounders <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sleep" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
tibble(
Covariate = name,
`Crude β (exercise)` = round(b_crude_val, 4),
`Adjusted β (exercise)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(5, bold = TRUE)
| Covariate | Crude β (exercise) | Adjusted β (exercise) | % Change | Confounder |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sleep | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Income | -4.7115 | -3.9406 | 16.4 | Yes (>10%) |
mod_income <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
tidy(mod_income, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Exercise + Income → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6363 | 0.3619 | 29.3924 | 0 | 9.9269 | 11.3457 |
| exerciseYes | -3.9406 | 0.2684 | -14.6842 | 0 | -4.4667 | -3.4145 |
| income_cat | -0.6750 | 0.0531 | -12.7135 | 0 | -0.7790 | -0.5709 |
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate. The exercise coefficient is -4.7115.
4b. (10 pts) Systematically assess whether each of the
following is a confounder of the exercise-physical health association:
age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]Present your results in a single summary table.
4c. (5 pts) Fit a fully adjusted model including
exercise and all identified confounders. Report the exercise coefficient
and compare it to the crude estimate. How much did the estimate change
overall? The estimate changed 16.4%. The crude estimate was -4.7115 and
the estimate with confounding was -3.9406. 4d. (5 pts)
Is gen_health a confounder or a mediator of the
exercise-physical health relationship? Could it be both? Explain your
reasoning with reference to the three conditions for confounding and the
concept of the causal pathway. I would say that general health is more
likely a mediator, but it could also potentially be a confounder, as it
is related to both the exposure of exercise and physically unhealthy
days, but it could also be on the causal pathway between exercise and
physical health health, as exercise influences general health, which
influences physical health. Determining whether it is a mediator or a
confounder depends on the exact relationship between the three
variables, but it is hard to tell exactly. Based on the likliness that
exercise impacts general health which impacts physical health, I would
say it is more likely a mediator. —
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
Do not use statistical jargon.
The relationship between age and physical health vary by whether a
person exercises or not, but education level does not impact this
relationship. Income category impacts the relationship between exercise
and physical health. Compared to those who do not exercise, those who do
experience 4.7115 less physically unhealthy days. When income category
is considered and added into the model, compared to those who do not
exercise, those who do experience 3.9406 less physically unhealthy
days.When conducting cross-sectional research, causality is hard to
establish because the order in which things occur cannot be measured.
Whether or not a person exercised and when they had physically unhealthy
days cannot be measured. There were also other variables that could have
been considered for confounding, or impacting the relationship of
exercise and physically unhealthy days, so the model could be lacking
important information.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument. General health may not be a confounder,
meaning it would not be appropriate to add into the model. It is
possible that it is on the causal pathway between exercise and
physically unhealthy days, meaning it acts as a mediator instead of a
confounder. Including general health in the model could negatively
impact the true association between exercise and physically unhealthy
days, especially if it is truly acting as a mediator instead of a
confounder. Adding general health as a confounder could remove an
indirect pathway instead of limiting bias. —
End of Lab Activity