EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.4.3
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.4.3
brfss_dv <- readRDS(
"/Users/sarah/OneDrive/Documents/EPI 553/data/brfss_dv_2020.rds"
)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()
Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
ggplot(brfss_dv, aes(x= gen_health, y= menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
brfss_dv %>%
group_by(marital_status) %>%
summarise(mean_menthlth = mean(menthlth_days, na.rm = TRUE),
n =n()
)%>%
arrange(desc(mean_menthlth))
## # A tibble: 6 × 3
## marital_status mean_menthlth n
## <fct> <dbl> <int>
## 1 Separated 6.22 109
## 2 Unmarried couple 6.07 179
## 3 Never married 5.28 848
## 4 Divorced 4.49 622
## 5 Married 3.10 2708
## 6 Widowed 2.67 534
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
brfss_dv <- brfss_dv %>%
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very Good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5
)
)
model_2a <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
summary(model_2a)
##
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6309 -5.0626 -1.4943 -0.0626 28.5057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2899 0.3404 -0.852 0.395
## gen_health_numeric 1.7842 0.1175 15.186 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.335 on 3195 degrees of freedom
## (1803 observations deleted due to missingness)
## Multiple R-squared: 0.06732, Adjusted R-squared: 0.06703
## F-statistic: 230.6 on 1 and 3195 DF, p-value: < 2.2e-16
The coefficient equals 1.7842. For each one-unit worsening in general health status, the number of mentally unhealthy days increases by about 1.78 days.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
model_2b <- lm(menthlth_days ~ gen_health, data = brfss_dv)
summary(model_2b)
##
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7814 -4.0708 -2.7077 -0.1174 27.8826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1174 0.2332 9.079 < 2e-16 ***
## gen_healthVery good 0.5903 0.2941 2.007 0.0448 *
## gen_healthGood 1.9535 0.3082 6.337 2.54e-10 ***
## gen_healthFair 5.0624 0.4064 12.457 < 2e-16 ***
## gen_healthPoor 9.6640 0.6090 15.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared: 0.07334, Adjusted R-squared: 0.07259
## F-statistic: 98.83 on 4 and 4995 DF, p-value: < 2.2e-16
The factor version uses four coefficients because general health has 5 categories, each non-reference category needs its own comparison with the reference group. The numeric model uses 1 coefficient since it treats general health as a continuous variable. This is misleading because differences between categories are not equal.
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation. \[Menthlth_days = 9.5930 + -0.0867(age) + 1.7254 (female) + 0.2314 (physhlth_days) + -0.5866(sleep) + 0.7899(very good) + 1.8436(good) + 3.3953(fair) + 5.3353(poor) + \varepsilon\]
mod_gen_health <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
tidy(mod_gen_health, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with General Health Status Dummy Variables",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
The reference group for gen_health is Excellent health.
Holding all other variables constant, adults reporting very good health
have 0.79 more mentally unhealthy days per month compared to those
reporting excellent health. Adults reporting good health have 1.84 more
mentally unhealthy days per month than those reporting excellent health.
Adults reporting fair health have 3.40 more mentally unhealthy days per
month compared to those reporting excellent health. Adults reporting
poor health have 5.34 more mentally unhealthy days per month than those
reporting excellent health.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
The poor health group differs the most from the reference group.
tidy(mod_gen_health, conf.int = TRUE) %>%
filter(str_detect(term, "gen_health")) %>%
mutate(term = str_replace(term, "gen_health", "")) %>%
ggplot(aes(x = term, y = estimate)) +
geom_point(size = 3, color = "steelblue") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.15, color = "steelblue") +
coord_flip() +
labs(
title = "Coefficient Plot for General Health Categories",
subtitle = "Reference group: Excellent health",
x = "General Health Category",
y = "Estimated Difference in Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
brfss_dv$gen_health_ref <- relevel(brfss_dv$gen_health, ref = "Good")
model_4a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_ref, data = brfss_dv)
tidy(model_4a, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_refExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_refVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_refFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_refPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
The coefficient for age, sex, physhlth_days, and sleep_hrs are the same in both models. Changing the reference group only affects the intercept and dummy-variable coefficients.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
pred_orig <- predict(mod_gen_health)
pred_reref <- predict(model_4a)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
Changing the reference group shifts the intercept and dummy coefficients. These changes cancel each other out when computing predicted values. All continuous predictors keep the same slopes.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
reduced_model <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data= brfss_dv)
r2_table <- tibble(
Model = c("Reduced model", "Full model"),
R2 = c(summary(reduced_model)$r.squared,
summary(mod_gen_health)$r.squared),
Adj_R2 = c(summary(reduced_model)$adj.r.squared,
summary(mod_gen_health)$adj.r.squared)
) |>
mutate(across(where(is.numeric), round, 4))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(where(is.numeric), round, 4)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
r2_table |>
kable(caption = "R² and Adjusted R² for Reduced vs. Full Model") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Model | R2 | Adj_R2 |
|---|---|---|
| Reduced model | 0.1522 | 0.1515 |
| Full model | 0.1694 | 0.1681 |
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
f_test <- anova(reduced_model, mod_gen_health)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Status Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
H0:_Very good = _Excellent = _Fair = _Poor = 0
Ha: At least one _gen_health =! 0
F-statistic = 25.8838 Degrees of freedom = 4 P-value = <.001
We reject the null hypothesis since the p value is far less than 0.05. General health status as a whole significantly improves the prediction of mentally unhealthy days, even after adjusting for age, sex, physical health, and sleep.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(mod_gen_health, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
Theh Type III ANOVA result for gen_health is consistent with the partial F-test.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
Adults reporting poorer general health also tend to report more days of poor mental health. Compared with people in excellent health, those in fair or poor health experience several additional mentally unhealthy days each month, with the largest different seem among poor health. People in good and very good health status show smaller differences, but still report slightly more mentally unhealthy days. Because the data was collected at one point in time, we cannot say that general health status causes changes in mental health.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
General health status shows a stronger relationship with mentally unhealthy days than education. In the education model, differences across education levels are small, and only college graduates report noticeably fewer mentally unhealthy days than the lowest education group. In contrast, people with fair or poor general health report several more mentally unhealthy days than those in excellent health. If I were to build a final model I would include general health as it adds more explanatory value.
End of Lab Activity