EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(flextable)
## Warning: package 'flextable' was built under R version 4.5.3
##
## Attaching package: 'flextable'
##
## The following object is masked from 'package:gtsummary':
##
## continuous_summary
##
## The following objects are masked from 'package:kableExtra':
##
## as_image, footnote
##
## The following object is masked from 'package:purrr':
##
## compose
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following objects are masked from 'package:flextable':
##
## highlight, style
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
brfss_dv <- readRDS(
"C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_dv_2020.rds"
)
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()
Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Sleep (hours/night) | 5,000 | 7.0 (1.4) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
ggplot(brfss_dv, aes(x = marital_status, y = menthlth_days, fill = menthlth_days)) +
geom_bar(stat = "summary", fun= "mean", alpha = 0.85) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mean Mentally Unhealthy Days by Marital Status",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect? The poor general health group has
the most mentally unhealthy days. This is consistent with what I would
expect. 1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest? Separated has the highest mean and widowed has the
lowest mean. —
brfss_dv <- brfss_dv |>
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_
)
)
# Fit model
naive_mod <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: General Health Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| gen_health_numeric | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
# Fit model
reg_mod <- lm(menthlth_days ~ gen_health, data = brfss_dv)
tidy(reg_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Regular Model: General Health Treated as Factor",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it. The coefficient is 1.8578, meaning that for every 1
unit increase in general health, the number of poor mental health days
increases by 1.8578 days. 2b. (5 pts) Now fit the same
model but treating gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading. The factor version uses 4
coefficients, because it is using excellent general health as a
reference category, and each level needs to be compared to something.
The naive approach may be misleading, because general health level is
not a numeric variable. The numeric codes to each level in general
health have no significant meaning, meaning running a regression with
them doesn’t really have that much of a significant meaning. You can’t
really tell the distance between “excellent” general health and “very
good” general health, for example. —
# Fit model
new_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
tidy(new_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "New Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
ggcoef_model(new_mod,
include = c("gen_health"),
variable_labels = c(
gen_health = "General Health"),
facet_labeller = ggplot2::label_wrap_gen(10)
)
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation. menthlth_days = 9.5930 -
0.0867(age) + 1.7254(female) + 0.2314(physhlth_days) - 0.5866(sleep_hrs)
+ 0.7899 (very good gen_health) + 1.8436(good gen_health) + 3.3953 (fair
gen_health) + 5.3353(poor gen_health) 3b. (10 pts)
Interpret every dummy variable coefficient for
gen_health in plain language. Be specific about the
reference group, the direction and magnitude of each comparison, and
include the phrase “holding all other variables constant.” Compared to
those with excellent general health, those with very good general health
have 0.7899 more poor mental health days, holding all other variables
constant. Compared to those with excellent general health, those with
good general health have 1.8436 more poor mental health days, holding
all other variables constant. Compared to those with excellent general
health, those with poor general health have 3.3953 more poor mental
health days, holding all other variables constant. Compared to those
with excellent general health, those with poor general health have
5.3353 more poor mental health days, holding all other variables
constant. 3c. (10 pts) Create a coefficient plot
(forest plot) showing the estimated coefficients and 95% confidence
intervals for the gen_health dummy variables only. Which
group differs most from the reference group? Poor general health group
differs the most from the reference group. —
# Change reference group to Good gen_health
brfss_dv$gen_reref <- relevel(brfss_dv$gen_health, ref = "Good")
mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_reref, data = brfss_dv)
tidy(mod_gen_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
pred_orig <- predict(new_mod)
pred_reref <- predict(mod_gen_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not? The education coefficients are not the same, because the output given is the conclusion drawn for mental health days when compared to good general health, not excellent general health. The other predictors stay the same, because the rest of the model stayed the same, the only thing that changed was the reference category for general health. 4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions. Changing the reference group does not change the predictions, because it only changes the interpretation of the dummy variables. Everything else stays constant, besides the intercept, including the data and variables used. —
# Fit model
reduced_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
tidy(reduced_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "New Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.5041 | 0.6124 | 17.1531 | 0 | 9.3036 | 11.7046 |
| age | -0.0773 | 0.0060 | -12.9721 | 0 | -0.0890 | -0.0656 |
| sexFemale | 1.6864 | 0.2074 | 8.1296 | 0 | 1.2797 | 2.0931 |
| physhlth_days | 0.3172 | 0.0132 | 24.0152 | 0 | 0.2913 | 0.3431 |
| sleep_hrs | -0.6330 | 0.0772 | -8.2029 | 0 | -0.7843 | -0.4817 |
summary(reduced_mod)
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
## data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.314 -3.520 -1.805 0.159 32.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.504082 0.612374 17.153 < 2e-16 ***
## age -0.077335 0.005962 -12.972 < 2e-16 ***
## sexFemale 1.686413 0.207440 8.130 5.38e-16 ***
## physhlth_days 0.317247 0.013210 24.015 < 2e-16 ***
## sleep_hrs -0.633026 0.077171 -8.203 2.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1515
## F-statistic: 224.2 on 4 and 4995 DF, p-value: < 2.2e-16
summary(new_mod)
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.592982 0.630441 15.216 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthVery good 0.789947 0.279661 2.825 0.00475 **
## gen_healthGood 1.843601 0.297260 6.202 6.03e-10 ***
## gen_healthFair 3.395283 0.417964 8.123 5.66e-16 ***
## gen_healthPoor 5.335347 0.682949 7.812 6.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
f_test <- anova(reduced_mod, new_mod)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
# Type III using car::Anova()
anova_type3 <- Anova(new_mod, type = "III")
print(anova_type3)
## Anova Table (Type III tests)
##
## Response: menthlth_days
## Sum Sq Df F value Pr(>F)
## (Intercept) 12031 1 231.536 < 2.2e-16 ***
## age 10908 1 209.926 < 2.2e-16 ***
## sex 3664 1 70.512 < 2.2e-16 ***
## physhlth_days 10634 1 204.654 < 2.2e-16 ***
## sleep_hrs 3049 1 58.687 2.207e-14 ***
## gen_health 5380 4 25.884 < 2.2e-16 ***
## Residuals 259335 4991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the
full model (from Task 3a). The \(R^2\)
for the full model is 0.1694 and the \(R^2\) for the reduced model is 0.1522. The
adjusted \(R^2\) for the full model is
0.1681 and the adjusted \(R^2\) for the
reduced model is 0.1515. 5b. (10 pts) Conduct a partial
F-test using anova() to test whether
gen_health as a whole significantly improves the model.
State the null and alternative hypotheses. Report the F-statistic,
degrees of freedom, and p-value. State your conclusion. The null
hypothesis is that general health does not significantly improve the
model, and the alternative hypothesis is that general health does
significantly improve the model. The F-statistic is 25.8838, the degrees
of freedom are 4, and the p-value is 0. This means you reject the null
hypothesis. General health does significantly improve the model.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Yes, they are consistent. —
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
End of Lab Activity