In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
brfss_dv <- readRDS(
"C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/brfss_dv_2020.rds") |>
clean_names()1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv|>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
| Age (years) | 5,000 | 54.9 (17.5) |
| Sex | 5,000 | |
| Male | 2,303 (46%) | |
| Female | 2,697 (54%) | |
| General health status | 5,000 | |
| Excellent | 1,065 (21%) | |
| Very good | 1,803 (36%) | |
| Good | 1,426 (29%) | |
| Fair | 523 (10%) | |
| Poor | 183 (3.7%) | |
| Marital status | 5,000 | |
| Married | 2,708 (54%) | |
| Divorced | 622 (12%) | |
| Widowed | 534 (11%) | |
| Separated | 109 (2.2%) | |
| Never married | 848 (17%) | |
| Unmarried couple | 179 (3.6%) | |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by General Health
Poor general health individuals report the most mentally unhealthy days. The pattern is exactly what I expected – the number of poor mental health days increases with worsening general health status.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(marital_status))
group_means %>%
kable(caption = "Descriptive Statistics by Marital Status",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| marital_status | mean_days |
|---|---|
| Married | 3.100443 |
| Widowed | 2.670412 |
| Never married | 5.278302 |
| Divorced | 4.487138 |
| Unmarried couple | 6.067039 |
| Separated | 6.220184 |
“Separated” has the highest mean number of mentally unhealthy days. “Widowed” has the lowest mean number of mentally unhealthy days.
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
brfss_dv <- brfss_dv |>
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_))
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Gen Health Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| gen_health_numeric | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
The coefficient is 1.8578. For each increase in gen_health level, there is an increase of 1.8578 poor mental health days.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
new_mod <- lm(menthlth_days ~ gen_health, data = brfss_dv)
tidy(new_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "New Model: Gen Health Treated as Factor",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
The factor version uses four coefficients because the regression is now comparing each category to the reference category of Gen_health = Excellent. The numeric approach is misleading because it assumes that the difference between each category is the same. However, in the case of this ordinal variable, the differences between groups are not quite equal. For example, it is not justified to say that the difference between Poor health and Fair health is the same as the difference between Very good and Excellent health.
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
dummy_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
tidy(dummy_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Dummy model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
Write out the fitted regression equation.
Menthlth_days = 9.5930 + (-0.0867 * age) + (1.7254 * sexFemale) + (0.2314 * physhlth_days) + (-0.5866 * sleep_hrs) + (0.7899 * gen_healthVery good) + (1.8436 * gen_healthGood) + (3.3953 * gen_healthFair) + (5.3353 * gen_healthPoor)
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
gen_health = Very good is associated with 0.7899 more mentally unhealthy days than gen_health = Excellent, holding all other variables constant. gen_health = Good is associated with 1.8436 more mentally unhealthy days than gen_health = Excellent, holding all other variables constant. gen_health = Fair is associated with 3.3953 more mentally unhealthy days than gen_health = Excellent, holding all other variables constant. gen_health = Poor is associated with 5.3353 more mentally unhealthy days than gen_health = Excellent, holding all other variables constant.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
tidy(dummy_mod, conf.int = TRUE) |>
filter(str_detect(term, "gen_health")) |>
mutate(
term = str_replace(term, "gen_health", ""),
term = factor(term, levels = c("Very good", "Good", "Fair", "Poor"))
) |>
ggplot(aes(x = estimate, y = term, color = term)) +
geom_point(size = 4) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
scale_color_brewer(palette = "Dark2", direction = -1) +
labs(
title = "Coefficient Plot for Gen_health",
subtitle = "Reference: Health = Excellent, adjusted for other variables",
x = "Difference in Mentally Unhealthy Days",
y = "General Health Category"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Poor Health differs most from the reference group.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
brfss_dv$gen_health_new <- relevel(brfss_dv$gen_health, ref = "Good")
dummy_mod_new <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_new,
data = brfss_dv)
tidy(dummy_mod_new, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Good Health Reference",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_newExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_newVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_newFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_newPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
The other variables that were controlled for are the same, which is expected because changing the reference of one variable should not affect the relationship between mental health and the other variables. What did change was the intercept and the coefficients for the Gen Health categories, which makes sense because we have releveled the variable and changed the reference category.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
original <- predict(dummy_mod)
releveled <- predict(dummy_mod_new)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(round(max(abs(original - releveled)), 10),
round(cor(original, releveled), 10))
) |>
kable(caption = "Predictions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
The correlation between the two predictions is 1, so the predicted values are identical between the two predictions. Changing the reference group does not change predictions because it only changes the reference point for dummy variable coefficients, not the actual model itself.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
mod_nohlth <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
data = brfss_dv)
tribble(
~Model, ~R2, ~`Adjusted R2`,
"No gen_health", round(summary(mod_nohlth)$r.squared, 4),
round(summary(mod_nohlth)$adj.r.squared, 4),
"Gen_health", round(summary(dummy_mod)$r.squared, 4),
round(summary(dummy_mod)$adj.r.squared, 4)
) |>
kable(caption = "Reduced vs. Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | R2 | Adjusted R2 |
|---|---|---|
| No gen_health | 0.1522 | 0.1515 |
| Gen_health | 0.1694 | 0.1681 |
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
No gen_health Model R2: 0.1522 Adjusted R2: 0.1515
Gen_health
Model R2: 0.1694 Adjusted R2: 0.1681
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
F_test_hlth <- anova(mod_nohlth, dummy_mod)
F_test_hlth |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "F-test of Reduced vs. Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
Null hypothesis: Gen_health does not significantly improve the model. Alternative hypothesis: Gen_health does significantly improve the model.
F-statistic: 25.8838 df1 = 4991 df2 = 4 p value = 0
I conclude that adding gen_health to the model significantly improves its predictive ability.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(dummy_mod, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type 3 ANOVA on Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
When compared to the reference category of Excellent Health, worsening general health category is associated with a higher number of poor mental health days. Compared to Excellent health, Very Good health is associated with 0.79 more poor mental health days, Good Health is associated with 1.8 more poor mental health days, Fair health is associated with 3.4 more poor mental health days, and Poor health is associated with 5.3 more poor mental health days, keeping age, sex, physical health, and sleep constant. BRFSS is a cross-sectional study that captures data at a single time point, so it is not possible to determine any causal effect, that is whether poor general health CAUSES poor mental health.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
General health appears to be more strongly associated with mental health days. The highest education category (College graduate) is associated with 1.1429 less poor mental health days than the reference category (< High School). The worst general health category (Poor health) is associated with 5.3 more poor mental health days than the reference category (Excellent health). Thus, the magnitude of the general health effect is higher, so I would be more inclined to use it in the model.
End of Lab Activity