In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))EPI 553 — Dummy Variables Lab
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv <- readRDS(
"~/Desktop/EPI553/data/brfss_dv_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital Status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
The poor general health status group reports the most mentally unhealthy days. The pattern in the boxplot appears consistent with what you would expect.
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
The separated marital status group has he highest mean and the widowed has the lowest mean of mentally unhealthy days.
brfss_dv |>
group_by(marital_status) |>
summarize(
mean_unhealthydays = mean(menthlth_days, na.rm = TRUE)
)## # A tibble: 6 × 2
## marital_status mean_unhealthydays
## <fct> <dbl>
## 1 Married 3.10
## 2 Divorced 4.49
## 3 Widowed 2.67
## 4 Separated 6.22
## 5 Never married 5.28
## 6 Unmarried couple 6.07
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
The coefficient for gen_health_numeric is 1.78. Which indicates that for each one unit (status) increase in general health status the mean number of mentally unhealthy days increase by 1.78 days
#create numeric version
brfss_dv <- brfss_dv |>
mutate(
gen_health_numeric= case_when(
gen_health == "Excellent" ~1,
gen_health == "Very Good" ~2,
gen_health == "Good" ~3,
gen_health == "Fair" ~4,
gen_health == "Poor" ~5
)
)
#fit simple linear regression
naive_mod <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: General Health Status Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.2899 | 0.3404 | -0.8516 | 0.3945 | -0.9574 | 0.3776 |
| gen_health_numeric | 1.7842 | 0.1175 | 15.1859 | 0.0000 | 1.5538 | 2.0145 |
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
The factor version uses 4 coefficients instead of 1 because the naive approach regression model assumes there are equally spaced, linear increments between the groups (excellent, very good, good, fair, and poor). The factor version treats one category as a reference variable, in this case excellent, and then 4 coefficients are given that represent the difference in mentally unhealthy days between the reference (excellent) and that category. The naive approach is misleading because for ordinal categories the assumption of equal spacing between the groups is not justified.
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.12 0.233 9.08 1.55e-19 1.66 2.57
## 2 gen_healthVery good 0.590 0.294 2.01 4.48e- 2 0.0137 1.17
## 3 gen_healthGood 1.95 0.308 6.34 2.54e-10 1.35 2.56
## 4 gen_healthFair 5.06 0.406 12.5 4.23e-35 4.27 5.86
## 5 gen_healthPoor 9.66 0.609 15.9 2.34e-55 8.47 10.9
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
MentalHealthDays= 9.593 + -0.0867(age) + 1.7254(female) + 0.2314(PhysDays) + -0.5866(Sleep) + 0.7899(VeryGood) + 1.8436(Good) + 3.3953(Fair) + 5.3353(Poor)
# Fit model with gen health as a factor
mod_health <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
tidy(mod_health, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with General Health Dummy Variables (Reference: Excellent)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
Very Good (B= 0.7899): Compared to those with excellent general health, individuals with very good health report an estimated 0.7899 more mentally unhealthy days, holding all other variables constant. Good (B= 1.8436): Compared to those with excellent general health, individuals with good health report an estimated 1.8436 more mentally unhealthy days, holding all other variables constant. Fair (B= 3.3953): Compared to those with excellent general health, individuals with fair health report an estimated 3.3953 more mentally unhealthy days, holding all other variables constant. Poor (B= 5.3353): Compared to those with excellent general health, individuals with poor health report an estimated 5.3353 more mentally unhealthy days, holding all other variables constant.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
The poor general health status group differs most from the reference group.
coef_gen_health <- tidy(mod_factor, conf.int = TRUE) |>
filter(grepl("gen_health", term))
ggplot(coef_gen_health, aes(x = term, y = estimate)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
coord_flip() +
labs(
title = "Coefficient Plot for General Health (Reference: Excellent)",
x = "General Health Status",
y = "Estimated Effect on Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
# Change reference group to good
brfss_dv$genhealth_reref <- relevel(brfss_dv$gen_health, ref = "Good")
mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + genhealth_reref,
data = brfss_dv)
tidy(mod_gen_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| genhealth_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| genhealth_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| genhealth_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| genhealth_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
The continuous variables stayed the same between the two models. They are the same because changing the reference group only changes the interpretation of the dummy variable coefficients not the model’s fit or predictions. The continuous variables are uneffected by the reference group change so their coefficients remain the same.
tribble(
~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
"Intercept", round(coef(mod_health)[1], 3), round(coef(mod_gen_reref)[1], 3),
"Age coefficient", round(coef(mod_health)[2], 3), round(coef(mod_gen_reref)[2], 3),
"Sex coefficient", round(coef(mod_health)[3], 3), round(coef(mod_gen_reref)[3], 3),
"Physical health days", round(coef(mod_health)[4], 3), round(coef(mod_gen_reref)[4], 3),
"Sleep hours", round(coef(mod_health)[5], 3), round(coef(mod_gen_reref)[5], 3),
"R-squared", round(summary(mod_health)$r.squared, 4), round(summary(mod_gen_reref)$r.squared, 4),
"Residual SE", round(summary(mod_health)$sigma, 3), round(summary(mod_gen_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Excellent | Ref: Good |
|---|---|---|
| Intercept | 9.5930 | 11.4370 |
| Age coefficient | -0.0870 | -0.0870 |
| Sex coefficient | 1.7250 | 1.7250 |
| Physical health days | 0.2310 | 0.2310 |
| Sleep hours | -0.5870 | -0.5870 |
| R-squared | 0.1694 | 0.1694 |
| Residual SE | 7.2080 | 7.2080 |
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
Changing the reference groups dont change the predictions instead it changes how we word our interpretation.
# Verify that predicted values are identical
pred_orig <- predict(mod_health)
pred_reref <- predict(mod_gen_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
Reduced model R^2= 0.1522 Adjusted R^2= 0.1515
Full model R^2= 0.1694 Adjusted R^2= 0.1681
# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
glance(mod_reduced)## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.152 0.152 7.28 224. 3.17e-177 4 -17018. 34047. 34087.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.169 0.168 7.21 127. 8.21e-195 8 -16966. 33953. 34018.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
H0= gen health does not improve the model H1= gen health improves the model F-stat= 25.8838 df= 4 (residual= 4991) p-value= 0 The p-value is <0.001 therefore we reject the null hypothesis.
# Partial F-test
f_test <- anova(mod_reduced, mod_health)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
They are consistent.
Anova(mod_health, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
The poor general health group differs most from the excellent (reference) group, therefore individuals with poor general health status tend to experience more mentally unhealthy days than those with excellent general health. Compared to those with excellent general health, individuals with poor health report an estimated 5.3353 more mentally unhealthy days, holding all other variables constant. Overall, as general health worsens, the number of mentally unhealthy days increases. Due to this being a cross sectional study with data being collected from a single point in time, we cannot assume causality (that poor health causes poor mental health) only correlation.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
End of Lab Activity