Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

  • Why numeric coding of categories fails
  • How to construct dummy variables for dichotomous and multichotomous predictors
  • The concept of the reference group and how to change it
  • Interpreting dummy variable coefficients
  • Testing whether a categorical variable as a whole is significant (partial F-test)
  • Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab


Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
gen_health General health status (5 categories) Factor
marital_status Marital status (6 categories) Factor
educ_numeric Education as numeric code (1–4) Numeric
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "~/Desktop/EPI553/data/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital Status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()
**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**

Characteristic

N

N = 5,0001

Mentally unhealthy days (past 30)

5,000

3.8 (7.9)

Age (years)

5,000

54.9 (17.5)

Sex

5,000

Male

2,303 (46%)

Female

2,697 (54%)

General health status

5,000

Excellent

1,065 (21%)

Very good

1,803 (36%)

Good

1,426 (29%)

Fair

523 (10%)

Poor

183 (3.7%)

Marital Status

5,000

Married

2,708 (54%)

Divorced

622 (12%)

Widowed

534 (11%)

Separated

109 (2.2%)

Never married

848 (17%)

Unmarried couple

179 (3.6%)

1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

The poor general health status group reports the most mentally unhealthy days. The pattern in the boxplot appears consistent with what you would expect.

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

The separated marital status group has he highest mean and the widowed has the lowest mean of mentally unhealthy days.

brfss_dv |>
  group_by(marital_status) |>
  summarize(
    mean_unhealthydays = mean(menthlth_days, na.rm = TRUE)
  )
## # A tibble: 6 × 2
##   marital_status   mean_unhealthydays
##   <fct>                         <dbl>
## 1 Married                        3.10
## 2 Divorced                       4.49
## 3 Widowed                        2.67
## 4 Separated                      6.22
## 5 Never married                  5.28
## 6 Unmarried couple               6.07

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

The coefficient for gen_health_numeric is 1.78. Which indicates that for each one unit (status) increase in general health status the mean number of mentally unhealthy days increase by 1.78 days

#create numeric version
brfss_dv <- brfss_dv |>
  mutate(
    gen_health_numeric= case_when(
      gen_health == "Excellent" ~1,
      gen_health == "Very Good" ~2,
      gen_health == "Good"      ~3,
      gen_health == "Fair"      ~4,
      gen_health == "Poor"      ~5
    )
  )

#fit simple linear regression
naive_mod <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: General Health Status Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Naive Model: General Health Status Treated as Continuous
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -0.2899 0.3404 -0.8516 0.3945 -0.9574 0.3776
gen_health_numeric 1.7842 0.1175 15.1859 0.0000 1.5538 2.0145

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

The factor version uses 4 coefficients instead of 1 because the naive approach regression model assumes there are equally spaced, linear increments between the groups (excellent, very good, good, fair, and poor). The factor version treats one category as a reference variable, in this case excellent, and then 4 coefficients are given that represent the difference in mentally unhealthy days between the reference (excellent) and that category. The naive approach is misleading because for ordinal categories the assumption of equal spacing between the groups is not justified.

mod_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)
tidy(mod_factor, conf.int = TRUE)
## # A tibble: 5 × 7
##   term                estimate std.error statistic  p.value conf.low conf.high
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            2.12      0.233      9.08 1.55e-19   1.66        2.57
## 2 gen_healthVery good    0.590     0.294      2.01 4.48e- 2   0.0137      1.17
## 3 gen_healthGood         1.95      0.308      6.34 2.54e-10   1.35        2.56
## 4 gen_healthFair         5.06      0.406     12.5  4.23e-35   4.27        5.86
## 5 gen_healthPoor         9.66      0.609     15.9  2.34e-55   8.47       10.9

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

MentalHealthDays= 9.593 + -0.0867(age) + 1.7254(female) + 0.2314(PhysDays) + -0.5866(Sleep) + 0.7899(VeryGood) + 1.8436(Good) + 3.3953(Fair) + 5.3353(Poor)

# Fit model with gen health as a factor 
mod_health <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
               data = brfss_dv)

tidy(mod_health, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with General Health Dummy Variables (Reference: Excellent)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model with General Health Dummy Variables (Reference: Excellent)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 9.5930 0.6304 15.2163 0.0000 8.3570 10.8289
age -0.0867 0.0060 -14.4888 0.0000 -0.0984 -0.0749
sexFemale 1.7254 0.2055 8.3971 0.0000 1.3226 2.1282
physhlth_days 0.2314 0.0162 14.3057 0.0000 0.1997 0.2631
sleep_hrs -0.5866 0.0766 -7.6607 0.0000 -0.7367 -0.4365
gen_healthVery good 0.7899 0.2797 2.8247 0.0048 0.2417 1.3382
gen_healthGood 1.8436 0.2973 6.2020 0.0000 1.2608 2.4264
gen_healthFair 3.3953 0.4180 8.1234 0.0000 2.5759 4.2147
gen_healthPoor 5.3353 0.6829 7.8122 0.0000 3.9965 6.6742

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

Very Good (B= 0.7899): Compared to those with excellent general health, individuals with very good health report an estimated 0.7899 more mentally unhealthy days, holding all other variables constant. Good (B= 1.8436): Compared to those with excellent general health, individuals with good health report an estimated 1.8436 more mentally unhealthy days, holding all other variables constant. Fair (B= 3.3953): Compared to those with excellent general health, individuals with fair health report an estimated 3.3953 more mentally unhealthy days, holding all other variables constant. Poor (B= 5.3353): Compared to those with excellent general health, individuals with poor health report an estimated 5.3353 more mentally unhealthy days, holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

The poor general health status group differs most from the reference group.

coef_gen_health <- tidy(mod_factor, conf.int = TRUE) |>
  filter(grepl("gen_health", term))

ggplot(coef_gen_health, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  coord_flip() +
  labs(
    title = "Coefficient Plot for General Health (Reference: Excellent)",
    x = "General Health Status",
    y = "Estimated Effect on Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)


Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

# Change reference group to good 
brfss_dv$genhealth_reref <- relevel(brfss_dv$gen_health, ref = "Good")

mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + genhealth_reref,
                     data = brfss_dv)

tidy(mod_gen_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Same Model, Different Reference Group (Reference: Good)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 11.4366 0.6298 18.1584 0e+00 10.2019 12.6713
age -0.0867 0.0060 -14.4888 0e+00 -0.0984 -0.0749
sexFemale 1.7254 0.2055 8.3971 0e+00 1.3226 2.1282
physhlth_days 0.2314 0.0162 14.3057 0e+00 0.1997 0.2631
sleep_hrs -0.5866 0.0766 -7.6607 0e+00 -0.7367 -0.4365
genhealth_rerefExcellent -1.8436 0.2973 -6.2020 0e+00 -2.4264 -1.2608
genhealth_rerefVery good -1.0537 0.2581 -4.0819 0e+00 -1.5597 -0.5476
genhealth_rerefFair 1.5517 0.3861 4.0186 1e-04 0.7947 2.3087
genhealth_rerefPoor 3.4917 0.6506 5.3673 0e+00 2.2164 4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

The continuous variables stayed the same between the two models. They are the same because changing the reference group only changes the interpretation of the dummy variable coefficients not the model’s fit or predictions. The continuous variables are uneffected by the reference group change so their coefficients remain the same.

tribble(
  ~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
  "Intercept", round(coef(mod_health)[1], 3), round(coef(mod_gen_reref)[1], 3),
  "Age coefficient", round(coef(mod_health)[2], 3), round(coef(mod_gen_reref)[2], 3),
  "Sex coefficient", round(coef(mod_health)[3], 3), round(coef(mod_gen_reref)[3], 3),
  "Physical health days", round(coef(mod_health)[4], 3), round(coef(mod_gen_reref)[4], 3),
  "Sleep hours", round(coef(mod_health)[5], 3), round(coef(mod_gen_reref)[5], 3),
  "R-squared", round(summary(mod_health)$r.squared, 4), round(summary(mod_gen_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_health)$sigma, 3), round(summary(mod_gen_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparing Models with Different Reference Groups
Quantity Ref: Excellent Ref: Good
Intercept 9.5930 11.4370
Age coefficient -0.0870 -0.0870
Sex coefficient 1.7250 1.7250
Physical health days 0.2310 0.2310
Sleep hours -0.5870 -0.5870
R-squared 0.1694 0.1694
Residual SE 7.2080 7.2080

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

Changing the reference groups dont change the predictions instead it changes how we word our interpretation.

# Verify that predicted values are identical
pred_orig <- predict(mod_health)
pred_reref <- predict(mod_gen_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Verification: Predicted Values Are Identical
Check Value
Maximum absolute difference in predictions 0
Correlation between predictions 1

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

Reduced model R^2= 0.1522 Adjusted R^2= 0.1515

Full model R^2= 0.1694 Adjusted R^2= 0.1681

# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

glance(mod_reduced)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.152         0.152  7.28      224. 3.17e-177     4 -17018. 34047. 34087.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(mod_health)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.169         0.168  7.21      127. 8.21e-195     8 -16966. 33953. 34018.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

H0= gen health does not improve the model H1= gen health improves the model F-stat= 25.8838 df= 4 (residual= 4991) p-value= 0 The p-value is <0.001 therefore we reject the null hypothesis.

# Partial F-test
f_test <- anova(mod_reduced, mod_health)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Does General Health Improve the Model?
term df.residual rss df sumsq statistic p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs 4995 264715.2 NA NA NA NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health 4991 259335.4 4 5379.751 25.8838 0

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

They are consistent.

Anova(mod_health, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Type III ANOVA: Testing Each Predictor’s Contribution
term sumsq df statistic p.value
(Intercept) 12030.737 1 231.5357 0
age 10907.874 1 209.9258 0
sex 3663.847 1 70.5120 0
physhlth_days 10633.920 1 204.6535 0
sleep_hrs 3049.400 1 58.6868 0
gen_health 5379.751 4 25.8838 0
Residuals 259335.435 4991 NA NA

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

  • Identify which general health groups differ most from the reference
  • State the direction and approximate magnitude of the association
  • Appropriately acknowledge the cross-sectional nature of the data
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

The poor general health group differs most from the excellent (reference) group, therefore individuals with poor general health status tend to experience more mentally unhealthy days than those with excellent general health. Compared to those with excellent general health, individuals with poor health report an estimated 5.3353 more mentally unhealthy days, holding all other variables constant. Overall, as general health worsens, the number of mentally unhealthy days increases. Due to this being a cross sectional study with data being collected from a single point in time, we cannot assume causality (that poor health causes poor mental health) only correlation.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

General health status appears to be more strongly associated with mental health days. The difference in mentally unhealthy days is much larger for gen health status in comparison to the difference across education groups. Those in poor health report approximately 5 more mentally unhealthy days in comparison to those with excellent health. Due to the stronger relationship between general health and mental health I would choose to include this relationship when building a final model.

End of Lab Activity