Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

  • Why numeric coding of categories fails
  • How to construct dummy variables for dichotomous and multichotomous predictors
  • The concept of the reference group and how to change it
  • Interpreting dummy variable coefficients
  • Testing whether a categorical variable as a whole is significant (partial F-test)
  • Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_rds(
  "/Users/jingjunyang/Desktop/EPI553 Project/brfss_dv_2020.rds"
) |>
  clean_names()

In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026


Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
gen_health General health status (5 categories) Factor
marital_status Marital status (6 categories) Factor
educ_numeric Education as numeric code (1–4) Numeric
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "/Users/jingjunyang/Desktop/EPI553 Project/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_full |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      gen_health   ~ "General Health status",
      marital_status ~"Marital Status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |> 
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics - BRFSS 2020 Analytic Sample (n= 5,000)**") |>
  as_flex_table()
**Table 1. Descriptive Statistics - BRFSS 2020 Analytic Sample (n= 5,000)**

Characteristic

N

N = 5,0001

Mentally unhealthy days (past 30)

5,000

3.8 (7.9)

Age (years)

5,000

54.9 (17.5)

Sex

5,000

Male

2,303 (46%)

Female

2,697 (54%)

General Health status

5,000

Excellent

1,065 (21%)

Very good

1,803 (36%)

Good

1,426 (29%)

Fair

523 (10%)

Poor

183 (3.7%)

Marital Status

5,000

Married

2,708 (54%)

Divorced

622 (12%)

Widowed

534 (11%)

Separated

109 (2.2%)

Never married

848 (17%)

Unmarried couple

179 (3.6%)

1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_full, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

## People reported “Poor” general health status have the most mentally unhealthy days. The pattern appear consistent with what I was expecting.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

ggplot(brfss_full, aes(x = marital_status, y = menthlth_days, fill = marital_status)) + stat_summary(fun = "mean", geom = "bar", alpha = 0.8) +
  scale_fill_brewer(palette = "Blues")+
  labs(
    title = "Mentally Unhealthy Days by marital status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Marital Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

## The seperated group has the highest mean number of mentally unhealthy days.The widowed group has the lowest mean number of mentally unhealthy days.


Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

#Create a numeric version of gen_health
brfss_numeric <- brfss_full |>
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~1,
      gen_health == "Very good" ~2,
      gen_health == "Good"      ~3,
      gen_health == "Fair"      ~4,
      gen_health == "Poor"      ~5,
      TRUE                     ~NA_real_  
    )
  )

#Fit simple regression model
model_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_numeric)

summary(model_naive)
## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6718     0.2705  -2.484    0.013 *  
## gen_health_numeric   1.8578     0.1036  17.926   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared:  0.06041,    Adjusted R-squared:  0.06022 
## F-statistic: 321.3 on 1 and 4998 DF,  p-value: < 2.2e-16

The coefficient for gen_health is 1.8578, meaning that each one-unit increase in genral health category, the mean number of mentally unhealthy days increases by 1.8578 days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

#Fit simple regression model
model_factor <- lm(menthlth_days ~ gen_health, data = brfss_full)
summary(model_factor)
## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

## The factor version uses 4 coefficients because gen_health has 5 categories, and R uses one category (Excellent) as the reference group. So the remaining 4 categories (Very good, Good, Fair, Poor) each get their own coefficient showing how different they are from Excellent. The naive numeric approach is misleading because it assumes the gap between each category is equal. For example, it treats Excellent to Very good as the same distance as Fair to Poor. In reality, these gaps may not be equal, so using a single number to represent all of them can give a misleading result.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

model_full <- lm(menthlth_days ~age + sex + physhlth_days + sleep_hrs + gen_health,
                 data=brfss_full)
summary(model_full)
## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

Predicted mentally unhealthy days = 9.592982 - 0.0867(age)+1.73(sexfemale)+0.23(physhlth_days)-0.58(sleep_hrs)+0.80(gen_health_very good) + 1.84(gen_healthgood)+3.40(gen_health:fair)+5.33(gen_healthpoor)

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.” ## “very good”(0.79): Individual with very good general health reports 0.79 more mentally unhealthy days compared to those with excellent health, holding all other variables constant. Good(1.84): Individual with good general health reports 1.84 more mentally unhealthy days compared to those with excellent health, holding all other variables constant. Fair(3.40): Individual with fair general health reports 3.4 more mentally unhealthy days compared to those with excellent health, holding all other variables constant. Poor(5.33): Individual with poor general health reports 5.3 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

coef_data <- tidy(model_full, conf.int = TRUE) %>%
  filter(grepl("gen_health", term))

coef_data <- coef_data %>%
  mutate(
    term = case_when(
      term == "gen_healthVery good" ~ "Very good",
      term == "gen_healthGood"      ~ "Good",
      term == "gen_healthFair"      ~ "Fair",
      term == "gen_healthPoor"      ~ "Poor",
      TRUE ~ term
    ))

ggplot(coef_data, aes(x= estimate, y= term)) +
  geom_point() +
  geom_errorbarh(aes(xmin =conf.low, xmax= conf.high), height =0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    x = "Estimated Difference in Mentally Unhealthy Days",
    y = "General Health Category",
    title = "Effect of General Health on Mentally Unhealthy Days"
  )

## The group that differs most from the reference group is the “Poor” general health group, about 5.3 more mentally unhealthy days, holding all other variable constant. —

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_full$gen_health <- relevel(brfss_full$gen_health, ref = "Good")

model_refit <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
                     data = brfss_full)

tidy(model_refit, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Same Model, Different Reference Group (Reference: Good)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 11.4366 0.6298 18.1584 0e+00 10.2019 12.6713
age -0.0867 0.0060 -14.4888 0e+00 -0.0984 -0.0749
sexFemale 1.7254 0.2055 8.3971 0e+00 1.3226 2.1282
physhlth_days 0.2314 0.0162 14.3057 0e+00 0.1997 0.2631
sleep_hrs -0.5866 0.0766 -7.6607 0e+00 -0.7367 -0.4365
gen_healthExcellent -1.8436 0.2973 -6.2020 0e+00 -2.4264 -1.2608
gen_healthVery good -1.0537 0.2581 -4.0819 0e+00 -1.5597 -0.5476
gen_healthFair 1.5517 0.3861 4.0186 1e-04 0.7947 2.3087
gen_healthPoor 3.4917 0.6506 5.3673 0e+00 2.2164 4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not? ## The coefficient for the variable including age, sex, physhlth_days, sleep_hrs remained the same. This is because changing the reference group only affects how the dummy variables are compared, but not the relationship the relationship between the other variables and the outcome. However,the coefficient of gen_health categories changed because they are now compared to different reference group. For example, instead of comparing each category to Excellent, they are now compared to the new reference group. The differences between categories remain the same.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_original <- predict(model_full)
pred_new <- predict(model_refit)

cor(pred_original, pred_new)
## [1] 1

The correlation between the predicted values from both models is 1, indicating that the two sets of predictions are perfectly identical. This makes sense because changing the reference group does not change the model. It only changes which category is used for comparison.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
                    data=brfss_full)
summary(model_reduced)
## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
##     data = brfss_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.314  -3.520  -1.805   0.159  32.013 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.504082   0.612374  17.153  < 2e-16 ***
## age           -0.077335   0.005962 -12.972  < 2e-16 ***
## sexFemale      1.686413   0.207440   8.130 5.38e-16 ***
## physhlth_days  0.317247   0.013210  24.015  < 2e-16 ***
## sleep_hrs     -0.633026   0.077171  -8.203 2.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1515 
## F-statistic: 224.2 on 4 and 4995 DF,  p-value: < 2.2e-16
data.frame(
  Model = c("Reduced", "Full"),
  R_squared = c(
    summary(model_reduced)$r.squared,
    summary(model_full)$r.squared
  ),
  Adj_R_squared = c(
    summary(model_reduced)$adj.r.squared,
    summary(model_full)$adj.r.squared
  )
)
##     Model R_squared Adj_R_squared
## 1 Reduced 0.1521948     0.1515159
## 2    Full 0.1694246     0.1680933

##The full model has a higher R-squared (0.169) compared to the reduced model (0.152), indicating that including general health increases the proportion of variance explained in mentally unhealthy days. The adjusted R-squared also increases from 0.152 to 0.168, suggesting that the additional variables model fit. Overall, this indicates that general health is an important predictor of mentally unhealthy days.

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4995 264715                                  
## 2   4991 259335  4    5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis(H0): All coefficients for the gen_health dummy variables are equal to 0 β very good = β good = βfair = β poor = 0. Alternative hypothesis(H1): At least one of the gen_health coefficients is not equal to 0. Df=4 and 4991, p-value < 2.2e-16, F-statistic:25.884. Since the p-value is <0.05, we reject the null hypothesis. This indicates that general health significantly imporve the model and should be included as a predictor of mentally unhealthy days.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

car::Anova(model_full,type = "III") 
## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df F value    Pr(>F)    
## (Intercept)    12031    1 231.536 < 2.2e-16 ***
## age            10908    1 209.926 < 2.2e-16 ***
## sex             3664    1  70.512 < 2.2e-16 ***
## physhlth_days  10634    1 204.654 < 2.2e-16 ***
## sleep_hrs       3049    1  58.687 2.207e-14 ***
## gen_health      5380    4  25.884 < 2.2e-16 ***
## Residuals     259335 4991                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The result is consistent with the results from the partial F test (F-statistic=25.88), which indicates that gen_health significantly improves the model.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

  • Identify which general health groups differ most from the reference
  • State the direction and approximate magnitude of the association
  • Appropriately acknowledge the cross-sectional nature of the data
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”) ## People with worse general health tend to report more mentally unhealthy days compared to those with excellent health. The difference is largest for individuals reporting poor health, who experience several more mentally unhealthy days on average than those in excellent health. Those with fair and good health also report higher numbers of mentally unhealthy days, but the increase is smaller compared to the poor health group. Because this is a cross-sectional study, we can describe an association but cannot conclude that general health status causes changes in mental health days.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison. ##General health seems more strongly related to mentally unhealthy days than education. The differences between health groups are larger and more consistent than the differences between education levels. People with poorer general health report many more mentally unhealthy days compared to those in excellent health, while education shows smaller changes. If I were choosing a final model, I would compare how much each variable improves the model fit and also think about which one makes more sense for explaining mental health.


End of Lab Activity