Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Store the raw education numeric code for the "naive approach" demonstration
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)
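# Machine-specific working directory: change this path to your own project folder
setwd("C:/Users/Alexandra/Desktop/Epi 553/Labs")

# Save the analytic sample for the lab activity, then reload it
saveRDS(brfss_dv, "brfss_dv_2020.rds")
brfss_dv <- readRDS("brfss_dv_2020.rds")
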
tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
¹ Mean (SD); n (%)

ggplot(brfss_dv, aes(x = education, fill = education)) +
  geom_bar(alpha = 0.85) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Education Level in Analytic Sample

ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Education Level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Education Level",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by Education Level

Part 1: Guided Practice — Dummy Variables

1. Categorical Variables: The Problem

1.1 Types of Categorical Variables

Categorical predictor variables come in two forms:

Type	Definition	Examples
Nominal	Categories with no natural ordering	Sex, race/ethnicity, marital status, blood type
Ordinal	Categories with a meaningful order	Education level, income bracket, disease stage, Likert scale

A further distinction is:

Dichotomous (k = 2): Only two categories (e.g., sex: Male/Female)
Multichotomous (k > 2): Three or more categories (e.g., education: 4 levels)

Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.

1.2 The Naive Approach: Why Numeric Codes Fail

Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.

If we include this numeric code directly in a regression model, we are assuming:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]

This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
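
To make the constraint explicit, compare the expected outcome at two adjacent codes \(j\) and \(j + 1\) while holding \(X_1\) fixed. This short derivation uses only the model written above:

\[E(Y \mid X_1, \text{educ_numeric} = j + 1) - E(Y \mid X_1, \text{educ_numeric} = j) = \beta_2 \quad \text{for } j = 1, 2, 3\]

The same \(\beta_2\) applies to every step, so the three between-level differences are forced to be equal no matter what the data show.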

# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Education Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Education Treated as Continuous
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	9.5601	0.5039	18.9727	8.5723	10.5479
age	-0.0661	0.0063	-10.5135	-0.0784	-0.0538
educ_numeric	-0.7168	0.1158	-6.1917	-0.9437	-0.4898

This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:

For nominal variables (like race or marital status), numeric codes are entirely arbitrary. The “distance” between code 1 and code 2 has no meaning.
Even for ordinal variables (like education), the assumption of equal spacing is rarely justified. The jump from “Less than HS” to “HS graduate” is substantively different from “Some college” to “College graduate.”

Let’s visualize why this matters:

# Compute observed group means
group_means <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))

# Generate predictions from the naive model
pred_naive <- tibble(
  educ_numeric = 1:4,
  predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)

ggplot() +
  geom_point(data = group_means,
             aes(x = educ_numeric, y = mean_days),
             size = 4, color = "steelblue") +
  geom_line(data = pred_naive,
            aes(x = educ_numeric, y = predicted),
            color = "tomato", linewidth = 1.2, linetype = "dashed") +
  geom_point(data = pred_naive,
             aes(x = educ_numeric, y = predicted),
             size = 3, color = "tomato", shape = 17) +
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
  ) +
  labs(
    title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
    subtitle = "The naive model forces equal spacing between education levels",
    x = "Education Level",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Naive Linear Fit vs. Actual Group Means by Education

Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.

2. Dummy (Indicator) Variables

2.1 Definition and the k - 1 Rule

A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:

1 to indicate the presence of a condition
0 to indicate its absence

If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).

Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.

2.2 The Dichotomous Case (k = 2)

The simplest example is a variable with two categories, such as sex.

With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:

\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]

The regression model becomes:

\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]

For males (\(\text{male} = 1\)): \[E(Y \mid \text{age}, \text{male} = 1) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]

For females (\(\text{male} = 0\)): \[E(Y \mid \text{age}, \text{male} = 0) = \beta_0 + \beta_1 \cdot \text{age}\]

Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.

# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)

tidy(mod_sex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: Sex",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: Sex
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	6.6262	0.3730	17.7666	5.8951	7.3574
age	-0.0698	0.0063	-11.1011	-0.0821	-0.0575
sexFemale	1.8031	0.2210	8.1585	1.3698	2.2364

b_sex <- round(coef(mod_sex), 3)

Interpretation:

Intercept (6.626): The estimated mean number of mentally unhealthy days for a male of age 0. (This is a mathematical artifact, not a substantive value.)
Age (-0.07): Each additional year of age is associated with an estimated 0.07 fewer mentally unhealthy days, holding sex constant.
Sex: Female (1.803): Compared to males (the reference group), females report an estimated 1.803 more mentally unhealthy days on average, holding age constant.

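Note that R automatically creates dummy variables when a factor is included in lm(). It uses the first level of the factor as the reference group. Levels default to alphabetical order unless they are set explicitly, as they were here with levels = c(1, 2), which is why Male (the first level) is the reference. This is the reverse of the algebraic example above, where Female was the reference, so the fitted dummy is labeled sexFemale rather than male.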

pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))

ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Sex",
    subtitle = "Parallel lines: same slope, different intercepts",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Parallel Regression Lines: Males vs. Females

Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.

3. Multichotomous Dummy Variables (k > 2)

3.1 Constructing the Dummies

Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:

\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]

The data matrix looks like this:

Dummy Variable Encoding for Education (Reference: Less than HS)
Observation	Education	HS_graduate	Some_college	College_graduate
1	Less than HS	0	0	0
2	HS graduate	1	0	0
3	Some college	0	1	0
4	College graduate	0	0	1
5	Less than HS	0	0	0

Notice that the reference group is identified by having all dummy variables equal to zero.
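
The same encoding can be inspected directly in R. The sketch below uses a hypothetical five-row factor that mirrors the table above (it is not drawn from the BRFSS data) and calls base R's model.matrix(), which builds the design matrix that lm() would use for a formula.

```r
# Hypothetical five-row example mirroring the table above (not BRFSS data)
education <- factor(
  c("Less than HS", "HS graduate", "Some college", "College graduate", "Less than HS"),
  levels = c("Less than HS", "HS graduate", "Some college", "College graduate")
)

# model.matrix() returns an intercept column plus k - 1 = 3 dummy columns
X <- model.matrix(~ education)
print(X)

# Rows belonging to the reference level have zeros in every dummy column
dummy_cols <- X[, -1]
rowSums(dummy_cols)[education == "Less than HS"]
```

Because the first factor level is the reference, reordering the levels (or using relevel()) changes which category gets the all-zero rows.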

3.2 The Reference Group

The reference group is the category against which all others are compared. Key points:

The choice of reference group affects the interpretation of coefficients but not the model’s predictive ability
The predicted values are identical regardless of which reference group is chosen
Choose the reference group based on your research question (e.g., the most common category, the “control” condition, or the group of primary interest)
You can always test differences between any pair of groups, not just comparisons to the reference

3.3 Fitting the Model in R

When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.

# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
               data = brfss_dv)

tidy(mod_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Education Dummy Variables (Reference: Less than HS)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Education Dummy Variables (Reference: Less than HS)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

b_educ <- round(coef(mod_educ), 3)
ci_educ <- round(confint(mod_educ), 3)

3.4 Interpreting Each Dummy Coefficient

The model is:

\[\widehat{\text{Mental Health Days}} = 11.138 - 0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) - 0.628(\text{Sleep}) - 0.587(\text{HS grad}) - 0.129(\text{Some college}) - 1.143(\text{College grad})\]

Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:

HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.

Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
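
As a worked example using the estimates in the coefficient table above, the adjusted difference between HS graduates and college graduates is:

\[\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}} = -0.5873 - (-1.1429) = 0.5556\]

That is, HS graduates report an estimated 0.556 more mentally unhealthy days than college graduates, holding the other predictors constant. The standard error of this difference cannot be read from the table, because it also depends on the covariance between the two estimates.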

3.5 Visualizing the Parallel Lines

pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Education",
    subtitle = "Parallel lines: same slopes for age, different intercepts by education",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Predicted Mental Health Days by Age and Education Level

The plot shows a set of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.

4. Changing the Reference Group

4.1 Using `relevel()` in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.

# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")

mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
                     data = brfss_dv)

tidy(mod_educ_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: College graduate)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: College graduate)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.9948	0.6272	15.9349	0.0000	8.7652	11.2245
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
education_rerefLess than HS	1.1429	0.4607	2.4805	0.0132	0.2396	2.0461
education_rerefHS graduate	0.5556	0.2574	2.1586	0.0309	0.0510	1.0601
education_rerefSome college	1.0139	0.2566	3.9507	0.0001	0.5108	1.5171

4.2 What Changes and What Stays the Same?

tribble(
  ~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
  "Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
  "Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
  "Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
  "Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
  "Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
  "R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Less than HS	Ref: College graduate
Intercept	11.1380	9.9950
Age coefficient	-0.0770	-0.0770
Sex coefficient	1.6810	1.6810
Physical health days	0.3110	0.3110
Sleep hours	-0.6280	-0.6280
R-squared	0.1553	0.1553
Residual SE	7.2690	7.2690

What changes:

The intercept
The education dummy variable coefficients (they now represent comparisons to the new reference group)

What stays the same:

Coefficients for age, sex, physical health days, and sleep
\(R^2\), Adjusted \(R^2\), and Residual Standard Error
All predicted values
The overall F-test

This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
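
The two parameterizations are linked by simple arithmetic. Using the estimates from the two coefficient tables above, the new intercept is the old intercept plus the old College graduate coefficient, and each new education coefficient is the old coefficient for that group minus the old College graduate coefficient (with 0 for the old reference group):

\[\begin{aligned}
\text{Intercept:} \quad & 11.1377 + (-1.1429) = 9.9948 \\
\text{Less than HS:} \quad & 0 - (-1.1429) = 1.1429 \\
\text{HS graduate:} \quad & -0.5873 - (-1.1429) = 0.5556 \\
\text{Some college:} \quad & -0.1289 - (-1.1429) \approx 1.0139
\end{aligned}\]

The last line agrees with the table up to rounding of the displayed estimates.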

# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

5. The Dummy Variable Trap (Perfect Multicollinearity)

5.1 Why We Cannot Include All k Dummies

If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:

\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]

where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.

This is called the dummy variable trap.

The Dummy Variable Trap: Intercept = A + B + C (perfect linear dependence)
Obs	Intercept	A	B	C	A + B + C
1	1	1	0	0	1
2	1	0	1	0	1
3	1	0	0	1	1
4	1	1	0	0	1
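
The linear dependence can be checked numerically. The sketch below uses simulated data (the group variable g and outcome y are hypothetical, not BRFSS variables). Note that R's lm() does not stop with an error in this situation: it detects the redundant column and reports NA for its coefficient.

```r
# Simulated data for illustration only (hypothetical group g and outcome y)
set.seed(553)
g <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60)

# One 0/1 column per category (no intercept), then add an intercept column
D <- model.matrix(~ 0 + g)        # columns gA, gB, gC
X <- cbind(Intercept = 1, D)

# Four columns, but only three are linearly independent
qr(X)$rank

# Intercept plus all k = 3 dummies: one coefficient is returned as NA
fit_trap <- lm(y ~ D)
coef(fit_trap)
```

The rank of 3 for a four-column matrix is the numerical counterpart of the identity Intercept = A + B + C shown in the table.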

Solutions:

Reference cell coding (default in R): Include \(k - 1\) dummies; the omitted category is absorbed into the intercept
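Remove the intercept: Fit the model with - 1 in the formula. R then gives a full set of dummies to the first factor that appears in the formula, so the categorical variable of interest must be listed first to receive all \(k\) dummies. Each of its coefficients is then that group's own intercept (the expected outcome for that group when the other predictors are zero or at their reference levels) rather than a difference from a reference.

# Model without intercept.
# Caution: sex is the first factor in this formula, so sex (not education)
# receives the full set of dummies; education still gets k - 1.
# Listing education first would give one coefficient per education level.
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
                 data = brfss_dv)

tidy(mod_no_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model Without Intercept: Both Sex Levels Included, k - 1 Education Dummies",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Without Intercept: Both Sex Levels Included, k - 1 Education Dummies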
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexMale	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
sexFemale	12.8190	0.7524	17.0365	0.0000	11.3439	14.2941
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

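Reading the output: sex is the first factor in the formula, so it received both of its levels (sexMale and sexFemale), while education kept \(k - 1\) dummies with Less than HS as the reference. The sexMale coefficient (11.1377) equals the intercept of the reference-coded model, and sexFemale (12.8190) equals that intercept plus the earlier sexFemale coefficient (11.1377 + 1.6813).

Caution: Removing the intercept changes the interpretation of \(R^2\), because R then measures total variation around zero rather than around the mean of the outcome, so the value is not comparable to the \(R^2\) of a model with an intercept. Remove the intercept only when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.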
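
Which factor receives the full set of dummies in a no-intercept model depends on the order of terms in the formula. The sketch below illustrates this with simulated data; the variables g, s, x, and y are hypothetical and are not BRFSS variables.

```r
# Simulated data for illustration only; g, s, x, and y are hypothetical
set.seed(553)
g <- factor(rep(c("A", "B", "C"), times = 30))   # 3-level factor of interest
s <- factor(rep(c("F", "M"), times = 45))        # 2-level factor
x <- rnorm(90)
y <- 2 + as.integer(g) + 0.5 * (s == "M") + 0.3 * x + rnorm(90)

# Factor of interest listed first: g receives all 3 dummies
fit_g_first <- lm(y ~ 0 + g + s + x)
names(coef(fit_g_first))

# Other factor listed first: s receives both levels, g keeps k - 1 dummies
fit_s_first <- lm(y ~ 0 + s + g + x)
names(coef(fit_s_first))

# Both are the same model in different parameterizations
all.equal(unname(fitted(fit_g_first)), unname(fitted(fit_s_first)))
```

The fitted values agree, so the choice only affects which coefficients are reported directly.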

6. Testing Whether a Categorical Variable is Significant

6.1 The Partial F-Test (Type I and Type III)

When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.

To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):

\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]

This compares the full model (with education) to a reduced model (without education):

# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

# Partial F-test
f_test <- anova(mod_reduced, mod_educ)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does Education Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does Education Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education	4992	263744.4	3	970.7509	6.1246	4e-04
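
The F statistic in the table can be reproduced by hand from the two residual sums of squares and their degrees of freedom:

\[F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}}) / (df_{\text{reduced}} - df_{\text{full}})}{\text{RSS}_{\text{full}} / df_{\text{full}}} = \frac{970.7509 / 3}{263744.4 / 4992} \approx \frac{323.58}{52.83} \approx 6.12\]

The numerator has 3 degrees of freedom because education contributes \(k - 1 = 3\) dummy variables, and the denominator is the residual mean square of the full model.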

6.2 Using `car::Anova()` for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:

Anova(mod_educ, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12000.1867	1	227.1325	0e+00
age	8863.3522	1	167.7603	0e+00
sex	3469.6448	1	65.6714	0e+00
physhlth_days	28765.1139	1	544.4492	0e+00
sleep_hrs	3506.1243	1	66.3619	0e+00
education	970.7509	3	6.1246	4e-04
Residuals	263744.4348	4992	NA	NA

Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
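
The order dependence of sequential sums of squares is easy to see with two correlated predictors. The sketch below uses simulated data (a, b, and y are hypothetical) and base R only. It uses drop1() to test each term after all the others; in a model with no interaction terms, that matches the partial (Type III) test.

```r
# Simulated data for illustration only; a, b, and y are hypothetical
set.seed(553)
n <- 200
a <- rnorm(n)
b <- 0.6 * a + rnorm(n)                 # b is correlated with a
y <- 1 + 0.5 * a + 0.5 * b + rnorm(n)

# Sequential (Type I) sum of squares for a depends on where a enters
ss_a_first <- anova(lm(y ~ a + b))["a", "Sum Sq"]   # a entered first
ss_a_last  <- anova(lm(y ~ b + a))["a", "Sum Sq"]   # a entered after b

# Partial sum of squares for a: a tested after b, regardless of order
ss_a_partial <- drop1(lm(y ~ a + b), test = "F")["a", "Sum of Sq"]

c(first = ss_a_first, last = ss_a_last, partial = ss_a_partial)
```

The sequential value for a when it enters last equals the partial value, while the value when it enters first differs because it is not adjusted for b.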

7. Contrasts and Alternative Coding Schemes

7.1 Reference (Treatment) Coding — The Default

This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.

contrasts(brfss_dv$education)

##                  HS graduate Some college College graduate
## Less than HS               0            0                0
## HS graduate                1            0                0
## Some college               0            1                0
## College graduate           0            0                1

7.2 Effect (Deviation) Coding

In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.

# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)

mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
                 data = brfss_dv)

tidy(mod_effect, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
      str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
      str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Effect Coding: Each Education Coefficient vs. Grand Mean
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Less than HS vs. Grand Mean	0.4648	0.3323	1.3988	0.1619	-0.1866	1.1162
Education: HS graduate vs. Grand Mean	-0.1225	0.1939	-0.6319	0.5275	-0.5026	0.2576
Education: Some college vs. Grand Mean	0.3358	0.1946	1.7257	0.0845	-0.0457	0.7174

With effect coding, the intercept is the unweighted average of the four education-group intercepts (for the reference sex, Male, with the continuous covariates at zero), and each education coefficient shows how far that group deviates from this average. The deviation for the omitted group (College graduate) is the negative sum of the others.
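
As a worked example using the estimates in the table above, the deviation for College graduate (written here as \(\hat{\delta}_{\text{College grad}}\), a symbol introduced only for this illustration) is:

\[\hat{\delta}_{\text{College grad}} = -\left(0.4648 + (-0.1225) + 0.3358\right) = -0.6781\]

This agrees with the reference-coded results: the intercept of the model with College graduate as the reference (9.9948) minus the effect-coded intercept (10.6729) is also \(-0.6781\).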

7.3 Polynomial (Orthogonal) Contrasts for Ordinal Variables

When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.

# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)

mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
              data = brfss_dv)

tidy(mod_ord, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "\\.L$") ~ "Education: Linear trend",
      str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
      str_detect(term, "\\.C$") ~ "Education: Cubic trend",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Linear trend	-0.6642	0.3158	-2.1028	0.0355	-1.2833	-0.0450
Education: Quadratic trend	-0.2133	0.2682	-0.7954	0.4264	-0.7391	0.3125
Education: Cubic trend	-0.5630	0.2142	-2.6282	0.0086	-0.9830	-0.1431

Interpretation:

The linear contrast tests whether there is a consistent trend (higher education = fewer/more days)
The quadratic contrast tests whether the relationship curves (U-shaped or inverted-U)
The cubic contrast tests for an S-shaped pattern

Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
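
To see what these contrasts look like, the short sketch below prints the matrix that contr.poly(4) assigns to a four-level factor and checks two of its properties: each column sums to zero, and the columns are mutually orthogonal. The row labels are added here only for readability. Note that the default scores treat the levels as equally spaced, which is an assumption worth keeping in mind for education categories.

```r
# Contrast matrix for a 4-level ordinal factor
poly4 <- contr.poly(4)
rownames(poly4) <- c("Less than HS", "HS graduate", "Some college", "College graduate")
round(poly4, 3)

# Each column (.L, .Q, .C) sums to zero
round(colSums(poly4), 10)

# Columns are orthonormal: the cross-product is the identity matrix
round(crossprod(poly4), 10)
```

Because the columns are orthogonal, the linear, quadratic, and cubic terms describe separate pieces of the overall pattern across the ordered levels.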

7.4 Coding Scheme Comparison Summary

Summary of Dummy Variable Coding Schemes
Coding Scheme	R Function	Intercept	Each β represents	Best for
Treatment (Reference)	contr.treatment (default)	Reference group mean	Difference from reference group	Group comparisons to baseline
Effect (Deviation)	contr.sum	Grand mean	Deviation from grand mean	ANOVA-style analyses
Polynomial (Ordinal)	contr.poly	Grand mean	Linear/quadratic/cubic trend	Ordinal variables where the shape of the trend matters

8. Practical Considerations

8.1 Choosing the Reference Group

Guidelines for choosing the reference group:

Most common category — maximizes the precision of comparisons (largest reference sample)
Control or baseline condition — natural in experimental or clinical settings
Epidemiologically meaningful comparator — the group to which you want to compare all others
Consistency with published literature — facilitates comparison across studies

8.2 When `as.factor()` Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:

# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)

# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)

# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
  ~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
  "Numeric (wrong)", 1, 1,
  "Factor (correct)", 3, 3
) |>
  kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Numeric vs. Factor Treatment of Categorical Variables
Model	Number of education coefficients	Degrees of freedom used
Numeric (wrong)	1	1
Factor (correct)	3	3
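
The counts in this table can be reproduced by counting coefficients directly. The sketch below uses a small simulated data frame (`sim`, `code`, and `y` are hypothetical names, not BRFSS variables) in which the outcome does not change by equal steps across the four codes.

```r
set.seed(553)
sim <- data.frame(code = sample(1:4, 300, replace = TRUE))
# Hypothetical, unevenly spaced group means for codes 1 to 4
sim$y <- c(6, 5.5, 5.4, 3)[sim$code] + rnorm(300)

fit_numeric <- lm(y ~ code, data = sim)          # one slope
fit_factor  <- lm(y ~ factor(code), data = sim)  # k - 1 = 3 dummies

# Number of coefficients excluding the intercept
n_numeric <- length(coef(fit_numeric)) - 1
n_factor  <- length(coef(fit_factor)) - 1
c(numeric = n_numeric, factor = n_factor)

# The numeric model is nested in the factor model, so an F-test
# can check whether the equal-spacing restriction is tenable
anova(fit_numeric, fit_factor)
```

The final anova() call is one possible way to ask whether the single-slope model loses information relative to the dummy-variable model.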

8.3 Comparing Non-Reference Groups

What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:

Option 1: Change the reference group with relevel().

Option 2: Compute the difference manually from the model output.

# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]

# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")

cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")

## Estimated difference (HS grad - College grad): 0.556 days

cat("F-statistic:", round(lin_test$F[2], 3), "\n")

## F-statistic: 4.66

cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## p-value: 0.0309

car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
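
The two options give the same answer. As a rough sketch of what the test computes, the code below uses simulated data and base R only (the data frame `sim` and its group means are hypothetical). It forms the difference between two non-reference coefficients, takes its standard error from the coefficient covariance matrix, and confirms that releveling reproduces the same estimate, standard error, and p-value.

```r
set.seed(553)
lv <- c("Less than HS", "HS graduate", "Some college", "College graduate")
sim <- data.frame(education = factor(sample(lv, 500, replace = TRUE), levels = lv))
# Hypothetical group means, in the order of lv
sim$y <- c(5, 4.5, 4.8, 3.9)[as.integer(sim$education)] + rnorm(500, sd = 2)

fit <- lm(y ~ education, data = sim)  # reference: "Less than HS"

# Option 2 by hand: difference of two non-reference coefficients
b  <- coef(fit)
V  <- vcov(fit)
hs <- "educationHS graduate"
cg <- "educationCollege graduate"
diff_est <- unname(b[hs] - b[cg])
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 Cov(b1, b2)
diff_se <- sqrt(V[hs, hs] + V[cg, cg] - 2 * V[hs, cg])
t_stat  <- diff_est / diff_se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Option 1: make "College graduate" the reference and read the HS coefficient
sim$education_ref <- relevel(sim$education, ref = "College graduate")
fit_ref <- lm(y ~ education_ref, data = sim)
ref_row <- summary(fit_ref)$coefficients["education_refHS graduate", ]

rbind(by_hand = c(diff_est, diff_se, t_stat, p_value), relevel = ref_row)
```

For a single contrast like this one, the F-statistic from linearHypothesis() is the square of the t-statistic computed here.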

Summary of Key Concepts

Concept	Key Point
Categorical predictors	Cannot be included as raw numeric codes in regression
Dummy variables	Binary (0/1) indicators; need \(k - 1\) for \(k\) categories
Reference group	The omitted category; all comparisons are relative to it
Changing reference	Use `relevel()`; predictions unchanged, interpretation changes
Partial F-test	Tests whether the categorical variable as a whole is significant
Dummy variable trap	Including \(k\) dummies + intercept = perfect multicollinearity
`as.factor()`	Required when categorical variable is stored as numeric
Coding schemes	Treatment (default), effect, polynomial — each answers a different question
Type III ANOVA	Preferred for unbalanced observational data
Linear hypothesis	`linearHypothesis()` tests comparisons between non-reference groups

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital status"      
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)") |>
  as_flex_table()

Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a boxplot of `menthlth_days` by `gen_health`. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by General Health

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by `marital_status`. Which marital status group has the highest mean? The lowest?

# Compute observed group means
groups_lab <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(marital_status))

ggplot(groups_lab, aes(x = marital_status, y = mean_days, fill = marital_status)) +
  geom_col(alpha = 0.85) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Marital Status",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_classic(base_size = 13) +
  theme(legend.position = "none")

Mean Mentally Unhealthy Days by Marital Status in Analytic Sample

Separated people have the highest mean number of mentally unhealthy days and widowed people have the lowest mean number.

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the `gen_health` variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: `menthlth_days ~ gen_health_numeric`. Report the coefficient and interpret it.

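# The WRONG way: treating general health as a continuous numeric variable
# as.numeric() returns the factor's level codes (here Excellent = 1 through Poor = 5, given the level order)
naive_mod <- lm(menthlth_days ~ as.numeric(gen_health), data = brfss_dv)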

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: General Health Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: General Health Treated as Continuous
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-0.6718	0.2705	-2.4840	0.013	-1.2021	-0.1416
as.numeric(gen_health)	1.8578	0.1036	17.9259	0.000	1.6547	2.0610

In the naive model, each one-category step down in general health (e.g., from Excellent to Very good) is associated with an average increase of 1.86 mentally unhealthy days. The model assumes this increase is the same for every step.

2b. (5 pts) Now fit the same model but treating `gen_health` as a factor: `menthlth_days ~ gen_health`. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

mod <- lm(menthlth_days ~ gen_health, data=brfss_dv)


tidy(mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Categorical Model: General Health Treated as Categorical",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Categorical Model: General Health Treated as Categorical
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

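The factor version uses 4 coefficients because gen_health has 5 levels, and a factor with \(k\) levels enters the model as \(k - 1\) dummy variables; “Excellent” is the reference group, so each coefficient is that group's mean difference from “Excellent.” The naive numeric model may be misleading because it forces every one-category step to have the same effect (1.86 days), while the factor model shows that the steps are far from uniform: about 0.59 days from Excellent to Very good, 1.36 from Very good to Good, 3.11 from Good to Fair, and 4.60 from Fair to Poor.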
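
One way to see the uneven spacing is to difference the estimates from the factor-model table. The snippet below is a small illustrative sketch. It hard-codes the reported coefficients (with 0 for the Excellent reference group) rather than refitting the model.

```r
# Coefficients from the factor model table; the reference group (Excellent) is 0
est <- c(Excellent = 0, `Very good` = 0.5903, Good = 1.9535,
         Fair = 5.0624, Poor = 9.6640)

# Step between each pair of adjacent categories
steps <- diff(est)
round(steps, 4)

# The naive model forces every step to equal its single slope
naive_slope <- 1.8578
round(steps - naive_slope, 4)
```

Positive values in the last line mark steps that the single slope understates, and negative values mark steps it overstates.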

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with `gen_health` as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

mod_genhealth <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)


tidy(mod_genhealth, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dummy Variable: General Health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dummy Variable: General Health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

Write out the fitted regression equation.

b_genhealth <- round(coef(mod_genhealth), 3)
ci_genhealth <- round(confint(mod_genhealth), 3)

The model is: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

\[\widehat{\text{Mental Health Days}} = 9.593 - 0.087(\text{Age}) + 1.725(\text{Female}) + 0.231(\text{Phys Days}) - 0.587(\text{Sleep}) + 0.790(\text{Very Good Health}) + 1.844(\text{Good Health}) + 3.395(\text{Fair Health}) + 5.335(\text{Poor Health})\]

3b. (10 pts) Interpret every dummy variable coefficient for `gen_health` in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

The reference group for general health is respondents reporting “Excellent” general health, and each gen_health dummy variable compares one other group with that group. Holding all other variables constant, respondents reporting “Very good” general health report, on average, 0.79 more mentally unhealthy days in the past 30 days than those reporting “Excellent” health. Those reporting “Good” health report 1.84 more days, those reporting “Fair” health report 3.40 more days, and those reporting “Poor” health report 5.34 more days than those reporting “Excellent” health, in each case holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the `gen_health` dummy variables only. Which group differs most from the reference group?

tidy(mod_genhealth, conf.int = TRUE) |>
  # Keep only the gen_health dummy variables
  filter(str_detect(term, "^gen_health")) |>
  mutate(
    term = dplyr::recode(term,
      "gen_healthVery good" = "General health: Very good",
      "gen_healthGood"      = "General health: Good",
      "gen_healthFair"      = "General health: Fair",
      "gen_healthPoor"      = "General health: Poor"
    ),
    term = fct_reorder(term, estimate),
    sig  = ifelse(p.value < 0.05, "Significant (p < 0.05)", "Non-significant")
  ) |>
  ggplot(aes(x = estimate, y = term, color = sig)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray60") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 0.9) +
  geom_point(size = 3.5) +
  scale_color_manual(values = c("Significant (p < 0.05)" = "steelblue",
                                "Non-significant" = "tomato")) +
  labs(
    title    = "Partial Regression Coefficients with 95% Confidence Intervals",
    subtitle = "Outcome: Mentally Unhealthy Days (BRFSS 2020, n = 5,000)",
    x        = "Estimated Change in Mental Health Days (β̂)",
    y        = NULL,
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

People reporting “Poor” general health have the largest difference from the reference group.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use `relevel()` to change the reference group for `gen_health` to “Good.” Refit the model from Task 3a.

# Change reference group to "Good"
brfss_dv$gen_reref <- relevel(brfss_dv$gen_health, ref = "Good")

mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_reref, data = brfss_dv)

tidy(mod_gen_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: Good)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_rerefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_rerefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_rerefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_rerefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671
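
The releveled coefficients are simple re-expressions of the original ones. As a quick arithmetic check (a sketch that hard-codes the Task 3a estimates rather than refitting the model), subtracting the original “Good” coefficient from each original gen_health coefficient gives the values in the table above, and adding it to the original intercept gives the new intercept.

```r
# gen_health estimates from Task 3a (reference: Excellent, so Excellent = 0)
orig <- c(Excellent = 0, `Very good` = 0.7899, Good = 1.8436,
          Fair = 3.3953, Poor = 5.3353)
orig_intercept <- 9.5930

# Re-express every group relative to "Good"
relative_to_good <- round(orig - orig["Good"], 4)
relative_to_good

# The intercept moves to the new reference group
new_intercept <- orig_intercept + unname(orig["Good"])
new_intercept
```

The entry for “Good” itself is 0, which is why it drops out of the releveled table.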

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

Yes. The coefficients for age, sex, physical health days, and sleep hours are identical in the two models (for example, age is -0.0867 in both). Changing the reference group only re-expresses the terms that are defined relative to it: the gen_health dummy variables and the intercept, which now represents the “Good” group rather than the “Excellent” group. The other predictors are unaffected.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

tribble(
  ~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
 "R-squared", round(summary(mod_genhealth)$r.squared, 4), round(summary(mod_gen_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_genhealth)$sigma, 3), round(summary(mod_gen_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Excellent	Ref: Good
R-squared	0.1694	0.1694
Residual SE	7.2080	7.2080

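Changing the reference group does not change the predictions because both versions of the model contain the same information: the same set of gen_health groups, described relative to a different baseline. Only the intercept and the dummy coefficients are re-expressed; the fitted values, \(R^2\), and residual standard error are identical, as the table shows.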
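
The task asks for the correlation between the two sets of predictions, which the R-squared and residual SE comparison does not compute directly. A minimal sketch of that check is given below on simulated data (the data frame `sim`, its variables, and its group means are hypothetical). With the lab objects, the same idea would be `cor(fitted(mod_genhealth), fitted(mod_gen_reref))`.

```r
set.seed(553)
lv <- c("Excellent", "Very good", "Good", "Fair", "Poor")
sim <- data.frame(
  age        = round(runif(500, 18, 80)),
  gen_health = factor(sample(lv, 500, replace = TRUE), levels = lv)
)
# Hypothetical outcome: group shifts plus an age effect and noise
sim$y <- c(2, 3, 4, 6, 9)[as.integer(sim$gen_health)] -
  0.05 * sim$age + rnorm(500, sd = 3)

# Model with the default reference group ("Excellent")
fit_excellent <- lm(y ~ age + gen_health, data = sim)

# Same model with "Good" as the reference group
sim$gen_reref <- relevel(sim$gen_health, ref = "Good")
fit_good <- lm(y ~ age + gen_reref, data = sim)

# Predictions agree up to floating-point error
pred_cor <- cor(fitted(fit_excellent), fitted(fit_good))
max_diff <- max(abs(fitted(fit_excellent) - fitted(fit_good)))
c(correlation = pred_cor, max_abs_difference = max_diff)
```

A correlation of 1 together with a maximum absolute difference near zero indicates that the two models produce the same fitted values.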

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)


tribble(
  ~Quantity, ~`Full Model`, ~`Reduced Model`,
  "R-squared", round(summary(mod_genhealth)$r.squared, 4), round(summary(mod_reduced)$r.squared, 4),
  "Residual SE", round(summary(mod_genhealth)$sigma, 3), round(summary(mod_reduced)$sigma, 3)
) |>
  kable(caption = "Comparing the Full and Reduced Models") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing the Full and Reduced Models
Quantity	Full Model	Reduced Model
R-squared	0.1694	0.1522
Residual SE	7.2080	7.2800
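
The task also asks for adjusted \(R^2\), which this table does not include. A hedged sketch of how it could be pulled from summary() follows, again on simulated data (`sim`, `x`, `grp`, and `y` are hypothetical). Applied to the lab, the same `summary(...)$adj.r.squared` call would be used on `mod_reduced` and `mod_genhealth`.

```r
set.seed(553)
sim <- data.frame(
  x   = rnorm(300),
  grp = factor(sample(c("a", "b", "c"), 300, replace = TRUE))
)
# Hypothetical outcome with a covariate effect and group shifts
sim$y <- 1 + 0.5 * sim$x + c(0, 0.4, 1)[as.integer(sim$grp)] + rnorm(300)

fit_reduced <- lm(y ~ x, data = sim)
fit_full    <- lm(y ~ x + grp, data = sim)

fit_stats <- data.frame(
  model  = c("Reduced", "Full"),
  r2     = c(summary(fit_reduced)$r.squared,     summary(fit_full)$r.squared),
  adj_r2 = c(summary(fit_reduced)$adj.r.squared, summary(fit_full)$adj.r.squared)
)
fit_stats

# Adjusted R-squared by hand for the full model:
# 1 - (1 - R2) * (n - 1) / residual degrees of freedom
n <- nrow(sim)
adj_by_hand <- 1 - (1 - fit_stats$r2[2]) * (n - 1) / df.residual(fit_full)
adj_by_hand
```

Unlike \(R^2\), the adjusted version penalizes the extra dummy variables, so it rises only if the added categorical predictor explains enough additional variation.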

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

# Partial F-test: reduced model (no gen_health) vs. full model
f_test <- anova(mod_reduced, mod_genhealth)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does General Health Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0
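
The hypotheses and the test statistic that the task asks for can be written out from the anova() output. The subscripted β symbols below are notation introduced here for the four gen_health dummy coefficients, and \(q = 4\) is the number of coefficients being tested.

```latex
\[H_0: \beta_{\text{Very good}} = \beta_{\text{Good}} = \beta_{\text{Fair}} = \beta_{\text{Poor}} = 0 \quad \text{vs.} \quad H_A: \text{at least one of these coefficients} \neq 0\]

\[F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}})/q}{\text{RSS}_{\text{full}}/\text{df}_{\text{full}}} = \frac{5379.751/4}{259335.4/4991} \approx \frac{1344.94}{51.96} \approx 25.88\]
```

With 4 and 4,991 degrees of freedom, the reported p-value rounds to 0 (that is, p < 0.0001), so the null hypothesis that general health adds nothing beyond age, sex, physical health days, and sleep is rejected.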

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(mod_genhealth, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

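Yes. The Type III test for gen_health matches the partial F-test comparing the full and reduced models: both give F = 25.88 on 4 and 4,991 degrees of freedom, with the same sum of squares (5,379.75). This is expected, because both tests compare the full model with the model that drops gen_health.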

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

This analysis of BRFSS 2020 data shows that people who rated their general health as “Poor” differed most from those who rated it as “Excellent.” After accounting for age, sex, physical health, and sleep, people in “Poor” general health reported about 5 more mentally unhealthy days in a 30-day period than people in “Excellent” health, and the gap grew steadily as general health ratings worsened. Because BRFSS collects all of this information at a single point in time, it cannot show whether poorer general health leads to worse mental health or the other way around.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

The education model indicated that higher levels of education corresponded to fewer mentally unhealthy days, while our lab model indicated that poorer self-rated general health corresponded to more mentally unhealthy days. Both models adjusted for sleep, age, sex, and physically unhealthy days, and their \(R^2\) values were similar: 0.155 for education vs. 0.169 for general health. However, the Type III ANOVA for each model showed a much larger F-statistic for general health than for education (about 25 vs. 6), and the general health differences were large, with respondents in “Poor” general health expected to report about 5 more mentally unhealthy days than those in “Excellent” health. Because these measures point to a stronger association with mentally unhealthy days for general health than for education, I would recommend using general health in the final model.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Alex Mossawir
Instructor: Muntasir Masum

March 23, 2026
