Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_xpt(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Store the raw education numeric code for the "naive approach" demonstration
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_dv,
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Dummy Variables/brfss_dv_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
1Mean (SD); n (%)

ggplot(brfss_dv, aes(x = education, fill = education)) +
  geom_bar(alpha = 0.85) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Education Level in Analytic Sample

ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Education Level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Education Level",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by Education Level

Part 1: Guided Practice — Dummy Variables

1. Categorical Variables: The Problem

1.1 Types of Categorical Variables

Categorical predictor variables come in two forms:

Type	Definition	Examples
Nominal	Categories with no natural ordering	Sex, race/ethnicity, marital status, blood type
Ordinal	Categories with a meaningful order	Education level, income bracket, disease stage, Likert scale

A further distinction is:

Dichotomous (k = 2): Only two categories (e.g., sex: Male/Female)
Multichotomous (k > 2): Three or more categories (e.g., education: 4 levels)

Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.

1.2 The Naive Approach: Why Numeric Codes Fail

Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.

If we include this numeric code directly in a regression model, we are assuming:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]

This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.

# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Education Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Education Treated as Continuous
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	9.5601	0.5039	18.9727	8.5723	10.5479
age	-0.0661	0.0063	-10.5135	-0.0784	-0.0538
educ_numeric	-0.7168	0.1158	-6.1917	-0.9437	-0.4898

This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:

For nominal variables (like race or marital status), numeric codes are entirely arbitrary. The “distance” between code 1 and code 2 has no meaning.
Even for ordinal variables (like education), the assumption of equal spacing is rarely justified. The jump from “Less than HS” to “HS graduate” is substantively different from “Some college” to “College graduate.”

Let’s visualize why this matters:

# Compute observed group means
group_means <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))

# Generate predictions from the naive model
pred_naive <- tibble(
  educ_numeric = 1:4,
  predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)

ggplot() +
  geom_point(data = group_means,
             aes(x = educ_numeric, y = mean_days),
             size = 4, color = "steelblue") +
  geom_line(data = pred_naive,
            aes(x = educ_numeric, y = predicted),
            color = "tomato", linewidth = 1.2, linetype = "dashed") +
  geom_point(data = pred_naive,
             aes(x = educ_numeric, y = predicted),
             size = 3, color = "tomato", shape = 17) +
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
  ) +
  labs(
    title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
    subtitle = "The naive model forces equal spacing between education levels",
    x = "Education Level",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Naive Linear Fit vs. Actual Group Means by Education

Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.

2. Dummy (Indicator) Variables

2.1 Definition and the k - 1 Rule

A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:

1 to indicate the presence of a condition
0 to indicate its absence

If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).

Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.

2.2 The Dichotomous Case (k = 2)

The simplest example is a variable with two categories, such as sex.

With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:

\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]

The regression model becomes:

\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]

For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]

For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]

Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.

# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)

tidy(mod_sex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: Sex",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: Sex
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	6.6262	0.3730	17.7666	5.8951	7.3574
age	-0.0698	0.0063	-11.1011	-0.0821	-0.0575
sexFemale	1.8031	0.2210	8.1585	1.3698	2.2364

b_sex <- round(coef(mod_sex), 3)

Interpretation:

Intercept (6.626): The estimated mean number of mentally unhealthy days for a male of age 0. (This is a mathematical artifact, not a substantive value.)
Age (-0.07): Each additional year of age is associated with a 0.07 day change in mentally unhealthy days, holding sex constant.
Sex: Female (1.803): Compared to males (the reference group), females report an estimated 1.803 more mentally unhealthy days on average, holding age constant.

Note that R automatically creates dummy variables when a factor is included in lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.

pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))

ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Sex",
    subtitle = "Parallel lines: same slope, different intercepts",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Parallel Regression Lines: Males vs. Females

Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.

3. Multichotomous Dummy Variables (k > 2)

3.1 Constructing the Dummies

Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:

\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]

The data matrix looks like this:

Dummy Variable Encoding for Education (Reference: Less than HS)
Observation	Education	HS_graduate	Some_college	College_graduate
1	Less than HS	0	0	0
2	HS graduate	1	0	0
3	Some college	0	1	0
4	College graduate	0	0	1
5	Less than HS	0	0	0

Notice that the reference group is identified by having all dummy variables equal to zero.

3.2 The Reference Group

The reference group is the category against which all others are compared. Key points:

The choice of reference group affects the interpretation of coefficients but not the model’s predictive ability
The predicted values are identical regardless of which reference group is chosen
Choose the reference group based on your research question (e.g., the most common category, the “control” condition, or the group of primary interest)
You can always test differences between any pair of groups, not just comparisons to the reference

3.3 Fitting the Model in R

When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.

# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
               data = brfss_dv)

tidy(mod_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Education Dummy Variables (Reference: Less than HS)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Education Dummy Variables (Reference: Less than HS)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

b_educ <- round(coef(mod_educ), 3)
ci_educ <- round(confint(mod_educ), 3)

3.4 Interpreting Each Dummy Coefficient

The model is:

\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]

Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:

HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.

Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).

3.5 Visualizing the Parallel Lines

pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Education",
    subtitle = "Parallel lines: same slopes for age, different intercepts by education",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Predicted Mental Health Days by Age and Education Level

These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.

4. Changing the Reference Group

4.1 Using `relevel()` in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.

# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")

mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
                     data = brfss_dv)

tidy(mod_educ_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: College graduate)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: College graduate)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.9948	0.6272	15.9349	0.0000	8.7652	11.2245
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
education_rerefLess than HS	1.1429	0.4607	2.4805	0.0132	0.2396	2.0461
education_rerefHS graduate	0.5556	0.2574	2.1586	0.0309	0.0510	1.0601
education_rerefSome college	1.0139	0.2566	3.9507	0.0001	0.5108	1.5171

4.2 What Changes and What Stays the Same?

tribble(
  ~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
  "Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
  "Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
  "Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
  "Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
  "Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
  "R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Less than HS	Ref: College graduate
Intercept	11.1380	9.9950
Age coefficient	-0.0770	-0.0770
Sex coefficient	1.6810	1.6810
Physical health days	0.3110	0.3110
Sleep hours	-0.6280	-0.6280
R-squared	0.1553	0.1553
Residual SE	7.2690	7.2690

What changes:

The intercept
The education dummy variable coefficients (they now represent comparisons to the new reference group)

What stays the same:

Coefficients for age, sex, physical health days, and sleep
\(R^2\), Adjusted \(R^2\), and Residual Standard Error
All predicted values
The overall F-test

This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.

# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

5. The Dummy Variable Trap (Perfect Multicollinearity)

5.1 Why We Cannot Include All k Dummies

If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:

\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]

where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.

This is called the dummy variable trap.

The Dummy Variable Trap: Intercept = A + B + C (perfect linear dependence)
Obs	Intercept	A	B	C	A + B + C
1	1	1	0	0	1
2	1	0	1	0	1
3	1	0	0	1	1
4	1	1	0	0	1

Solutions:

Reference cell coding (default in R): Include \(k - 1\) dummies; the omitted category is absorbed into the intercept
Remove the intercept: Fit the model with - 1 in the formula and include all \(k\) dummies. Then each coefficient is the group mean (adjusted for other predictors) rather than a difference from a reference.

# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
                 data = brfss_dv)

tidy(mod_no_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model Without Intercept: All k Education Dummies Included",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Without Intercept: All k Education Dummies Included
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexMale	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
sexFemale	12.8190	0.7524	17.0365	0.0000	11.3439	14.2941
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.

6. Testing Whether a Categorical Variable is Significant

6.1 The Partial F-Test (Type I and Type III)

When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.

To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):

\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]

This compares the full model (with education) to a reduced model (without education):

# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

# Partial F-test
f_test <- anova(mod_reduced, mod_educ)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does Education Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does Education Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education	4992	263744.4	3	970.7509	6.1246	4e-04

6.2 Using `car::Anova()` for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:

Anova(mod_educ, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12000.1867	1	227.1325	0e+00
age	8863.3522	1	167.7603	0e+00
sex	3469.6448	1	65.6714	0e+00
physhlth_days	28765.1139	1	544.4492	0e+00
sleep_hrs	3506.1243	1	66.3619	0e+00
education	970.7509	3	6.1246	4e-04
Residuals	263744.4348	4992	NA	NA

Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.

7. Contrasts and Alternative Coding Schemes

7.1 Reference (Treatment) Coding — The Default

This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.

contrasts(brfss_dv$education)

##                  HS graduate Some college College graduate
## Less than HS               0            0                0
## HS graduate                1            0                0
## Some college               0            1                0
## College graduate           0            0                1

7.2 Effect (Deviation) Coding

In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.

# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)

mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
                 data = brfss_dv)

tidy(mod_effect, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
      str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
      str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Effect Coding: Each Education Coefficient vs. Grand Mean
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Less than HS vs. Grand Mean	0.4648	0.3323	1.3988	0.1619	-0.1866	1.1162
Education: HS graduate vs. Grand Mean	-0.1225	0.1939	-0.6319	0.5275	-0.5026	0.2576
Education: Some college vs. Grand Mean	0.3358	0.1946	1.7257	0.0845	-0.0457	0.7174

With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.

7.3 Polynomial (Orthogonal) Contrasts for Ordinal Variables

When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.

# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)

mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
              data = brfss_dv)

tidy(mod_ord, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "\\.L$") ~ "Education: Linear trend",
      str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
      str_detect(term, "\\.C$") ~ "Education: Cubic trend",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Linear trend	-0.6642	0.3158	-2.1028	0.0355	-1.2833	-0.0450
Education: Quadratic trend	-0.2133	0.2682	-0.7954	0.4264	-0.7391	0.3125
Education: Cubic trend	-0.5630	0.2142	-2.6282	0.0086	-0.9830	-0.1431

Interpretation:

The linear contrast tests whether there is a consistent trend (higher education = fewer/more days)
The quadratic contrast tests whether the relationship curves (U-shaped or inverted-U)
The cubic contrast tests for an S-shaped pattern

Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.

7.4 Coding Scheme Comparison Summary

Summary of Dummy Variable Coding Schemes
Coding Scheme	R Function	Intercept	Each β represents	Best for
Treatment (Reference)	contr.treatment (default)	Reference group mean	Difference from reference group	Group comparisons to baseline
Effect (Deviation)	contr.sum	Grand mean	Deviation from grand mean	ANOVA-style analyses
Polynomial (Ordinal)	contr.poly	Grand mean	Linear/quadratic/cubic trend	Ordinal variables with ordered levels

8. Practical Considerations

8.1 Choosing the Reference Group

Guidelines for choosing the reference group:

Most common category — maximizes the precision of comparisons (largest reference sample)
Control or baseline condition — natural in experimental or clinical settings
Epidemiologically meaningful comparator — the group to which you want to compare all others
Consistency with published literature — facilitates comparison across studies

8.2 When `as.factor()` Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:

# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)

# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)

# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
  ~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
  "Numeric (wrong)", 1, 1,
  "Factor (correct)", 3, 3
) |>
  kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Numeric vs. Factor Treatment of Categorical Variables
Model	Number of education coefficients	Degrees of freedom used
Numeric (wrong)	1	1
Factor (correct)	3	3

8.3 Comparing Non-Reference Groups

What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:

Option 1: Change the reference group with relevel().

Option 2: Compute the difference manually from the model output.

# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]

# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")

cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")

## Estimated difference (HS grad - College grad): 0.556 days

cat("F-statistic:", round(lin_test$F[2], 3), "\n")

## F-statistic: 4.66

cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## p-value: 0.0309

car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.

Summary of Key Concepts

Concept	Key Point
Categorical predictors	Cannot be included as raw numeric codes in regression
Dummy variables	Binary (0/1) indicators; need \(k - 1\) for \(k\) categories
Reference group	The omitted category; all comparisons are relative to it
Changing reference	Use `relevel()`; predictions unchanged, interpretation changes
Partial F-test	Tests whether the categorical variable as a whole is significant
Dummy variable trap	Including \(k\) dummies + intercept = perfect multicollinearity
`as.factor()`	Required when categorical variable is stored as numeric
Coding schemes	Treatment (default), effect, polynomial — each answers a different question
Type III ANOVA	Preferred for unbalanced observational data
Linear hypothesis	`linearHypothesis()` tests comparisons between non-reference groups

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Dummy Variables/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 23, 2026