In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
brfss_full <- read_xpt(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
clean_names()brfss_dv <- brfss_full |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# General health status (5-level)
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Marital status
marital_status = factor(case_when(
marital == 1 ~ "Married",
marital == 2 ~ "Divorced",
marital == 3 ~ "Widowed",
marital == 4 ~ "Separated",
marital == 5 ~ "Never married",
marital == 6 ~ "Unmarried couple",
TRUE ~ NA_character_
), levels = c("Married", "Divorced", "Widowed", "Separated",
"Never married", "Unmarried couple")),
# Store the raw education numeric code for the "naive approach" demonstration
educ_numeric = case_when(
educa %in% c(1, 2, 3) ~ 1,
educa == 4 ~ 2,
educa == 5 ~ 3,
educa == 6 ~ 4,
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(gen_health),
!is.na(marital_status)
)
# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health, marital_status, educ_numeric) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_dv,
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Dummy Variables/brfss_dv_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dv), ncol(brfss_dv))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Sleep (hours/night) | 5,000 | 7.0 (1.4) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
ggplot(brfss_dv, aes(x = education, fill = education)) +
geom_bar(alpha = 0.85) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Distribution of Education Level in Analytic Sample
ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Education Level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Education Level",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by Education Level
Categorical predictor variables come in two forms:
| Type | Definition | Examples |
|---|---|---|
| Nominal | Categories with no natural ordering | Sex, race/ethnicity, marital status, blood type |
| Ordinal | Categories with a meaningful order | Education level, income bracket, disease stage, Likert scale |
A further distinction is:
Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.
Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.
If we include this numeric code directly in a regression model, we are assuming:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]
This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Education Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5601 | 0.5039 | 18.9727 | 0 | 8.5723 | 10.5479 |
| age | -0.0661 | 0.0063 | -10.5135 | 0 | -0.0784 | -0.0538 |
| educ_numeric | -0.7168 | 0.1158 | -6.1917 | 0 | -0.9437 | -0.4898 |
This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:
Let’s visualize why this matters:
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
educ_numeric = 1:4,
predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)
ggplot() +
geom_point(data = group_means,
aes(x = educ_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = educ_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = educ_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:4,
labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "Education Level",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)Naive Linear Fit vs. Actual Group Means by Education
Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:
If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).
Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.
The simplest example is a variable with two categories, such as sex.
With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:
\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]
The regression model becomes:
\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]
For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]
For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]
Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.
# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)
tidy(mod_sex, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: Sex",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.6262 | 0.3730 | 17.7666 | 0 | 5.8951 | 7.3574 |
| age | -0.0698 | 0.0063 | -11.1011 | 0 | -0.0821 | -0.0575 |
| sexFemale | 1.8031 | 0.2210 | 8.1585 | 0 | 1.3698 | 2.2364 |
Interpretation:
Note that R automatically creates dummy variables when a factor is included in
lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.
pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))
ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Sex",
subtitle = "Parallel lines: same slope, different intercepts",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Parallel Regression Lines: Males vs. Females
Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.
Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:
\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]
The data matrix looks like this:
| Observation | Education | HS_graduate | Some_college | College_graduate |
|---|---|---|---|---|
| 1 | Less than HS | 0 | 0 | 0 |
| 2 | HS graduate | 1 | 0 | 0 |
| 3 | Some college | 0 | 1 | 0 |
| 4 | College graduate | 0 | 0 | 1 |
| 5 | Less than HS | 0 | 0 | 0 |
Notice that the reference group is identified by having all dummy variables equal to zero.
The reference group is the category against which all others are compared. Key points:
When we include a factor variable in lm(), R
automatically creates the dummy variables. The first level of the factor
is used as the reference group by default.
# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
data = brfss_dv)
tidy(mod_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Education Dummy Variables (Reference: Less than HS)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
The model is:
\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]
Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:
HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.
Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Education",
subtitle = "Parallel lines: same slopes for age, different intercepts by education",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Predicted Mental Health Days by Age and Education Level
These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.
relevel() in RWe may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.
# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")
mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
data = brfss_dv)
tidy(mod_educ_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: College graduate)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9948 | 0.6272 | 15.9349 | 0.0000 | 8.7652 | 11.2245 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| education_rerefLess than HS | 1.1429 | 0.4607 | 2.4805 | 0.0132 | 0.2396 | 2.0461 |
| education_rerefHS graduate | 0.5556 | 0.2574 | 2.1586 | 0.0309 | 0.0510 | 1.0601 |
| education_rerefSome college | 1.0139 | 0.2566 | 3.9507 | 0.0001 | 0.5108 | 1.5171 |
tribble(
~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
"Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
"Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
"Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
"Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
"Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
"R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
"Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Less than HS | Ref: College graduate |
|---|---|---|
| Intercept | 11.1380 | 9.9950 |
| Age coefficient | -0.0770 | -0.0770 |
| Sex coefficient | 1.6810 | 1.6810 |
| Physical health days | 0.3110 | 0.3110 |
| Sleep hours | -0.6280 | -0.6280 |
| R-squared | 0.1553 | 0.1553 |
| Residual SE | 7.2690 | 7.2690 |
What changes:
What stays the same:
This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:
\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]
where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.
This is called the dummy variable trap.
| Obs | Intercept | A | B | C | A + B + C |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 1 |
Solutions:
- 1 in the formula and include all \(k\) dummies. Then each coefficient is the
group mean (adjusted for other predictors) rather than a difference from
a reference.# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
data = brfss_dv)
tidy(mod_no_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model Without Intercept: All k Education Dummies Included",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexMale | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| sexFemale | 12.8190 | 0.7524 | 17.0365 | 0.0000 | 11.3439 | 14.2941 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.
When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.
To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):
\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]
This compares the full model (with education) to a reduced model (without education):
# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
# Partial F-test
f_test <- anova(mod_reduced, mod_educ)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education | 4992 | 263744.4 | 3 | 970.7509 | 6.1246 | 4e-04 |
car::Anova() for Type III TestsThe car::Anova() function with type = "III"
provides a convenient way to test the overall significance of each
predictor, including categorical variables:
Anova(mod_educ, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12000.1867 | 1 | 227.1325 | 0e+00 |
| age | 8863.3522 | 1 | 167.7603 | 0e+00 |
| sex | 3469.6448 | 1 | 65.6714 | 0e+00 |
| physhlth_days | 28765.1139 | 1 | 544.4492 | 0e+00 |
| sleep_hrs | 3506.1243 | 1 | 66.3619 | 0e+00 |
| education | 970.7509 | 3 | 6.1246 | 4e-04 |
| Residuals | 263744.4348 | 4992 | NA | NA |
Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
This is what R uses by default (contr.treatment). Each
coefficient represents the difference between a group and the reference
group.
## HS graduate Some college College graduate
## Less than HS 0 0 0
## HS graduate 1 0 0
## Some college 0 1 0
## College graduate 0 0 1
In effect coding (contr.sum), each
coefficient represents the difference between a group’s mean and the
grand mean (the unweighted average of all group means).
This is common in ANOVA contexts.
# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)
mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
data = brfss_dv)
tidy(mod_effect, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Less than HS vs. Grand Mean | 0.4648 | 0.3323 | 1.3988 | 0.1619 | -0.1866 | 1.1162 |
| Education: HS graduate vs. Grand Mean | -0.1225 | 0.1939 | -0.6319 | 0.5275 | -0.5026 | 0.2576 |
| Education: Some college vs. Grand Mean | 0.3358 | 0.1946 | 1.7257 | 0.0845 | -0.0457 | 0.7174 |
With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.
When a categorical variable is truly ordinal (like
education), we can test for specific patterns using orthogonal
polynomial contrasts (contr.poly). These decompose the
group differences into linear, quadratic, and cubic trends.
# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)
mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
data = brfss_dv)
tidy(mod_ord, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "\\.L$") ~ "Education: Linear trend",
str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
str_detect(term, "\\.C$") ~ "Education: Cubic trend",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Linear trend | -0.6642 | 0.3158 | -2.1028 | 0.0355 | -1.2833 | -0.0450 |
| Education: Quadratic trend | -0.2133 | 0.2682 | -0.7954 | 0.4264 | -0.7391 | 0.3125 |
| Education: Cubic trend | -0.5630 | 0.2142 | -2.6282 | 0.0086 | -0.9830 | -0.1431 |
Interpretation:
Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
| Coding Scheme | R Function | Intercept | Each β represents | Best for |
|---|---|---|---|---|
| Treatment (Reference) | contr.treatment (default) | Reference group mean | Difference from reference group | Group comparisons to baseline |
| Effect (Deviation) | contr.sum | Grand mean | Deviation from grand mean | ANOVA-style analyses |
| Polynomial (Ordinal) | contr.poly | Grand mean | Linear/quadratic/cubic trend | Ordinal variables with ordered levels |
Guidelines for choosing the reference group:
as.factor() Is RequiredIf a categorical variable is stored as numeric in your data (e.g.,
coded 0, 1, 2, 3), R will treat it as continuous by default. You
must use as.factor() or
factor() to tell R it is categorical:
# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)
# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)
# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
"Numeric (wrong)", 1, 1,
"Factor (correct)", 3, 3
) |>
kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | Number of education coefficients | Degrees of freedom used |
|---|---|---|
| Numeric (wrong) | 1 | 1 |
| Factor (correct) | 3 | 3 |
What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:
Option 1: Change the reference group with
relevel().
Option 2: Compute the difference manually from the model output.
# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]
# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")
cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")## Estimated difference (HS grad - College grad): 0.556 days
## F-statistic: 4.66
## p-value: 0.0309
car::linearHypothesis()is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
| Concept | Key Point |
|---|---|
| Categorical predictors | Cannot be included as raw numeric codes in regression |
| Dummy variables | Binary (0/1) indicators; need \(k - 1\) for \(k\) categories |
| Reference group | The omitted category; all comparisons are relative to it |
| Changing reference | Use relevel(); predictions unchanged, interpretation
changes |
| Partial F-test | Tests whether the categorical variable as a whole is significant |
| Dummy variable trap | Including \(k\) dummies + intercept = perfect multicollinearity |
as.factor() |
Required when categorical variable is stored as numeric |
| Coding schemes | Treatment (default), effect, polynomial — each answers a different question |
| Type III ANOVA | Preferred for unbalanced observational data |
| Linear hypothesis | linearHypothesis() tests comparisons between
non-reference groups |
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Dummy Variables/brfss_dv_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
End of Lab Activity