Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_xpt("~/Downloads/LLCP2020.XPT ") |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Numeric code for the collapsed 4-level education variable ("naive approach" demonstration)
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_dv, "~/Downloads/brfss_dv_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

Table 1. Descriptive Statistics: BRFSS 2020 Analytic Sample (n = 5,000)
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
¹ Mean (SD); n (%)

ggplot(brfss_dv, aes(x = education, fill = education)) +
  geom_bar(alpha = 0.85) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Education Level in Analytic Sample

ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Education Level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Education Level",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by Education Level

Part 1: Guided Practice — Dummy Variables

1. Categorical Variables: The Problem

1.1 Types of Categorical Variables

Categorical predictor variables come in two forms:

Type	Definition	Examples
Nominal	Categories with no natural ordering	Sex, race/ethnicity, marital status, blood type
Ordinal	Categories with a meaningful order	Education level, income bracket, disease stage, Likert scale

A further distinction is:

Dichotomous (k = 2): Only two categories (e.g., sex: Male/Female)
Multichotomous (k > 2): Three or more categories (e.g., education: 4 levels)

Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.

1.2 The Naive Approach: Why Numeric Codes Fail

Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.

If we include this numeric code directly in a regression model, we are assuming:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]

This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.

# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Education Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Education Treated as Continuous
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5601	0.5039	18.9727	0.0000	8.5723	10.5479
age	-0.0661	0.0063	-10.5135	0.0000	-0.0784	-0.0538
educ_numeric	-0.7168	0.1158	-6.1917	0.0000	-0.9437	-0.4898

This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:

For nominal variables (like race or marital status), numeric codes are entirely arbitrary. The “distance” between code 1 and code 2 has no meaning.
Even for ordinal variables (like education), the assumption of equal spacing is rarely justified. The jump from “Less than HS” to “HS graduate” is substantively different from “Some college” to “College graduate.”

Let’s visualize why this matters:

# Compute observed group means
group_means <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))

# Generate predictions from the naive model
pred_naive <- tibble(
  educ_numeric = 1:4,
  predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)

ggplot() +
  geom_point(data = group_means,
             aes(x = educ_numeric, y = mean_days),
             size = 4, color = "steelblue") +
  geom_line(data = pred_naive,
            aes(x = educ_numeric, y = predicted),
            color = "tomato", linewidth = 1.2, linetype = "dashed") +
  geom_point(data = pred_naive,
             aes(x = educ_numeric, y = predicted),
             size = 3, color = "tomato", shape = 17) +
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
  ) +
  labs(
    title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
    subtitle = "The naive model forces equal spacing between education levels",
    x = "Education Level",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Naive Linear Fit vs. Actual Group Means by Education

Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
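The equal-spacing constraint can also be checked formally, because the numeric-code model is a special case of the model that treats the same variable as a factor. The sketch below uses simulated data (an assumption, not the BRFSS sample) with four ordered groups whose true means are deliberately unevenly spaced; the names sim, level_code, and level_factor are hypothetical.

```r
# Simulated data (assumption): four ordered groups with unevenly spaced means
set.seed(42)
true_means <- c(5, 2, 4, 1)
sim <- data.frame(level_code = rep(1:4, each = 100))
sim$y <- true_means[sim$level_code] + rnorm(nrow(sim), sd = 1)
sim$level_factor <- factor(sim$level_code)

fit_numeric <- lm(y ~ level_code,   data = sim)  # one slope: equal steps forced
fit_factor  <- lm(y ~ level_factor, data = sim)  # k - 1 dummies: free group means

# Fitted value for each group under both models, next to the sample means
fitted_numeric <- predict(fit_numeric, data.frame(level_code = 1:4))
fitted_factor  <- predict(fit_factor,  data.frame(level_factor = factor(1:4)))
sample_means   <- tapply(sim$y, sim$level_code, mean)
print(round(cbind(sample_means, fitted_factor, fitted_numeric), 2))

# The numeric model is nested in the factor model, so an F-test compares them;
# it uses k - 2 = 2 degrees of freedom
spacing_test <- anova(fit_numeric, fit_factor)
print(spacing_test)
```

With no other covariates, the factor model reproduces the sample means exactly, while the numeric model can only place the four fitted values on a straight line. A small p-value in the comparison suggests that the equal-spacing restriction does not fit the data.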

2. Dummy (Indicator) Variables

2.1 Definition and the k - 1 Rule

A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:

1 to indicate the presence of a condition
0 to indicate its absence

If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).

Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.

2.2 The Dichotomous Case (k = 2)

The simplest example is a variable with two categories, such as sex.

With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:

\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]

The regression model becomes:

\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]

For males (\(\text{male} = 1\)): \[E(Y \mid \text{age}, \text{male} = 1) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]

For females (\(\text{male} = 0\)): \[E(Y \mid \text{age}, \text{male} = 0) = \beta_0 + \beta_1 \cdot \text{age}\]

Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.

# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)

tidy(mod_sex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: Sex",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: Sex
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	6.6262	0.3730	17.7666	0.0000	5.8951	7.3574
age	-0.0698	0.0063	-11.1011	0.0000	-0.0821	-0.0575
sexFemale	1.8031	0.2210	8.1585	0.0000	1.3698	2.2364

b_sex <- round(coef(mod_sex), 3)

Interpretation:

Intercept (6.626): The estimated mean number of mentally unhealthy days for a male of age 0. (This is a mathematical artifact, not a substantive value.)
Age (-0.07): Each additional year of age is associated with an estimated 0.07-day decrease in mentally unhealthy days, holding sex constant.
Sex: Female (1.803): Compared to males (the reference group), females report an estimated 1.803 more mentally unhealthy days on average, holding age constant.

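Note that R automatically creates dummy variables when a factor is included in lm(). The reference group is the first level of the factor: the order given in levels = when the factor is created, or alphabetical order if no levels are specified. Here sex was defined with levels Male, Female, so Male is the reference and the reported coefficient is sexFemale. This is the reverse of the Female-reference coding used in the equations above; switching the reference only flips the sign of the sex coefficient and shifts the intercept.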
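To make the link between a hand-built indicator and R's automatic coding concrete, here is a minimal sketch on simulated data (an assumption; the data frame toy and its columns are hypothetical). It fits the same model twice, once with the factor and once with a manual male indicator that uses Female as the reference.

```r
# Simulated data (assumption): a two-level factor and a continuous covariate
set.seed(1)
n <- 200
toy <- data.frame(
  age = runif(n, 20, 80),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female"))
)
toy$y <- 5 - 0.05 * toy$age + 2 * (toy$sex == "Female") + rnorm(n)

# The first level of the factor is the reference group
levels(toy$sex)[1]

# Hand-built 0/1 indicator with Female as the reference
toy$male <- as.integer(toy$sex == "Male")

fit_factor <- lm(y ~ age + sex,  data = toy)  # R builds the sexFemale dummy
fit_manual <- lm(y ~ age + male, data = toy)  # our own indicator column

# Same magnitude, opposite sign: only the direction of comparison differs
coef(fit_factor)["sexFemale"]
coef(fit_manual)["male"]

# Without explicit levels, factor() sorts the labels alphabetically
levels(factor(c("Male", "Female")))
```

Both fits give identical fitted values; the choice of reference changes only how the sex coefficient and the intercept are read.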

pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))

ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Sex",
    subtitle = "Parallel lines: same slope, different intercepts",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Parallel Regression Lines: Males vs. Females

Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.

3. Multichotomous Dummy Variables (k > 2)

3.1 Constructing the Dummies

Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:

\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]

The data matrix looks like this:

Dummy Variable Encoding for Education (Reference: Less than HS)
Observation	Education	HS_graduate	Some_college	College_graduate
1	Less than HS	0	0	0
2	HS graduate	1	0	0
3	Some college	0	1	0
4	College graduate	0	0	1
5	Less than HS	0	0	0

Notice that the reference group is identified by having all dummy variables equal to zero.
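The encoding in the table can be reproduced with model.matrix(), which shows the design matrix R builds from a factor. This is a small sketch using the five illustrative observations from the table, not the BRFSS data.

```r
# The five illustrative observations from the table above
education <- factor(
  c("Less than HS", "HS graduate", "Some college", "College graduate", "Less than HS"),
  levels = c("Less than HS", "HS graduate", "Some college", "College graduate")
)

# Intercept column plus k - 1 = 3 indicator columns
X <- model.matrix(~ education)
print(X)
```

Rows 1 and 5 (the reference group) have zeros in all three indicator columns, and every other row has exactly one indicator equal to 1.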

3.2 The Reference Group

The reference group is the category against which all others are compared. Key points:

The choice of reference group affects the interpretation of coefficients but not the model’s predictive ability
The predicted values are identical regardless of which reference group is chosen
Choose the reference group based on your research question (e.g., the most common category, the “control” condition, or the group of primary interest)
You can always test differences between any pair of groups, not just comparisons to the reference

3.3 Fitting the Model in R

When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.

# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
               data = brfss_dv)

tidy(mod_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Education Dummy Variables (Reference: Less than HS)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Education Dummy Variables (Reference: Less than HS)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

b_educ <- round(coef(mod_educ), 3)
ci_educ <- round(confint(mod_educ), 3)

3.4 Interpreting Each Dummy Coefficient

The model is:

\[\widehat{\text{Mental Health Days}} = 11.138 - 0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) - 0.628(\text{Sleep}) - 0.587(\text{HS grad}) - 0.129(\text{Some college}) - 1.143(\text{College grad})\]

Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:

HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant. The 95% CI (-1.513 to 0.338) includes zero, so this difference is not statistically significant at the 0.05 level (p = 0.213).
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant. This difference is also not statistically significant (95% CI: -1.057 to 0.799; p = 0.785).
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant. This is the only education contrast whose 95% CI excludes zero (-2.046 to -0.240; p = 0.013).

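Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group). Here, \(-0.587 - (-1.143) = 0.556\), so HS graduates report an estimated 0.556 more mentally unhealthy days than college graduates, holding the other predictors constant. A standard error for this difference also requires the covariance between the two estimates.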
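The comparison between two non-reference groups can be sketched as follows, using simulated data (an assumption) with a hypothetical four-level factor grp. The standard error of the difference combines both variances and the covariance taken from vcov(); refitting with a different reference group serves as a cross-check.

```r
# Simulated data (assumption): four groups A-D and one continuous covariate
set.seed(7)
n     <- 400
grp   <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
x     <- rnorm(n)
shift <- c(A = 0, B = -0.6, C = -0.1, D = -1.1)
y     <- 1 + 0.5 * x + unname(shift[as.character(grp)]) + rnorm(n)

fit <- lm(y ~ x + grp)   # reference group: A (first level)
b   <- coef(fit)
V   <- vcov(fit)

# Difference between two non-reference groups: beta_B - beta_D
est <- unname(b["grpB"] - b["grpD"])
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 * Cov(b1, b2)
se  <- sqrt(V["grpB", "grpB"] + V["grpD", "grpD"] - 2 * V["grpB", "grpD"])
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Cross-check: with D as the reference, the B coefficient is the same contrast
grp_refD <- relevel(grp, ref = "D")
fit_refD <- lm(y ~ x + grp_refD)
check <- summary(fit_refD)$coefficients["grp_refDB", c("Estimate", "Std. Error")]

print(c(estimate = est, se = se, t = t_stat, p = p_val))
print(check)
```

Both routes give the same estimate and standard error, which is why changing the reference group is often the simplest way to obtain a particular pairwise comparison.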

3.5 Visualizing the Parallel Lines

pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Education",
    subtitle = "Parallel lines: same slopes for age, different intercepts by education",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Predicted Mental Health Days by Age and Education Level

These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.

4. Changing the Reference Group

4.1 Using `relevel()` in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.

# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")

mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
                     data = brfss_dv)

tidy(mod_educ_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: College graduate)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: College graduate)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.9948	0.6272	15.9349	0.0000	8.7652	11.2245
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
education_rerefLess than HS	1.1429	0.4607	2.4805	0.0132	0.2396	2.0461
education_rerefHS graduate	0.5556	0.2574	2.1586	0.0309	0.0510	1.0601
education_rerefSome college	1.0139	0.2566	3.9507	0.0001	0.5108	1.5171

4.2 What Changes and What Stays the Same?

tribble(
  ~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
  "Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
  "Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
  "Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
  "Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
  "Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
  "R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Less than HS	Ref: College graduate
Intercept	11.1380	9.9950
Age coefficient	-0.0770	-0.0770
Sex coefficient	1.6810	1.6810
Physical health days	0.3110	0.3110
Sleep hours	-0.6280	-0.6280
R-squared	0.1553	0.1553
Residual SE	7.2690	7.2690

What changes:

The intercept
The education dummy variable coefficients (they now represent comparisons to the new reference group)

What stays the same:

Coefficients for age, sex, physical health days, and sleep
\(R^2\), Adjusted \(R^2\), and Residual Standard Error
All predicted values
The overall F-test

This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.

# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

5. The Dummy Variable Trap (Perfect Multicollinearity)

5.1 Why We Cannot Include All k Dummies

If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:

\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]

where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.

This is called the dummy variable trap.

The Dummy Variable Trap: Intercept = A + B + C (perfect linear dependence)
Obs	Intercept	A	B	C	A + B + C
1	1	1	0	0	1
2	1	0	1	0	1
3	1	0	0	1	1
4	1	1	0	0	1
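The linear dependence in the table can be verified numerically. The sketch below rebuilds the four-observation design matrix with an intercept and all three indicators, checks its rank, and then shows what lm() does in practice when given such a design. The outcome y is random noise, used only so that a model can be fit (an assumption for illustration).

```r
# Four observations in groups A, B, C, A, as in the table above
grp <- factor(c("A", "B", "C", "A"))

D      <- model.matrix(~ 0 + grp)     # all k = 3 indicator columns
X_trap <- cbind(Intercept = 1, D)     # intercept plus all k indicators

all(rowSums(D) == 1)                  # indicators sum to the intercept column
rank_trap <- qr(X_trap)$rank          # numerical rank of the design matrix
c(columns = ncol(X_trap), rank = rank_trap)

# lm() adds its own intercept, so y ~ D reproduces the trap
set.seed(3)
y <- rnorm(4)
fit_trap <- lm(y ~ D)
coef(fit_trap)
```

The matrix has four columns but rank three. Rather than failing, lm() detects the redundancy and reports NA for one of the aliased coefficients, which is a useful warning sign that a full set of dummies has been entered alongside an intercept.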

Solutions:

Reference cell coding (default in R): Include \(k - 1\) dummies; the omitted category is absorbed into the intercept
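Remove the intercept: Fit the model with - 1 (or + 0) in the formula so that all \(k\) dummies of a factor can be included. Each of those coefficients is then that group's own intercept (its expected outcome when the other predictors are zero or at their reference levels) rather than a difference from a reference. R gives the full set of \(k\) dummies only to the first factor in the formula; any later factor still gets \(k - 1\).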

# Model without intercept. R expands only the FIRST factor in the formula
# (here sex) into all of its dummies; education still gets k - 1 dummies.
# Listing education before sex would give all k education dummies instead.
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
                 data = brfss_dv)

tidy(mod_no_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model Without Intercept: First Factor (Sex) Expanded to All Levels",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Without Intercept: First Factor (Sex) Expanded to All Levels
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexMale	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
sexFemale	12.8190	0.7524	17.0365	0.0000	11.3439	14.2941
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

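Caution: Without an intercept, R computes \(R^2\) relative to a baseline of zero rather than the mean of the outcome, so it is not comparable to the \(R^2\) of a model with an intercept. Remove the intercept only when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.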
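Which factor receives all of its dummies in a no-intercept model depends on the order of terms in the formula. The following sketch uses simulated data (an assumption; toy, sex, and edu are hypothetical names) to show the coefficient names under two orderings and how the group-specific intercepts relate to the default coding.

```r
# Simulated data (assumption): one covariate and two factors
set.seed(11)
n <- 300
toy <- data.frame(
  x   = rnorm(n),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female")),
  edu = factor(sample(c("Low", "Mid", "High"), n, replace = TRUE),
               levels = c("Low", "Mid", "High"))
)
toy$y <- 2 + toy$x + rnorm(n)

fit_default   <- lm(y ~ x + sex + edu,     data = toy)  # reference cell coding
fit_sex_first <- lm(y ~ 0 + x + sex + edu, data = toy)  # sex is the first factor
fit_edu_first <- lm(y ~ 0 + edu + x + sex, data = toy)  # edu is the first factor

names(coef(fit_sex_first))  # both sex dummies, k - 1 edu dummies
names(coef(fit_edu_first))  # all k edu dummies, k - 1 sex dummies

# Each edu coefficient is now a group-specific intercept:
# reference-coded intercept plus that group's dummy coefficient
coef(fit_edu_first)["eduMid"]
coef(fit_default)["(Intercept)"] + coef(fit_default)["eduMid"]
```

All three fits produce the same fitted values; only the parameterization differs.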

6. Testing Whether a Categorical Variable is Significant

6.1 The Partial F-Test (Type I and Type III)

When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.

To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):

\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]

This compares the full model (with education) to a reduced model (without education):

# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

# Partial F-test
f_test <- anova(mod_reduced, mod_educ)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does Education Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does Education Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education	4992	263744.4	3	970.7509	6.1246	4e-04
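The F statistic in this table can be reproduced by hand from the standard extra-sum-of-squares formula, \(F = \frac{(RSS_{reduced} - RSS_{full}) / q}{RSS_{full} / df_{full}}\), where \(q\) is the number of dummies being tested. The sketch below plugs in the values reported in the table; the variable names are illustrative.

```r
# Values taken from the partial F-test table above
extra_ss <- 970.7509   # RSS(reduced) - RSS(full): the "sumsq" column
rss_full <- 263744.4   # residual sum of squares of the full model
q        <- 3          # number of education dummies tested (k - 1)
df_full  <- 4992       # residual degrees of freedom of the full model

# F = (extra SS / q) / (RSS_full / df_full)
f_stat <- (extra_ss / q) / (rss_full / df_full)

# Upper-tail probability from the F distribution with (q, df_full) df
p_val <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(round(f_stat, 4))
print(signif(p_val, 1))
```

The result agrees with the statistic and p-value in the table, so the three education dummies jointly improve the fit even though two of them are not individually significant.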

6.2 Using `car::Anova()` for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:

Anova(mod_educ, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12000.1867	1	227.1325	0e+00
age	8863.3522	1	167.7603	0e+00
sex	3469.6448	1	65.6714	0e+00
physhlth_days	28765.1139	1	544.4492	0e+00
sleep_hrs	3506.1243	1	66.3619	0e+00
education	970.7509	3	6.1246	4e-04
Residuals	263744.4348	4992	NA	NA

Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
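The order dependence of sequential sums of squares is easy to see on a small simulated example (an assumption; a and b are hypothetical correlated predictors). Base R's anova() gives sequential sums of squares, and for a main-effects model like this one, drop1() gives the "each term last" sums of squares that a Type III test reports.

```r
# Simulated data (assumption): two correlated predictors
set.seed(5)
n <- 200
a <- rnorm(n)
b <- 0.7 * a + rnorm(n)
y <- 1 + a + b + rnorm(n)

# Sequential (Type I) tables under the two possible orderings
seq_a_first <- anova(lm(y ~ a + b))
seq_b_first <- anova(lm(y ~ b + a))

ss_a_entered_first <- seq_a_first["a", "Sum Sq"]  # a alone
ss_a_entered_last  <- seq_b_first["a", "Sum Sq"]  # a after adjusting for b

# Partial sums of squares: each term tested after all the others
partial <- drop1(lm(y ~ a + b), test = "F")

print(c(first = ss_a_entered_first, last = ss_a_entered_last))
print(partial)
```

The sum of squares credited to a is much larger when it enters first, because it then absorbs the variation it shares with b. The value for a entered last matches the drop1() row, which is the quantity that does not depend on ordering.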

7. Contrasts and Alternative Coding Schemes

7.1 Reference (Treatment) Coding — The Default

This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.

contrasts(brfss_dv$education)

##                  HS graduate Some college College graduate
## Less than HS               0            0                0
## HS graduate                1            0                0
## Some college               0            1                0
## College graduate           0            0                1

7.2 Effect (Deviation) Coding

In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.

# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)

mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
                 data = brfss_dv)

tidy(mod_effect, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
      str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
      str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Effect Coding: Each Education Coefficient vs. Grand Mean
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Less than HS vs. Grand Mean	0.4648	0.3323	1.3988	0.1619	-0.1866	1.1162
Education: HS graduate vs. Grand Mean	-0.1225	0.1939	-0.6319	0.5275	-0.5026	0.2576
Education: Some college vs. Grand Mean	0.3358	0.1946	1.7257	0.0845	-0.0457	0.7174

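With effect coding, the intercept is the unweighted average of the four education groups' adjusted intercepts (the covariate-adjusted grand mean), and each education coefficient shows how far that group deviates from it. The omitted group's deviation is the negative sum of the others: for College graduate, \(-(0.4648 - 0.1225 + 0.3358) = -0.6781\).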
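The omitted group's deviation can be recovered with a few lines of arithmetic. This sketch prints the contr.sum(4) matrix and then uses the three deviations reported in the effect-coding table, cross-checking the result against the intercepts reported in the earlier tables (9.9948 with College graduate as the reference, 10.6729 under effect coding).

```r
# Effect-coding matrix for 4 levels: the last level is coded -1 in every column
C_sum <- contr.sum(4)
print(C_sum)

# Deviations reported in the effect-coding table above
dev_reported <- c("Less than HS" = 0.4648,
                  "HS graduate"  = -0.1225,
                  "Some college" = 0.3358)

# Deviations sum to zero across all k groups, so the omitted one is the negative sum
dev_college <- -sum(dev_reported)

# Cross-check: College-graduate intercept minus the effect-coded intercept
check <- 9.9948 - 10.6729

print(c(from_deviations = dev_college, from_intercepts = check))
```

Both calculations give the same value up to rounding, confirming that no information about the omitted group is lost.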

7.3 Polynomial (Orthogonal) Contrasts for Ordinal Variables

When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.

# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)

mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
              data = brfss_dv)

tidy(mod_ord, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "\\.L$") ~ "Education: Linear trend",
      str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
      str_detect(term, "\\.C$") ~ "Education: Cubic trend",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Linear trend	-0.6642	0.3158	-2.1028	0.0355	-1.2833	-0.0450
Education: Quadratic trend	-0.2133	0.2682	-0.7954	0.4264	-0.7391	0.3125
Education: Cubic trend	-0.5630	0.2142	-2.6282	0.0086	-0.9830	-0.1431

Interpretation:

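The linear contrast tests whether there is a consistent upward or downward trend across the ordered levels. Here the estimate is negative (-0.6642), so mentally unhealthy days tend to decrease as education increases.
The quadratic contrast tests whether the relationship curves (U-shaped or inverted-U)
The cubic contrast tests for an S-shaped pattern

Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference. Note that contr.poly treats the levels as equally spaced by default, so the trends are defined on the level order, not on any underlying scale such as years of schooling.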
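To see what these trend contrasts look like numerically, the contrast matrix itself can be inspected. The sketch below uses only base R and needs no data; the object name `poly4` is introduced here for illustration.

```r
# Contrast matrix R uses for a 4-level factor under polynomial coding
poly4 <- contr.poly(4)
round(poly4, 3)

# Each column sums to zero, so every trend is a contrast among the group means
colSums(poly4)

# Columns are mutually orthogonal with unit length (identity cross-product)
round(crossprod(poly4), 10)

# The default scores are 1, 2, 3, 4, i.e. equally spaced levels
all.equal(poly4, contr.poly(4, scores = 1:4))
```

The linear column increases steadily from the first level to the last, which is why a negative linear coefficient corresponds to an outcome that falls as the level rises.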

7.4 Coding Scheme Comparison Summary

Summary of Dummy Variable Coding Schemes
Coding Scheme	R Function	Intercept	Each β represents	Best for
Treatment (Reference)	contr.treatment (default)	Reference group mean	Difference from reference group	Group comparisons to baseline
Effect (Deviation)	contr.sum	Grand mean	Deviation from grand mean	ANOVA-style analyses
Polynomial (Ordinal)	contr.poly	Grand mean	Linear/quadratic/cubic trend	Ordinal variables with ordered levels

8. Practical Considerations

8.1 Choosing the Reference Group

Guidelines for choosing the reference group:

Most common category — maximizes the precision of comparisons (largest reference sample)
Control or baseline condition — natural in experimental or clinical settings
Epidemiologically meaningful comparator — the group to which you want to compare all others
Consistency with published literature — facilitates comparison across studies
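The first guideline can be applied programmatically. The sketch below uses a small made-up factor called `status` (an assumption for illustration, not a lecture variable) and sets the most frequent category as the reference with base R.

```r
# Hypothetical factor; in the lecture data this would be a column such as education
status <- factor(c("Married", "Divorced", "Married", "Widowed", "Married", "Divorced"))
levels(status)[1]                      # default reference: first level alphabetically

# Find the most frequent category and make it the reference level
most_common <- names(which.max(table(status)))
status_ref  <- relevel(status, ref = most_common)
levels(status_ref)[1]                  # reference is now the most frequent category
```

In a tidyverse workflow, forcats::fct_infreq() reorders levels by frequency and would give the same reference level; whichever approach is used, the choice should still be checked against the other guidelines above.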

8.2 When `as.factor()` Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:

# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)

# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)

# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
  ~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
  "Numeric (wrong)", 1, 1,
  "Factor (correct)", 3, 3
) |>
  kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Numeric vs. Factor Treatment of Categorical Variables
Model	Number of education coefficients	Degrees of freedom used
Numeric (wrong)	1	1
Factor (correct)	3	3
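The coefficient counts in the table can be reproduced, and the equal-spacing assumption tested, on simulated data. The sketch below does not use the BRFSS file; `toy`, `educ_code`, and `days` are hypothetical stand-ins.

```r
# Simulated stand-in for the lecture data; `educ_code` and `days` are hypothetical
set.seed(2020)
toy <- data.frame(educ_code = sample(1:4, 200, replace = TRUE))
toy$days <- rnorm(200, mean = c(6, 4, 4.5, 3)[toy$educ_code], sd = 2)

fit_numeric <- lm(days ~ educ_code, data = toy)          # one slope
fit_factor  <- lm(days ~ factor(educ_code), data = toy)  # k - 1 = 3 dummies

length(coef(fit_numeric)) - 1   # education coefficients in the numeric model
length(coef(fit_factor)) - 1    # education coefficients in the factor model

# The numeric model is nested in the factor model, so an F-test can check
# whether the equal-spacing (linear) restriction is tenable
anova(fit_numeric, fit_factor)
```

A small p-value in the final comparison would suggest that a single slope does not describe the category differences adequately.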

8.3 Comparing Non-Reference Groups

What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:

Option 1: Change the reference group with relevel().

Option 2: Compute the difference manually from the model output.

# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]

# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")

cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")

## Estimated difference (HS grad - College grad): 0.556 days

cat("F-statistic:", round(lin_test$F[2], 3), "\n")

## F-statistic: 4.66

cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## p-value: 0.0309

car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
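Option 1 gives the same estimate as the manual difference in Option 2. The sketch below illustrates this with simulated data; `edu`, `edu_college`, `y`, and the group effects are hypothetical and are not the lecture's `mod_educ`.

```r
# Simulated data; variable names and group effects are assumptions for illustration
set.seed(553)
lv  <- c("Less than HS", "HS graduate", "Some college", "College graduate")
edu <- factor(sample(lv, 300, replace = TRUE), levels = lv)
y   <- rnorm(300, mean = c(5, 4.5, 4.8, 4)[as.integer(edu)], sd = 3)

fit_default <- lm(y ~ edu)                       # reference: Less than HS

# Option 1: make College graduate the reference and refit
edu_college <- relevel(edu, ref = "College graduate")
fit_college <- lm(y ~ edu_college)
diff_relevel <- coef(fit_college)[["edu_collegeHS graduate"]]

# Option 2: difference of two coefficients from the default model
diff_manual <- coef(fit_default)[["eduHS graduate"]] -
  coef(fit_default)[["eduCollege graduate"]]

all.equal(diff_relevel, diff_manual)

# After releveling, the SE, t, and p-value for this comparison come for free
summary(fit_college)$coefficients["edu_collegeHS graduate", ]
```

For a single comparison like this one, the F-statistic from linearHypothesis() is the square of the t-statistic reported after releveling, so the two options lead to the same p-value.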

Summary of Key Concepts

Concept	Key Point
Categorical predictors	Should not be entered as raw numeric codes; doing so assumes equally spaced categories
Dummy variables	Binary (0/1) indicators; need \(k - 1\) for \(k\) categories
Reference group	The omitted category; all comparisons are relative to it
Changing reference	Use `relevel()`; predictions unchanged, interpretation changes
Partial F-test	Tests whether the categorical variable as a whole is significant
Dummy variable trap	Including \(k\) dummies + intercept = perfect multicollinearity
`as.factor()`	Required when categorical variable is stored as numeric
Coding schemes	Treatment (default), effect, polynomial — each answers a different question
Type III ANOVA	Preferred for unbalanced observational data
Linear hypothesis	`linearHypothesis()` tests comparisons between non-reference groups

Part 2: In-Class Lab Activity

EPI 553: Dummy Variables Lab

Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv_2020 <- readRDS("~/Downloads/epi552/brfss_dv_2020.rds")

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv_2020 |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "Health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1: Descriptive Statistics for Lab Variables**")

**Table 1: Descriptive Statistics for Lab Variables**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
¹ Mean (SD); n (%)

The sample included 5,000 individuals with a mean age of 54.9 years (SD = 17.5). On average, participants reported 3.8 mentally unhealthy days (SD = 7.9) in the past 30 days. The sample was 54% female and 46% male. Most participants reported very good (36%) or good (29%) general health, while fewer reported fair (10%) or poor (3.7%) health. Over half of the sample (54%) were married.

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv_2020, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  labs(
    title = "Mentally Unhealthy Days by Health Status",
    x = "Health Status",
    y = "Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

On average, individuals who reported poor general health had the highest number of mentally unhealthy days. The median and spread of mentally unhealthy days increase as general health worsens from excellent to poor. This pattern is consistent with what we would expect, as individuals with worse overall health tend to experience more mental health challenges.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

marital_means <- brfss_dv_2020 |>
  group_by(marital_status) |>
  summarise(
    mean_menthlth = mean(menthlth_days, na.rm = TRUE),
    .groups = "drop"
  )

marital_means |>
  kable(
    caption = "Mean Mentally Unhealthy Days by Marital Status",
    digits = 2
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Mentally Unhealthy Days by Marital Status
marital_status	mean_menthlth
Married	3.10
Divorced	4.49
Widowed	2.67
Separated	6.22
Never married	5.28
Unmarried couple	6.07

ggplot(marital_means, aes(x = marital_status, y = mean_menthlth, fill = marital_status)) +
  geom_col(alpha = 0.85) +
  labs(
    title = "Mean Mentally Unhealthy Days by Marital Status",
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

On average, individuals who were separated reported the highest number of mentally unhealthy days (6.22), while those who were widowed reported the lowest number of mentally unhealthy days (2.67). This suggests that marital status may be associated with differences in mental health, with separated individuals experiencing more mentally unhealthy days compared to other groups.

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

brfss_dv_2020 <- brfss_dv_2020 |>
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good"      ~ 3,
      gen_health == "Fair"      ~ 4,
      gen_health == "Poor"      ~ 5,
      TRUE ~ NA_real_
    )
  )

mod_gen_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv_2020)

tidy(mod_gen_naive, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Health Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Health Treated as Continuous
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-0.6718	0.2705	-2.4840	0.013	-1.2021	-0.1416
gen_health_numeric	1.8578	0.1036	17.9259	0.000	1.6547	2.0610

The coefficient for gen_health_numeric is 1.8578. On average, each one-unit increase in the general health code (that is, each step toward worse health) is associated with 1.86 more mentally unhealthy days. The model therefore assumes that every one-category step (e.g., Excellent to Very good, or Good to Fair) carries the same 1.86-day increase.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

mod_gen_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv_2020)

tidy(mod_gen_factor, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Correct Model: Health Treated as a Factor",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Correct Model: Health Treated as a Factor
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

The factor model uses 4 coefficients instead of 1 because gen_health has 5 categories, and one of them (Excellent) serves as the reference group. Each of the remaining 4 categories gets its own dummy variable, and its coefficient is the difference from Excellent. Unlike the naive model, the factor model allows each category of general health to have its own association with mentally unhealthy days. The numeric model instead assumes that every one-step change in general health has the same effect, which amounts to assuming the categories are equally spaced. The factor model shows that this assumption does not hold here: the estimated differences grow much larger for the worse health categories.
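The unequal spacing can be made explicit by computing the step between adjacent categories from the factor-model estimates. The sketch below simply copies the rounded coefficients from the two tables above; the object names are introduced here for illustration.

```r
# Coefficients copied from the factor-model table (Excellent is the reference, so 0)
factor_coefs <- c("Excellent" = 0, "Very good" = 0.5903, "Good" = 1.9535,
                  "Fair" = 5.0624, "Poor" = 9.6640)
naive_slope  <- 1.8578   # single slope from the numeric model

# Estimated change in mentally unhealthy days for each one-category step
step_sizes <- diff(factor_coefs)
round(step_sizes, 2)

# How far each step is from the constant step the numeric model assumes
round(step_sizes - naive_slope, 2)
```

The steps are roughly 0.59, 1.36, 3.11, and 4.60 days, so the single slope of 1.86 overstates the difference at the healthy end of the scale and understates it at the unhealthy end.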

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

mod_gen_full <- lm(
  menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
  data = brfss_dv_2020
)

tidy(mod_gen_full, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Full Model with Health Dummy Variables",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Full Model with Health Dummy Variables
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

coef(mod_gen_full)

##         (Intercept)                 age           sexFemale       physhlth_days 
##           9.5929825          -0.0866721           1.7253789           0.2314200 
##           sleep_hrs gen_healthVery good      gen_healthGood      gen_healthFair 
##          -0.5865953           0.7899467           1.8436011           3.3952833 
##      gen_healthPoor 
##           5.3353468

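\[
\widehat{\text{menthlth\_days}} = 9.5930 - 0.0867(\text{age}) + 1.7254(\text{Female}) + 0.2314(\text{physhlth\_days}) - 0.5866(\text{sleep\_hrs}) + 0.7899(\text{Very good}) + 1.8436(\text{Good}) + 3.3953(\text{Fair}) + 5.3353(\text{Poor})
\]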
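As an illustration of how the fitted equation is used, the sketch below plugs in values for one hypothetical respondent. The respondent's characteristics are assumptions chosen for the example, and the coefficients are the rounded estimates shown above.

```r
# Rounded coefficients from the fitted model above
b <- c(intercept = 9.5930, age = -0.0867, female = 1.7254,
       physhlth_days = 0.2314, sleep_hrs = -0.5866,
       very_good = 0.7899, good = 1.8436, fair = 3.3953, poor = 5.3353)

# Hypothetical respondent: 50-year-old female, 0 physically unhealthy days,
# 7 hours of sleep, "Good" general health (only the Good dummy equals 1)
x <- c(intercept = 1, age = 50, female = 1, physhlth_days = 0, sleep_hrs = 7,
       very_good = 0, good = 1, fair = 0, poor = 0)

predicted_days <- sum(b * x)
round(predicted_days, 2)
```

Setting the `good` entry of `x` to 0 gives the prediction for an otherwise identical respondent in Excellent health, and the two predictions differ by exactly the Good coefficient.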

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

Very good: On average, individuals with very good health report 0.79 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Good: On average, individuals with good health report 1.84 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Fair: On average, individuals with fair health report 3.40 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Poor: On average, individuals with poor health report 5.34 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.

The group that differs the most from the reference group is the poor health group. This group has the largest estimated coefficient, indicating the greatest increase in mentally unhealthy days compared to those with excellent health.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

gen_coef_plot <- tidy(mod_gen_full, conf.int = TRUE) |>
  filter(str_detect(term, "gen_health")) |>
  mutate(
    term = case_when(
      term == "gen_healthVery good" ~ "Very good vs Excellent",
      term == "gen_healthGood"      ~ "Good vs Excellent",
      term == "gen_healthFair"      ~ "Fair vs Excellent",
      term == "gen_healthPoor"      ~ "Poor vs Excellent",
      TRUE ~ term
    )
  )
ggplot(gen_coef_plot, aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    title = "Health Dummy Variable Coefficients",
    x = "Estimated Coefficient",
    y = ""
  ) +
  theme_minimal(base_size = 13)

The group that differs the most from the reference group is the poor health group. On average, individuals with poor health have the largest increase in mentally unhealthy days compared to those with excellent health, as shown by the largest estimated coefficient and confidence interval farthest from zero.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_dv_2020 <- brfss_dv_2020 |>
  mutate(gen_health_goodref = relevel(gen_health, ref = "Good"))

mod_gen_goodref <- lm(
  menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_goodref,
  data = brfss_dv_2020
)

tidy(mod_gen_goodref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Full Model with Good as the Reference Group",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Full Model with Good as the Reference Group
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_health_goodrefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_health_goodrefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_health_goodrefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_health_goodrefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

4b. (5 pts) Compare the coefficients for the other predictors (age, sex, physhlth_days, and sleep_hrs) between the two models (original reference vs. new reference). Are they the same? Why or why not?

tribble(
  ~Term, ~Original_Reference, ~Good_Reference,
  "Age", coef(mod_gen_full)["age"], coef(mod_gen_goodref)["age"],
  "SexFemale", coef(mod_gen_full)["sexFemale"], coef(mod_gen_goodref)["sexFemale"],
  "Physical health days", coef(mod_gen_full)["physhlth_days"], coef(mod_gen_goodref)["physhlth_days"],
  "Sleep hours", coef(mod_gen_full)["sleep_hrs"], coef(mod_gen_goodref)["sleep_hrs"]
) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Comparison of Non-General Health Coefficients Across Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Non-General Health Coefficients Across Reference Groups
Term	Original_Reference	Good_Reference
Age	-0.0867	-0.0867
SexFemale	1.7254	1.7254
Physical health days	0.2314	0.2314
Sleep hours	-0.5866	-0.5866

The coefficients for age, sex, physical health days, and sleep hours are identical in both models. Changing the reference group of a categorical variable only re-expresses that variable's dummy coefficients and the intercept; the two models contain the same information and have the same fit. The coefficients for the other predictors are therefore unchanged, as is the overall model fit. Only the comparisons among general health categories are now expressed relative to a different baseline.
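The link between the two sets of general health coefficients can be verified by hand: every comparison shifts by the coefficient of the new reference group. The sketch below uses the rounded estimates from the two tables above; the object names are introduced for illustration.

```r
# Rounded estimates from the model with Excellent as the reference
intercept_excellent <- 9.5930
coefs_excellent_ref <- c("Excellent" = 0, "Very good" = 0.7899, "Good" = 1.8436,
                         "Fair" = 3.3953, "Poor" = 5.3353)

# Switching the reference to Good shifts every comparison by the Good coefficient
shift          <- coefs_excellent_ref[["Good"]]
intercept_good <- intercept_excellent + shift
coefs_good_ref <- coefs_excellent_ref - shift

intercept_good    # compare with the intercept in the Good-reference table
coefs_good_ref    # compare with the gen_health_goodref rows
```

The results match the refitted model (intercept 11.4366; Excellent -1.8436, Very good -1.0537, Fair 1.5517, Poor 3.4917), which confirms that the second model is a re-expression of the first.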

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_orig <- predict(mod_gen_full)
pred_goodref <- predict(mod_gen_goodref)

cor(pred_orig, pred_goodref)

## [1] 1

max(abs(pred_orig - pred_goodref))

## [1] 3.552714e-15

The correlation between the predicted values from the two models is 1, and the largest absolute difference between them is about 3.6 × 10^-15, which is floating-point rounding error, so the predictions are identical. Changing the reference group does not change the predicted values because the two models are the same model written in two ways (a reparameterization). Each general health group still gets its own fitted level; only the baseline against which the coefficients are expressed changes.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

mod_gen_reduced <- lm(
  menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
  data = brfss_dv_2020
)

tibble(
  Model = c("Reduced model", "Full model"),
  R_squared = c(summary(mod_gen_reduced)$r.squared, summary(mod_gen_full)$r.squared),
  Adjusted_R_squared = c(summary(mod_gen_reduced)$adj.r.squared, summary(mod_gen_full)$adj.r.squared)
) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Model Fit Statistics: Reduced vs Full Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Fit Statistics: Reduced vs Full Model
Model	R_squared	Adjusted_R_squared
Reduced model	0.1522	0.1515
Full model	0.1694	0.1681

The reduced model has \(R^2 = 0.1522\) and adjusted \(R^2 = 0.1515\); the full model has \(R^2 = 0.1694\) and adjusted \(R^2 = 0.1681\). \(R^2\) can never decrease when predictors are added, so the more informative comparison is adjusted \(R^2\), which also rises (from 0.1515 to 0.1681). Adding general health therefore improves model fit even after accounting for the four extra parameters.

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

anova(mod_gen_reduced, mod_gen_full)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4995 264715                                  
## 2   4991 259335  4    5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H_0: \beta_{\text{Very good}} = \beta_{\text{Good}} = \beta_{\text{Fair}} = \beta_{\text{Poor}} = 0\)

\(H_A\): At least one of these coefficients is not equal to 0.

F-statistic = 25.88 with 4 and 4,991 degrees of freedom; \(p < 2.2 \times 10^{-16}\).

Since the p-value is far below 0.05, we reject the null hypothesis. General health, taken as a set of four dummy variables, significantly improves the model: it is significantly associated with mentally unhealthy days, holding all other variables constant.
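The F-statistic reported by anova() can be reproduced by hand from the residual sums of squares in the output above. The sketch below uses the rounded RSS values as printed, so the result agrees with the table only up to rounding; the object names are introduced for illustration.

```r
# Values read from the anova() output above (RSS values are rounded as printed)
rss_reduced <- 264715   # model without gen_health
rss_full    <- 259335   # model with gen_health
df_num      <- 4        # number of gen_health dummy variables being tested
df_den      <- 4991     # residual degrees of freedom of the full model

# Partial F = (drop in RSS per added parameter) / (residual variance of full model)
f_stat <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
f_stat

# Upper-tail probability from the F distribution
p_value <- pf(f_stat, df1 = df_num, df2 = df_den, lower.tail = FALSE)
p_value
```

The hand calculation gives an F-statistic of about 25.88, in line with the anova() table, and a p-value far below any conventional threshold.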

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(mod_gen_full, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA for Full Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA for Full Model
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

The Type III ANOVA result for gen_health shows an F-statistic of 25.88 on 4 and 4,991 degrees of freedom, with a p-value that rounds to 0 at four decimal places (p < 0.001). This matches the partial F-test (same sum of squares, about 5,379.8, and same F), because both compare the full model with the model that omits gen_health. Both tests indicate that general health, as a set of dummy variables, significantly improves the model and is significantly associated with mentally unhealthy days, holding all other variables constant.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

People who reported fair or poor health had noticeably more mentally unhealthy days compared to those with excellent health, with those in poor health having the largest difference. On average, individuals in poorer health reported about 3 to 5 more mentally unhealthy days in the past month. This shows that worse overall health is linked to worse mental health. However, because the data were collected at one point in time, we cannot say that general health causes changes in mental health.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 23, 2026