Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Store the raw education numeric code for the "naive approach" demonstration
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_dv,
  "brfss_dv_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
1Mean (SD); n (%)

ggplot(brfss_dv, aes(x = education, fill = education)) +
  geom_bar(alpha = 0.85) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Education Level in Analytic Sample

ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Education Level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Education Level",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by Education Level

Part 1: Guided Practice — Dummy Variables

1. Categorical Variables: The Problem

1.1 Types of Categorical Variables

Categorical predictor variables come in two forms:

Type	Definition	Examples
Nominal	Categories with no natural ordering	Sex, race/ethnicity, marital status, blood type
Ordinal	Categories with a meaningful order	Education level, income bracket, disease stage, Likert scale

A further distinction is:

Dichotomous (k = 2): Only two categories (e.g., sex: Male/Female)
Multichotomous (k > 2): Three or more categories (e.g., education: 4 levels)

Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.

1.2 The Naive Approach: Why Numeric Codes Fail

Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.

If we include this numeric code directly in a regression model, we are assuming:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]

This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.

# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Education Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Education Treated as Continuous
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	9.5601	0.5039	18.9727	8.5723	10.5479
age	-0.0661	0.0063	-10.5135	-0.0784	-0.0538
educ_numeric	-0.7168	0.1158	-6.1917	-0.9437	-0.4898

This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:

For nominal variables (like race or marital status), numeric codes are entirely arbitrary. The “distance” between code 1 and code 2 has no meaning.
Even for ordinal variables (like education), the assumption of equal spacing is rarely justified. The jump from “Less than HS” to “HS graduate” is substantively different from “Some college” to “College graduate.”

Let’s visualize why this matters:

# Compute observed group means
group_means <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))

# Generate predictions from the naive model
pred_naive <- tibble(
  educ_numeric = 1:4,
  predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)

ggplot() +
  geom_point(data = group_means,
             aes(x = educ_numeric, y = mean_days),
             size = 4, color = "steelblue") +
  geom_line(data = pred_naive,
            aes(x = educ_numeric, y = predicted),
            color = "tomato", linewidth = 1.2, linetype = "dashed") +
  geom_point(data = pred_naive,
             aes(x = educ_numeric, y = predicted),
             size = 3, color = "tomato", shape = 17) +
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
  ) +
  labs(
    title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
    subtitle = "The naive model forces equal spacing between education levels",
    x = "Education Level",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Naive Linear Fit vs. Actual Group Means by Education

Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.

2. Dummy (Indicator) Variables

2.1 Definition and the k - 1 Rule

A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:

1 to indicate the presence of a condition
0 to indicate its absence

If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).

Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.

2.2 The Dichotomous Case (k = 2)

The simplest example is a variable with two categories, such as sex.

With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:

\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]

The regression model becomes:

\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]

For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]

For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]

Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.

# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)

tidy(mod_sex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: Sex",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: Sex
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	6.6262	0.3730	17.7666	5.8951	7.3574
age	-0.0698	0.0063	-11.1011	-0.0821	-0.0575
sexFemale	1.8031	0.2210	8.1585	1.3698	2.2364

b_sex <- round(coef(mod_sex), 3)

Interpretation:

Intercept (6.626): The estimated mean number of mentally unhealthy days for a male of age 0. (This is a mathematical artifact, not a substantive value.)
Age (-0.07): Each additional year of age is associated with a 0.07 day change in mentally unhealthy days, holding sex constant.
Sex: Female (1.803): Compared to males (the reference group), females report an estimated 1.803 more mentally unhealthy days on average, holding age constant.

Note that R automatically creates dummy variables when a factor is included in lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.

pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))

ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Sex",
    subtitle = "Parallel lines: same slope, different intercepts",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Parallel Regression Lines: Males vs. Females

Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.

3. Multichotomous Dummy Variables (k > 2)

3.1 Constructing the Dummies

Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:

\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]

The data matrix looks like this:

Dummy Variable Encoding for Education (Reference: Less than HS)
Observation	Education	HS_graduate	Some_college	College_graduate
1	Less than HS	0	0	0
2	HS graduate	1	0	0
3	Some college	0	1	0
4	College graduate	0	0	1
5	Less than HS	0	0	0

Notice that the reference group is identified by having all dummy variables equal to zero.

3.2 The Reference Group

The reference group is the category against which all others are compared. Key points:

The choice of reference group affects the interpretation of coefficients but not the model’s predictive ability
The predicted values are identical regardless of which reference group is chosen
Choose the reference group based on your research question (e.g., the most common category, the “control” condition, or the group of primary interest)
You can always test differences between any pair of groups, not just comparisons to the reference

3.3 Fitting the Model in R

When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.

# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
               data = brfss_dv)

tidy(mod_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Education Dummy Variables (Reference: Less than HS)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Education Dummy Variables (Reference: Less than HS)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

b_educ <- round(coef(mod_educ), 3)
ci_educ <- round(confint(mod_educ), 3)

3.4 Interpreting Each Dummy Coefficient

The model is:

\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]

Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:

HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.

Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).

3.5 Visualizing the Parallel Lines

pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Education",
    subtitle = "Parallel lines: same slopes for age, different intercepts by education",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Predicted Mental Health Days by Age and Education Level

These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.

4. Changing the Reference Group

4.1 Using `relevel()` in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.

# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")

mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
                     data = brfss_dv)

tidy(mod_educ_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: College graduate)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: College graduate)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.9948	0.6272	15.9349	0.0000	8.7652	11.2245
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
education_rerefLess than HS	1.1429	0.4607	2.4805	0.0132	0.2396	2.0461
education_rerefHS graduate	0.5556	0.2574	2.1586	0.0309	0.0510	1.0601
education_rerefSome college	1.0139	0.2566	3.9507	0.0001	0.5108	1.5171

4.2 What Changes and What Stays the Same?

tribble(
  ~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
  "Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
  "Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
  "Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
  "Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
  "Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
  "R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Less than HS	Ref: College graduate
Intercept	11.1380	9.9950
Age coefficient	-0.0770	-0.0770
Sex coefficient	1.6810	1.6810
Physical health days	0.3110	0.3110
Sleep hours	-0.6280	-0.6280
R-squared	0.1553	0.1553
Residual SE	7.2690	7.2690

What changes:

The intercept
The education dummy variable coefficients (they now represent comparisons to the new reference group)

What stays the same:

Coefficients for age, sex, physical health days, and sleep
\(R^2\), Adjusted \(R^2\), and Residual Standard Error
All predicted values
The overall F-test

This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.

# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

5. The Dummy Variable Trap (Perfect Multicollinearity)

5.1 Why We Cannot Include All k Dummies

If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:

\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]

where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.

This is called the dummy variable trap.

The Dummy Variable Trap: Intercept = A + B + C (perfect linear dependence)
Obs	Intercept	A	B	C	A + B + C
1	1	1	0	0	1
2	1	0	1	0	1
3	1	0	0	1	1
4	1	1	0	0	1

Solutions:

Reference cell coding (default in R): Include \(k - 1\) dummies; the omitted category is absorbed into the intercept
Remove the intercept: Fit the model with - 1 in the formula and include all \(k\) dummies. Then each coefficient is the group mean (adjusted for other predictors) rather than a difference from a reference.

# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
                 data = brfss_dv)

tidy(mod_no_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model Without Intercept: All k Education Dummies Included",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Without Intercept: All k Education Dummies Included
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexMale	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
sexFemale	12.8190	0.7524	17.0365	0.0000	11.3439	14.2941
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.

6. Testing Whether a Categorical Variable is Significant

6.1 The Partial F-Test (Type I and Type III)

When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.

To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):

\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]

This compares the full model (with education) to a reduced model (without education):

# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

# Partial F-test
f_test <- anova(mod_reduced, mod_educ)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does Education Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does Education Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education	4992	263744.4	3	970.7509	6.1246	4e-04

6.2 Using `car::Anova()` for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:

Anova(mod_educ, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12000.1867	1	227.1325	0e+00
age	8863.3522	1	167.7603	0e+00
sex	3469.6448	1	65.6714	0e+00
physhlth_days	28765.1139	1	544.4492	0e+00
sleep_hrs	3506.1243	1	66.3619	0e+00
education	970.7509	3	6.1246	4e-04
Residuals	263744.4348	4992	NA	NA

Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.

7. Contrasts and Alternative Coding Schemes

7.1 Reference (Treatment) Coding — The Default

This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.

contrasts(brfss_dv$education)

##                  HS graduate Some college College graduate
## Less than HS               0            0                0
## HS graduate                1            0                0
## Some college               0            1                0
## College graduate           0            0                1

7.2 Effect (Deviation) Coding

In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.

# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)

mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
                 data = brfss_dv)

tidy(mod_effect, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
      str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
      str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Effect Coding: Each Education Coefficient vs. Grand Mean
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Less than HS vs. Grand Mean	0.4648	0.3323	1.3988	0.1619	-0.1866	1.1162
Education: HS graduate vs. Grand Mean	-0.1225	0.1939	-0.6319	0.5275	-0.5026	0.2576
Education: Some college vs. Grand Mean	0.3358	0.1946	1.7257	0.0845	-0.0457	0.7174

With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.

7.3 Polynomial (Orthogonal) Contrasts for Ordinal Variables

When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.

# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)

mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
              data = brfss_dv)

tidy(mod_ord, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "\\.L$") ~ "Education: Linear trend",
      str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
      str_detect(term, "\\.C$") ~ "Education: Cubic trend",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Linear trend	-0.6642	0.3158	-2.1028	0.0355	-1.2833	-0.0450
Education: Quadratic trend	-0.2133	0.2682	-0.7954	0.4264	-0.7391	0.3125
Education: Cubic trend	-0.5630	0.2142	-2.6282	0.0086	-0.9830	-0.1431

Interpretation:

The linear contrast tests whether there is a consistent trend (higher education = fewer/more days)
The quadratic contrast tests whether the relationship curves (U-shaped or inverted-U)
The cubic contrast tests for an S-shaped pattern

Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.

7.4 Coding Scheme Comparison Summary

Summary of Dummy Variable Coding Schemes
Coding Scheme	R Function	Intercept	Each β represents	Best for
Treatment (Reference)	contr.treatment (default)	Reference group mean	Difference from reference group	Group comparisons to baseline
Effect (Deviation)	contr.sum	Grand mean	Deviation from grand mean	ANOVA-style analyses
Polynomial (Ordinal)	contr.poly	Grand mean	Linear/quadratic/cubic trend	Ordinal variables with ordered levels

8. Practical Considerations

8.1 Choosing the Reference Group

Guidelines for choosing the reference group:

Most common category — maximizes the precision of comparisons (largest reference sample)
Control or baseline condition — natural in experimental or clinical settings
Epidemiologically meaningful comparator — the group to which you want to compare all others
Consistency with published literature — facilitates comparison across studies

8.2 When `as.factor()` Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:

# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)

# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)

# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
  ~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
  "Numeric (wrong)", 1, 1,
  "Factor (correct)", 3, 3
) |>
  kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Numeric vs. Factor Treatment of Categorical Variables
Model	Number of education coefficients	Degrees of freedom used
Numeric (wrong)	1	1
Factor (correct)	3	3

8.3 Comparing Non-Reference Groups

What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:

Option 1: Change the reference group with relevel().

Option 2: Compute the difference manually from the model output.

# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]

# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")

cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")

## Estimated difference (HS grad - College grad): 0.556 days

cat("F-statistic:", round(lin_test$F[2], 3), "\n")

## F-statistic: 4.66

cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## p-value: 0.0309

car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.

Summary of Key Concepts

Concept	Key Point
Categorical predictors	Cannot be included as raw numeric codes in regression
Dummy variables	Binary (0/1) indicators; need \(k - 1\) for \(k\) categories
Reference group	The omitted category; all comparisons are relative to it
Changing reference	Use `relevel()`; predictions unchanged, interpretation changes
Partial F-test	Tests whether the categorical variable as a whole is significant
Dummy variable trap	Including \(k\) dummies + intercept = perfect multicollinearity
`as.factor()`	Required when categorical variable is stored as numeric
Coding schemes	Treatment (default), effect, polynomial — each answers a different question
Type III ANOVA	Preferred for unbalanced observational data
Linear hypothesis	`linearHypothesis()` tests comparisons between non-reference groups

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

SVariable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "brfss_dv_2020.rds"
)

#Task 1a
library(tidyverse)
library(gtsummary)

# Descriptive statistics table
brfss_dv %>%
  select(menthlth_days, age, sex, gen_health, marital_status) %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2
  )

Characteristic	N = 5,000¹
menthlth_days	3.77 (7.90)
IMPUTED AGE VALUE COLLAPSED ABOVE 80	54.86 (17.53)
sex
Male	2,303 (46%)
Female	2,697 (54%)
gen_health
Excellent	1,065 (21%)
Very good	1,803 (36%)
Good	1,426 (29%)
Fair	523 (10%)
Poor	183 (3.7%)
marital_status
Married	2,708 (54%)
Divorced	622 (12%)
Widowed	534 (11%)
Separated	109 (2.2%)
Never married	848 (17%)
Unmarried couple	179 (3.6%)
¹ Mean (SD); n (%)

#Task 1b
ggplot(brfss_dv, aes(x = factor(gen_health,
                                levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
                     y = menthlth_days)) +
  geom_boxplot() +
  labs(
    title = "Mentally Unhealthy Days by General Health",
    x = "General Health",
    y = "Mentally Unhealthy Days (past 30 days)"
  ) +
  theme_minimal()

#task 1c
brfss_dv %>%
  group_by(marital_status) %>%
  summarise(
    mean_menthlth_days = mean(menthlth_days, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_menthlth_days))

## # A tibble: 6 × 2
##   marital_status   mean_menthlth_days
##   <fct>                         <dbl>
## 1 Separated                      6.22
## 2 Unmarried couple               6.07
## 3 Never married                  5.28
## 4 Divorced                       4.49
## 5 Married                        3.10
## 6 Widowed                        2.67

#task 2a
brfss_dv <- brfss_dv %>%
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good" ~ 3,
      gen_health == "Fair" ~ 4,
      gen_health == "Poor" ~ 5
    )
  )

model1 <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

summary(model1)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6718     0.2705  -2.484    0.013 *  
## gen_health_numeric   1.8578     0.1036  17.926   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared:  0.06041,    Adjusted R-squared:  0.06022 
## F-statistic: 321.3 on 1 and 4998 DF,  p-value: < 2.2e-16

#Task 2b
model2 <- lm(menthlth_days ~ gen_health, data = brfss_dv)
summary(model2)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

#Task 3a 
model3 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
             data = brfss_dv)

summary(model3)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

#Task 3c
library(broom)
library(ggplot2)
library(dplyr)

# Extract coefficients + CIs
coef_data <- tidy(model3, conf.int = TRUE) %>%
  filter(grepl("gen_health", term))

# Clean labels
coef_data <- coef_data %>%
  mutate(term = case_when(
    term == "gen_healthVery good" ~ "Very good",
    term == "gen_healthGood" ~ "Good",
    term == "gen_healthFair" ~ "Fair",
    term == "gen_healthPoor" ~ "Poor",
    TRUE ~ term
  ))

# Plot
ggplot(coef_data, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    title = "Effect of General Health on Mentally Unhealthy Days",
    x = "Coefficient (Difference from Excellent)",
    y = "General Health Category"
  ) +
  theme_minimal()

#Task 4a

brfss_dv <- brfss_dv %>%
  mutate(gen_health = relevel(gen_health, ref = "Good"))

model4 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
             data = brfss_dv)

summary(model4)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.436584   0.629825  18.158  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthExcellent -1.843601   0.297260  -6.202 6.03e-10 ***
## gen_healthVery good -1.053654   0.258126  -4.082 4.54e-05 ***
## gen_healthFair       1.551682   0.386128   4.019 5.94e-05 ***
## gen_healthPoor       3.491746   0.650560   5.367 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

#Task 4c
#Get predictions from both models
pred1 <- predict(model3)
pred2 <- predict(model4)
#Compute correlation
cor(pred1, pred2)

## [1] 1

#Task 5a
model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
                    data = brfss_dv)

summary(model_reduced)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
##     data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.314  -3.520  -1.805   0.159  32.013 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.504082   0.612374  17.153  < 2e-16 ***
## age           -0.077335   0.005962 -12.972  < 2e-16 ***
## sexFemale      1.686413   0.207440   8.130 5.38e-16 ***
## physhlth_days  0.317247   0.013210  24.015  < 2e-16 ***
## sleep_hrs     -0.633026   0.077171  -8.203 2.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1515 
## F-statistic: 224.2 on 4 and 4995 DF,  p-value: < 2.2e-16

summary(model_reduced)$r.squared

## [1] 0.1521948

summary(model_reduced)$adj.r.squared

## [1] 0.1515159

summary(model3)$r.squared

## [1] 0.1694246

summary(model3)$adj.r.squared

## [1] 0.1680933

#Task 5b - Run the partial F-test
anova(model_reduced, model3)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4995 264715                                  
## 2   4991 259335  4    5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Task 5c - Run Type III ANOVA
library(car)

Anova(model3, type = "III")

## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df F value    Pr(>F)    
## (Intercept)    12031    1 231.536 < 2.2e-16 ***
## age            10908    1 209.926 < 2.2e-16 ***
## sex             3664    1  70.512 < 2.2e-16 ***
## physhlth_days  10634    1 204.654 < 2.2e-16 ***
## sleep_hrs       3049    1  58.687 2.207e-14 ***
## gen_health      5380    4  25.884 < 2.2e-16 ***
## Residuals     259335 4991                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

The analytic sample included 5,000 respondents. On average, participants reported 3.77 mentally unhealthy days in the past 30 days (SD = 7.90). The mean age of respondents was 54.86 years (SD = 17.53).

In terms of sex, the sample was slightly majority female (54%), with 46% male participants.

Self-reported general health was most commonly rated as “very good” (36%), followed by “good” (29%) and “excellent” (21%). Smaller proportions reported “fair” (10%) or “poor” (3.7%) health.

Regarding marital status, over half of respondents were married (54%). The remaining participants were never married (17%), divorced (12%), widowed (11%), unmarried couples (3.6%), and separated (2.2%).

Overall, the sample is predominantly middle-aged to older adults, slightly more female, with most individuals reporting relatively good general health and being married.

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

The group reporting the most mentally unhealthy days is the “Poor” general health group, which has the highest median and overall distribution of mentally unhealthy days.

The pattern shows a clear gradient: as general health worsens (from Excellent → Very good → Good → Fair → Poor), the number of mentally unhealthy days increases. Individuals in the “Excellent” category report very few mentally unhealthy days, while those in “Fair” and especially “Poor” report substantially more.

This pattern is consistent with expectations. Poorer self-rated general health is typically associated with greater physical and mental health burdens, which would lead to more mentally unhealthy days. The increasing spread in the worse health categories also suggests greater variability in mental health among those groups.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

The marital status group with the highest mean number of mentally unhealthy days is Separated (6.22 days), while the group with the lowest mean is Widowed (2.67 days).

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

The coefficient for gen_health_numeric is 1.8578.

This means that for each one-unit increase in general health (i.e., moving from a better to a worse health category, such as from “Excellent” to “Very good”), the number of mentally unhealthy days increases by approximately 1.86 days on average.

This suggests a positive association between poorer self-rated general health and an increase in mentally unhealthy days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

Why does the factor version use 4 coefficients instead of 1?

The factor model uses 4 coefficients because gen_health has 5 categories, and one category (in this case, “Excellent”) is used as the reference group. The remaining four categories (Very good, Good, Fair, Poor) each get their own coefficient, representing the difference in mean mentally unhealthy days compared to the reference group.

Comparison of the two models

The numeric model estimates a single coefficient (1.8578), assuming that each step from one health category to the next increases mentally unhealthy days by the same amount.

In contrast, the factor model estimates separate effects:

Very good: +0.59 days
Good: +1.95 days
Fair: +5.06 days
Poor: +9.66 days

This shows that the increase in mentally unhealthy days is not linear, the jump from “Fair” to “Poor” is much larger than from “Excellent” to “Very good.”

Additionally, the factor model has a slightly higher R² (0.073 vs. 0.060), indicating a better fit to the data.

Why the numeric approach may be misleading

The numeric approach assumes that the categories of general health are equally spaced, meaning each step has the same effect on mentally unhealthy days. However, the factor model shows this is not true, the increases vary substantially across categories.

Because of this, the numeric model oversimplifies the relationship and may underestimate or overestimate effects for certain groups. The factor model is more appropriate because it allows for nonlinear differences between categories.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

menthlth_days_hat = 9.5930
- 0.0867(age)
+ 1.7254(Female)
+ 0.2314(physhlth_days)
- 0.5866(sleep_hrs)
+ 0.7899(Very good)
+ 1.8436(Good)
+ 3.3953(Fair)
+ 5.3353(Poor)

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

In this model, “Excellent” general health is the reference group, so all coefficients are interpreted relative to individuals reporting excellent health.

Very good: Individuals reporting very good health have, on average, 0.79 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.

Good: Individuals reporting good health have, on average, 1.84 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.

Fair: Individuals reporting fair health have, on average, 3.40 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.

Poor: Individuals reporting poor health have, on average, 5.34 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

The Poor general health group differs the most from the reference group (Excellent health), as it has the largest positive coefficient. This indicates that individuals reporting poor health have the greatest increase in mentally unhealthy days compared to those in excellent health, holding all other variables constant.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

The coefficients for the continuous variables (such as age, physical health days, and sleep hours) are the same in both models (original reference = Excellent vs. new reference = Good). This is because changing the reference group for a categorical variable (gen_health) only affects the intercept and the dummy variable coefficients for that categorical variable. It does not affect the slopes of the continuous predictors, which represent the independent relationship between each continuous variable and the outcome. Therefore, the continuous variable coefficients remain unchanged because their estimates are not dependent on the choice of reference category for gen_health.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

The correlation between the predicted values from the two models is approximately 1, indicating that the predictions are identical.

Changing the reference group does not change the predicted values because it only reparameterizes the model, it shifts how the coefficients are expressed (i.e., the intercept and dummy variable coefficients), but the underlying fitted values remain the same. In other words, both models represent the same relationships in the data, just using a different baseline category, so they produce identical predictions.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

Reduced Model

R^2 = 0.1522 Adjusted R^2 = 0.1515

Full Model

R^2 = 0.1694
Adjusted R^2 = 0.1681

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

Null hypothesis (H₀): All coefficients for gen_health = 0
(gen_health does not improve the model)
Alternative hypothesis (H₁): At least one gen_health coefficient ≠ 0
(gen_health improves the model)

Test Results

F-statistic: 25.88
Degrees of freedom: (4, 4991)
p-value: < 2.2 × 10⁻¹⁶

A partial F-test comparing the reduced and full models yielded an F-statistic of 25.88 with 4 and 4991 degrees of freedom, and a p-value \< 2.2 × 10⁻¹⁶. Since the p-value is extremely small, we reject the null hypothesis and conclude that gen_health as a whole significantly improves the model, even after adjusting for age, sex, physical health days, and sleep hours.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

F-statistic: 25.88

F-statistic: 25.884

Degrees of freedom: 4

p-value: < 2.2 × 10⁻¹⁶

Using a Type III ANOVA, the overall effect of gen_health is statistically significant (F = 25.884, df = 4, p \< 2.2 × 10⁻¹⁶). This result is consistent with the partial F-test conducted earlier, which also yielded an F-statistic of 25.88 with 4 and 4991 degrees of freedom and a p-value \< 2.2 × 10⁻¹⁶. Both methods test whether all gen_health coefficients are jointly equal to zero, and both lead to the same conclusion that gen_health significantly improves the model.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

People who report worse general health tend to experience more mentally unhealthy days. Compared to those in excellent health, individuals in poor health report about 5 more mentally unhealthy days per month, while those in fair health report about 3 more days. Smaller differences are seen for those in good or very good health. Because this data was collected at one point in time, we cannot determine whether poorer health causes worse mental health—only that they are related.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

General health appears to be more strongly associated with mentally unhealthy days than education, as it shows larger differences across categories and contributes more to explaining variation in the outcome. In particular, the gap between excellent and poor health is substantial compared to typical differences seen across education levels. If building a final model, I would prioritize including general health, while considering education if it adds meaningful explanatory value or is important for the research question. Ultimately, the decision would depend on both the strength of association and the conceptual relevance of each variable. In contrast, the education model showed smaller differences in mentally unhealthy days across categories.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Jessica Okoroji

March 23, 2026