Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(flextable)
library(tibble)


options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

brfss_full <- read_xpt(
  "C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/LLCP2020.XPT"
) |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Store the raw education numeric code for the "naive approach" demonstration
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_dv,
  "C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/brfss_dv_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
1Mean (SD); n (%)

ggplot(brfss_dv, aes(x = education, fill = education)) +
  geom_bar(alpha = 0.85) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Education Level in Analytic Sample

ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by Education Level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "Education Level",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by Education Level

Part 1: Guided Practice — Dummy Variables

1. Categorical Variables: The Problem

1.1 Types of Categorical Variables

Categorical predictor variables come in two forms:

Type	Definition	Examples
Nominal	Categories with no natural ordering	Sex, race/ethnicity, marital status, blood type
Ordinal	Categories with a meaningful order	Education level, income bracket, disease stage, Likert scale

A further distinction is:

Dichotomous (k = 2): Only two categories (e.g., sex: Male/Female)
Multichotomous (k > 2): Three or more categories (e.g., education: 4 levels)

Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.

1.2 The Naive Approach: Why Numeric Codes Fail

Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.

If we include this numeric code directly in a regression model, we are assuming:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]

This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.

# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: Education Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: Education Treated as Continuous
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	9.5601	0.5039	18.9727	8.5723	10.5479
age	-0.0661	0.0063	-10.5135	-0.0784	-0.0538
educ_numeric	-0.7168	0.1158	-6.1917	-0.9437	-0.4898

This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:

For nominal variables (like race or marital status), numeric codes are entirely arbitrary. The “distance” between code 1 and code 2 has no meaning.
Even for ordinal variables (like education), the assumption of equal spacing is rarely justified. The jump from “Less than HS” to “HS graduate” is substantively different from “Some college” to “College graduate.”

Let’s visualize why this matters:

# Compute observed group means
group_means <- brfss_dv |>
  summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))

# Generate predictions from the naive model
pred_naive <- tibble(
  educ_numeric = 1:4,
  predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)

ggplot() +
  geom_point(data = group_means,
             aes(x = educ_numeric, y = mean_days),
             size = 4, color = "steelblue") +
  geom_line(data = pred_naive,
            aes(x = educ_numeric, y = predicted),
            color = "tomato", linewidth = 1.2, linetype = "dashed") +
  geom_point(data = pred_naive,
             aes(x = educ_numeric, y = predicted),
             size = 3, color = "tomato", shape = 17) +
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
  ) +
  labs(
    title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
    subtitle = "The naive model forces equal spacing between education levels",
    x = "Education Level",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Naive Linear Fit vs. Actual Group Means by Education

Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.

2. Dummy (Indicator) Variables

2.1 Definition and the k - 1 Rule

A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:

1 to indicate the presence of a condition
0 to indicate its absence

If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).

Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.

2.2 The Dichotomous Case (k = 2)

The simplest example is a variable with two categories, such as sex.

With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:

\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]

The regression model becomes:

\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]

For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]

For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]

Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.

# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)

tidy(mod_sex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: Sex",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: Sex
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	6.6262	0.3730	17.7666	5.8951	7.3574
age	-0.0698	0.0063	-11.1011	-0.0821	-0.0575
sexFemale	1.8031	0.2210	8.1585	1.3698	2.2364

b_sex <- round(coef(mod_sex), 3)

Interpretation:

Intercept (6.626): The estimated mean number of mentally unhealthy days for a male of age 0. (This is a mathematical artifact, not a substantive value.)
Age (-0.07): Each additional year of age is associated with a 0.07 day change in mentally unhealthy days, holding sex constant.
Sex: Female (1.803): Compared to males (the reference group), females report an estimated 1.803 more mentally unhealthy days on average, holding age constant.

Note that R automatically creates dummy variables when a factor is included in lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.

pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))

ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Sex",
    subtitle = "Parallel lines: same slope, different intercepts",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Parallel Regression Lines: Males vs. Females

Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.

3. Multichotomous Dummy Variables (k > 2)

3.1 Constructing the Dummies

Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:

\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]

\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]

The data matrix looks like this:

Dummy Variable Encoding for Education (Reference: Less than HS)
Observation	Education	HS_graduate	Some_college	College_graduate
1	Less than HS	0	0	0
2	HS graduate	1	0	0
3	Some college	0	1	0
4	College graduate	0	0	1
5	Less than HS	0	0	0

Notice that the reference group is identified by having all dummy variables equal to zero.

3.2 The Reference Group

The reference group is the category against which all others are compared. Key points:

The choice of reference group affects the interpretation of coefficients but not the model’s predictive ability
The predicted values are identical regardless of which reference group is chosen
Choose the reference group based on your research question (e.g., the most common category, the “control” condition, or the group of primary interest)
You can always test differences between any pair of groups, not just comparisons to the reference

3.3 Fitting the Model in R

When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.

# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
               data = brfss_dv)

tidy(mod_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Education Dummy Variables (Reference: Less than HS)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Education Dummy Variables (Reference: Less than HS)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

b_educ <- round(coef(mod_educ), 3)
ci_educ <- round(confint(mod_educ), 3)

3.4 Interpreting Each Dummy Coefficient

The model is:

\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]

Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:

HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.

Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).

3.5 Visualizing the Parallel Lines

pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and Education",
    subtitle = "Parallel lines: same slopes for age, different intercepts by education",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Predicted Mental Health Days by Age and Education Level

These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.

4. Changing the Reference Group

4.1 Using `relevel()` in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.

# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")

mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
                     data = brfss_dv)

tidy(mod_educ_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: College graduate)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: College graduate)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.9948	0.6272	15.9349	0.0000	8.7652	11.2245
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
education_rerefLess than HS	1.1429	0.4607	2.4805	0.0132	0.2396	2.0461
education_rerefHS graduate	0.5556	0.2574	2.1586	0.0309	0.0510	1.0601
education_rerefSome college	1.0139	0.2566	3.9507	0.0001	0.5108	1.5171

4.2 What Changes and What Stays the Same?

tribble(
  ~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
  "Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
  "Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
  "Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
  "Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
  "Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
  "R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
  "Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Less than HS	Ref: College graduate
Intercept	11.1380	9.9950
Age coefficient	-0.0770	-0.0770
Sex coefficient	1.6810	1.6810
Physical health days	0.3110	0.3110
Sleep hours	-0.6280	-0.6280
R-squared	0.1553	0.1553
Residual SE	7.2690	7.2690

What changes:

The intercept
The education dummy variable coefficients (they now represent comparisons to the new reference group)

What stays the same:

Coefficients for age, sex, physical health days, and sleep
\(R^2\), Adjusted \(R^2\), and Residual Standard Error
All predicted values
The overall F-test

This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.

# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

5. The Dummy Variable Trap (Perfect Multicollinearity)

5.1 Why We Cannot Include All k Dummies

If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:

\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]

where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.

This is called the dummy variable trap.

The Dummy Variable Trap: Intercept = A + B + C (perfect linear dependence)
Obs	Intercept	A	B	C	A + B + C
1	1	1	0	0	1
2	1	0	1	0	1
3	1	0	0	1	1
4	1	1	0	0	1

Solutions:

Reference cell coding (default in R): Include \(k - 1\) dummies; the omitted category is absorbed into the intercept
Remove the intercept: Fit the model with - 1 in the formula and include all \(k\) dummies. Then each coefficient is the group mean (adjusted for other predictors) rather than a difference from a reference.

# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
                 data = brfss_dv)

tidy(mod_no_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model Without Intercept: All k Education Dummies Included",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Without Intercept: All k Education Dummies Included
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexMale	11.1377	0.7390	15.0709	0.0000	9.6889	12.5865
sexFemale	12.8190	0.7524	17.0365	0.0000	11.3439	14.2941
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
educationHS graduate	-0.5873	0.4719	-1.2445	0.2134	-1.5125	0.3379
educationSome college	-0.1289	0.4735	-0.2723	0.7854	-1.0572	0.7993
educationCollege graduate	-1.1429	0.4607	-2.4805	0.0132	-2.0461	-0.2396

Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.

6. Testing Whether a Categorical Variable is Significant

6.1 The Partial F-Test (Type I and Type III)

When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.

To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):

\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]

This compares the full model (with education) to a reduced model (without education):

# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

# Partial F-test
f_test <- anova(mod_reduced, mod_educ)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does Education Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does Education Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education	4992	263744.4	3	970.7509	6.1246	4e-04

6.2 Using `car::Anova()` for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:

Anova(mod_educ, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12000.1867	1	227.1325	0e+00
age	8863.3522	1	167.7603	0e+00
sex	3469.6448	1	65.6714	0e+00
physhlth_days	28765.1139	1	544.4492	0e+00
sleep_hrs	3506.1243	1	66.3619	0e+00
education	970.7509	3	6.1246	4e-04
Residuals	263744.4348	4992	NA	NA

Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.

7. Contrasts and Alternative Coding Schemes

7.1 Reference (Treatment) Coding — The Default

This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.

contrasts(brfss_dv$education)

##                  HS graduate Some college College graduate
## Less than HS               0            0                0
## HS graduate                1            0                0
## Some college               0            1                0
## College graduate           0            0                1

7.2 Effect (Deviation) Coding

In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.

# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)

mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
                 data = brfss_dv)

tidy(mod_effect, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
      str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
      str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Effect Coding: Each Education Coefficient vs. Grand Mean
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Less than HS vs. Grand Mean	0.4648	0.3323	1.3988	0.1619	-0.1866	1.1162
Education: HS graduate vs. Grand Mean	-0.1225	0.1939	-0.6319	0.5275	-0.5026	0.2576
Education: Some college vs. Grand Mean	0.3358	0.1946	1.7257	0.0845	-0.0457	0.7174

With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.

7.3 Polynomial (Orthogonal) Contrasts for Ordinal Variables

When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.

# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)

mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
              data = brfss_dv)

tidy(mod_ord, conf.int = TRUE) |>
  mutate(
    term = case_when(
      str_detect(term, "\\.L$") ~ "Education: Linear trend",
      str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
      str_detect(term, "\\.C$") ~ "Education: Cubic trend",
      TRUE ~ term
    ),
    across(where(is.numeric), \(x) round(x, 4))
  ) |>
  kable(
    caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.6729	0.6172	17.2911	0.0000	9.4628	11.8830
age	-0.0772	0.0060	-12.9522	0.0000	-0.0888	-0.0655
sexFemale	1.6813	0.2075	8.1038	0.0000	1.2745	2.0880
physhlth_days	0.3112	0.0133	23.3334	0.0000	0.2850	0.3373
sleep_hrs	-0.6281	0.0771	-8.1463	0.0000	-0.7793	-0.4770
Education: Linear trend	-0.6642	0.3158	-2.1028	0.0355	-1.2833	-0.0450
Education: Quadratic trend	-0.2133	0.2682	-0.7954	0.4264	-0.7391	0.3125
Education: Cubic trend	-0.5630	0.2142	-2.6282	0.0086	-0.9830	-0.1431

Interpretation:

The linear contrast tests whether there is a consistent trend (higher education = fewer/more days)
The quadratic contrast tests whether the relationship curves (U-shaped or inverted-U)
The cubic contrast tests for an S-shaped pattern

Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.

8. Practical Considerations

8.1 Choosing the Reference Group

Guidelines for choosing the reference group:

Most common category — maximizes the precision of comparisons (largest reference sample)
Control or baseline condition — natural in experimental or clinical settings
Epidemiologically meaningful comparator — the group to which you want to compare all others
Consistency with published literature — facilitates comparison across studies

8.2 When `as.factor()` Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:

# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)

# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)

# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
  ~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
  "Numeric (wrong)", 1, 1,
  "Factor (correct)", 3, 3
) |>
  kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Numeric vs. Factor Treatment of Categorical Variables
Model	Number of education coefficients	Degrees of freedom used
Numeric (wrong)	1	1
Factor (correct)	3	3

8.3 Comparing Non-Reference Groups

What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:

Option 1: Change the reference group with relevel().

Option 2: Compute the difference manually from the model output.

# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]

# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")

cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")

## Estimated difference (HS grad - College grad): 0.556 days

cat("F-statistic:", round(lin_test$F[2], 3), "\n")

## F-statistic: 4.66

cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## p-value: 0.0309

car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.

Summary of Key Concepts

Concept	Key Point
Categorical predictors	Cannot be included as raw numeric codes in regression
Dummy variables	Binary (0/1) indicators; need \(k - 1\) for \(k\) categories
Reference group	The omitted category; all comparisons are relative to it
Changing reference	Use `relevel()`; predictions unchanged, interpretation changes
Partial F-test	Tests whether the categorical variable as a whole is significant
Dummy variable trap	Including \(k\) dummies + intercept = perfect multicollinearity
`as.factor()`	Required when categorical variable is stored as numeric
Coding schemes	Treatment (default), effect, polynomial — each answers a different question
Type III ANOVA	Preferred for unbalanced observational data
Linear hypothesis	`linearHypothesis()` tests comparisons between non-reference groups

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 26, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv <- readRDS(
  "C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/brfss_dv_2020.rds"
)

table1 <- brfss_dv %>%
  select(menthlth_days, age, sex, gen_health, marital_status) %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    )
  )

table1

Characteristic	N = 5,000¹
menthlth_days	4 (8)
IMPUTED AGE VALUE COLLAPSED ABOVE 80	55 (18)
sex
Male	2,303 (46%)
Female	2,697 (54%)
gen_health
Excellent	1,065 (21%)
Very good	1,803 (36%)
Good	1,426 (29%)
Fair	523 (10%)
Poor	183 (3.7%)
marital_status
Married	2,708 (54%)
Divorced	622 (12%)
Widowed	534 (11%)
Separated	109 (2.2%)
Never married	848 (17%)
Unmarried couple	179 (3.6%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days)) +
  geom_boxplot() +
  labs(
    x = "General Health",
    y = "Mentally Unhealthy Days (past 30)",
    title = "Mentally Unhealthy Days by General Health Status"
  )

#Individuals reporting poor general health have the highest number of mentally unhealthy days. #This suggests a strong association between general health status and mental health, where poorer overall health is linked to more frequent mental health challenges.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

marital_summary <- brfss_dv %>%
  group_by(marital_status) %>%
  summarize(
    mean_menthlth = mean(menthlth_days, na.rm = TRUE)
  )

marital_summary

## # A tibble: 6 × 2
##   marital_status   mean_menthlth
##   <fct>                    <dbl>
## 1 Married                   3.10
## 2 Divorced                  4.49
## 3 Widowed                   2.67
## 4 Separated                 6.22
## 5 Never married             5.28
## 6 Unmarried couple          6.07

ggplot(marital_summary, aes(x = marital_status, y = mean_menthlth)) +
  geom_col() +
  labs(
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days",
    title = "Mean Mentally Unhealthy Days by Marital Status"
  )

#The marital status group with the highest mean number of mentally unhealthy days is separated individuals (6.22 days), while the lowest mean is observed among widowed individuals (2.67 days).

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

brfss_dv <- brfss_dv %>%
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good" ~ 3,
      gen_health == "Fair" ~ 4,
      gen_health == "Poor" ~ 5
    )
  )

model_gen <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
summary(model_gen)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6718     0.2705  -2.484    0.013 *  
## gen_health_numeric   1.8578     0.1036  17.926   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared:  0.06041,    Adjusted R-squared:  0.06022 
## F-statistic: 321.3 on 1 and 4998 DF,  p-value: < 2.2e-16

#The coefficient for gen_health_numeric is 1.86, indicating that for each one-category worsening in general health (e.g., from Excellent to Very good, or Good to Fair), the number of mentally unhealthy days increases by an average of 1.86 days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

model_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)
summary(model_factor)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

#The factor version of the model uses four coefficients because gen_health has five categories, and dummy-variable regression requires one fewer coefficient than the number of categories, with one category serving as the reference group. Each coefficient represents the difference in mentally unhealthy days between a given category and the reference group.

#In contrast, the numeric model uses a single coefficient and assumes a linear relationship, where each one-category worsening in health has the same effect on mentally unhealthy days. This assumption may be misleading because the differences between categories are not necessarily equal. The factor model is more flexible, as it allows each health category to have a distinct effect without imposing a linear structure.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

model_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
                 data = brfss_dv)
summary(model_full)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

#The fitted regression model is: Predicted mentally unhealthy days = 9.593 − 0.0867(age) + 1.725(sexFemale) + 0.2314(physically unhealthy days) − 0.5866(sleep hours). 0.7899(gen health = Very good) + 1.8436(gen health = Good) 3.3953(gen health = Fair) + 5.3353(gen health = Poor)

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

#All comparisons are made relative to individuals in “Excellent” general health. #gen_health = Very good Individuals reporting very good health have, on average, 0.79 more mentally unhealthy days than those in excellent health, holding all other variables constant. #gen_health = Good Individuals reporting good health have, on average, 1.84 more mentally unhealthy days than those in excellent health, holding all other variables constant. #gen_health = Fair Individuals reporting fair health have, on average, 3.40 more mentally unhealthy days than those in excellent health, holding all other variables constant. #gen_health = Poor Individuals reporting poor health have, on average, 5.34 more mentally unhealthy days than those in excellent health, holding all other variables constant. #These results show a clear gradient: as general health worsens from excellent to poor, the number of mentally unhealthy days increases substantially, holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

coef_data <- tidy(model_full, conf.int = TRUE) %>%
  filter(grepl("gen_health", term))

coef_data <- coef_data %>%
  mutate(
    term = gsub("gen_health", "", term)
  )

ggplot(coef_data, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    x = "Estimated Difference in Mentally Unhealthy Days",
    y = "General Health Category",
    title = "Effect of General Health on Mentally Unhealthy Days"
  )

#The group that differs most from the reference group (Excellent health) is the Poor health group, which has the largest estimated increase in mentally unhealthy days (about +5.34 days), holding all other variables constant.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_dv$gen_health <- relevel(brfss_dv$gen_health, ref = "Good")
model_refit <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
                  data = brfss_dv)

summary(model_refit)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.436584   0.629825  18.158  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthExcellent -1.843601   0.297260  -6.202 6.03e-10 ***
## gen_healthVery good -1.053654   0.258126  -4.082 4.54e-05 ***
## gen_healthFair       1.551682   0.386128   4.019 5.94e-05 ***
## gen_healthPoor       3.491746   0.650560   5.367 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

#The coefficients for the continuous variables (age, physically unhealthy days, and sleep hours) and the binary variable sex are the same in both models. This is because changing the reference group for a categorical variable does not affect the estimated relationships of the other predictors with the outcome. Instead, it only changes how the categorical variable is parameterized. The intercept and the gen_health coefficients differ between the models because they are defined relative to different reference groups, but the overall model fit and predicted values remain unchanged.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_old <- predict(model_full)    
pred_new <- predict(model_refit) 
cor(pred_old, pred_new)

## [1] 1

#Changing the reference group does not change the predicted values because it only changes how the categorical variable is coded, not the underlying model. The same information is still included in the model, just expressed differently. As a result, the regression coefficients are rearranged, but they produce the same fitted values for each observation. In other words, the model is mathematically equivalent under different reference group choices, so predictions remain unchanged.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
                    data = brfss_dv)

summary(model_reduced)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
##     data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.314  -3.520  -1.805   0.159  32.013 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.504082   0.612374  17.153  < 2e-16 ***
## age           -0.077335   0.005962 -12.972  < 2e-16 ***
## sexFemale      1.686413   0.207440   8.130 5.38e-16 ***
## physhlth_days  0.317247   0.013210  24.015  < 2e-16 ***
## sleep_hrs     -0.633026   0.077171  -8.203 2.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1515 
## F-statistic: 224.2 on 4 and 4995 DF,  p-value: < 2.2e-16

summary(model_reduced)$r.squared

## [1] 0.1521948

summary(model_reduced)$adj.r.squared

## [1] 0.1515159

summary(model_full)$r.squared

## [1] 0.1694246

summary(model_full)$adj.r.squared

## [1] 0.1680933

#The reduced model, which excludes general health, explains about 15.2% of the variation in mentally unhealthy days. In contrast, the full model including gen_health explains about 16.9% of the variation. The increase in both reported \(R^2\) and adjusted \(R^2\) indicates that adding gen_health improves the model’s explanatory power, even after accounting for the number of predictors.

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

anova(model_reduced, model_full)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4995 264715                                  
## 2   4991 259335  4    5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#We reject the null hypothesis. The partial F-test shows that adding gen_health significantly improves the model (F(4, 4991) = 25.88, p < 0.001). This indicates that general health status, as a group of variables, provides significant additional explanatory power for predicting mentally unhealthy days beyond age, sex, physical health, and sleep.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

car::Anova(model_full, type = "III")

## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df F value    Pr(>F)    
## (Intercept)    12031    1 231.536 < 2.2e-16 ***
## age            10908    1 209.926 < 2.2e-16 ***
## sex             3664    1  70.512 < 2.2e-16 ***
## physhlth_days  10634    1 204.654 < 2.2e-16 ***
## sleep_hrs       3049    1  58.687 2.207e-14 ***
## gen_health      5380    4  25.884 < 2.2e-16 ***
## Residuals     259335 4991                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#The Type III ANOVA test for gen_health produces the same F-statistic (F(4, 4991) = 25.884) and p-value (p < 0.001) as the partial F-test comparing the reduced and full models. This indicates that both methods are testing the same hypothesis: whether the set of gen_health coefficients is jointly equal to zero after adjusting for the other variables in the model. Because the models are nested and include the same predictors, the two approaches yield identical results.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

#People who reported worse overall health also tended to report more mentally unhealthy days. Compared to those in excellent health, individuals in poor health had the largest difference, averaging about 5 more days per month of poor mental health, while those in fair health had about 3 more days. Even people in good or very good health reported slightly more mentally unhealthy days than those in excellent health. Because this information comes from a single point in time, it shows a relationship between general health and mental health, but it does not tell us whether one causes the other.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

#General health appears to be much more strongly associated with mentally unhealthy days than education. The differences across general health groups are large (for example, people in poor health report several more unhealthy days per month compared to those in excellent health), while the differences across education levels are relatively small. If building a final model, I would prioritize including general health because it explains more variation in mental health and shows a clearer pattern. Education could still be included if it adds additional insight, but its contribution is likely more modest compared to general health

End of Lab Activity

Lab: Dummy Variables in Regression

Lab: EPI 553 — Principles of Statistical Inference II (Spring 2026)

Nursultan

due March 26, 2026