Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covered:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(ggplot2)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset.

brfss_full <- read_xpt(
  "/Users/nataliasmall/Downloads/EPI 553/LLCP2020.XPT "
) |>
  clean_names()

brfss_dv <- brfss_full |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # General health status (5-level)
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Marital status
    marital_status = factor(case_when(
      marital == 1 ~ "Married",
      marital == 2 ~ "Divorced",
      marital == 3 ~ "Widowed",
      marital == 4 ~ "Separated",
      marital == 5 ~ "Never married",
      marital == 6 ~ "Unmarried couple",
      TRUE         ~ NA_character_
    ), levels = c("Married", "Divorced", "Widowed", "Separated",
                  "Never married", "Unmarried couple")),
    # Store the raw education numeric code for the "naive approach" demonstration
    educ_numeric = case_when(
      educa %in% c(1, 2, 3) ~ 1,
      educa == 4             ~ 2,
      educa == 5             ~ 3,
      educa == 6             ~ 4,
      TRUE                   ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(gen_health),
    !is.na(marital_status)
  )

# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health, marital_status, educ_numeric) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_dv,
  "/Users/nataliasmall/Downloads/EPI 553/brfss_dv_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dv), ncol(brfss_dv))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically.

Data for the Lab

We are using the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status
         ) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by General Health Status

Individuals with poor general health status report the most mentally unhealthy days. This pattern appears consistent with what I expected regarding general health status’ influence on mental health (i.e individuals with poorer general health status will also have poor mental health, increasing amount of mental health days).

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

# Summary statistics by Marital Status
mean_mhd <- brfss_dv %>%
  group_by(marital_status) %>%
  summarise(
    n = n(),
    Mean = mean(menthlth_days)
  )

mean_mhd %>% 
  kable(digits = 2, 
        caption = "Descriptive Statistics — Mentally Unhealthy Days by Marital Status")

Descriptive Statistics — Mentally Unhealthy Days by Marital Status
marital_status	n	Mean
Married	2708	3.10
Divorced	622	4.49
Widowed	534	2.67
Separated	109	6.22
Never married	848	5.28
Unmarried couple	179	6.07

# Grouped bar chart: Mentally Unhealthy Days by Marital Status
ggplot(mean_mhd, aes(x = marital_status, y = Mean, fill = marital_status)) +
  geom_col(alpha = 0.85) + 
  geom_text(aes(label = round(Mean, 2)), vjust = -0.3) + 
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mean Mentally Unhealthy Days by Marital Status",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Marital Status in Analytic Sample

Marital status group with the highest mean: “Separated” Marital status group with the lowest mean: “Widowed”

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

# Create a numeric version of gen_health coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5
brfss_dv <- brfss_dv |>
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent"   ~ 1,
      gen_health == "Very good"   ~ 2,
      gen_health == "Good"        ~ 3,
      gen_health == "Fair"        ~ 4,
      gen_health == "Poor"        ~ 5,
      TRUE                        ~ NA_real_
    )
)

# Fit simple regression model: menthlth_days ~ gen_health_numeric
model_slr <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Simple Linear Regression: menthlth_days ~ gen_health_numeric (BRFSS 2020)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Simple Linear Regression: menthlth_days ~ gen_health_numeric (BRFSS 2020)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-0.6718	0.2705	-2.4840	0.013	-1.2021	-0.1416
gen_health_numeric	1.8578	0.1036	17.9259	0.000	1.6547	2.0610

The coefficient is 1.8578. For every one-unit increase in the gen_health_numeric score, the number of mentally unhealthy days is predicted to increase by 1.8578 days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

# Fit the same model but treating `gen_health` as a factor: menthlth_days ~ gen_health
factor_model_slr <- lm(menthlth_days ~ gen_health, data = brfss_dv)

# Tidy coefficient table
tidy(factor_model_slr, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Simple Linear Regression: gen_health treated as a factor",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Simple Linear Regression: gen_health treated as a factor
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

The factor version uses 4 coefficients instead of 1 because the regression is now comparing each category to the reference category. The numeric approach is misleading because general health status is an ordinal variable and the assumption of equal spacing is rarely justified. The jump from “Excellent” to “Very Good” is substantively different from “Poor” to “Fair”.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

# Fit the model with gen_health as a factor: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
model_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)

# Tidy coefficient table
tidy(model_full, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with gen_health as a factor: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with gen_health as a factor: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

Fitted regression equation: menthlth_days = 9.5930 + -0.0867(age) + 1.7254(sexFemale) + 0.2314(physhlth_days) + -0.5866(sleep_hrs) + 0.7899(gen_healthVery good) + 1.8436(gen_healthGood) + 3.3953(gen_healthFair) + 5.3353(gen_healthPoor)

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

-gen_healthVery good (0.7899): Compared to those with excellent general health status, those with very good general health status report an estimated 0.7899 additional mentally unhealthy days, holding all other variables constant (95% CI: 0.2417 to 1.3382). -gen_healthGood (1.8436): Compared to those with excellent general health status, those with good general health status report an estimated 1.8436 additional mentally unhealthy days, holding all other variables constant (95% CI: 1.2608 to 2.4264). -gen_healthFair (3.3953): Compared to those with excellent general health status, those with fair general health status report an estimated 3.3953 additional mentally unhealthy days, holding all other variables constant (95% CI: 2.5759 to 4.2147). -gen_healthPoor (5.3353): Compared to those with excellent general health status, those with poor general health status report an estimated 5.3353 additional mentally unhealthy days, holding all other variables constant (95% CI: 3.9965 to 6.6742).

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

# Create a coefficient plot
tidy(model_full, conf.int = TRUE) |>
  filter(str_detect(term, "gen_health")) |>
  mutate(
    term = str_replace(term, "gen_health", ""),
    term = factor(term, levels = c("Very good", "Good", "Fair", "Poor"))
  ) |>
  
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_point(size = 4) +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Paired", direction = -1) +
  labs(
    title    = "Coefficient Plot for Gen_health",
    subtitle = "Reference: Gen_health = Excellent",
    x        = "Difference in Mentally Unhealthy Days",
    y        = "General Health Category"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

The poor general health status group differs most from the reference group, reporting an average of over 5 additional mentally unhealthy days compared to those with ‘Excellent’ health.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

# Change reference group to Good
brfss_dv$gen_reref <- relevel(brfss_dv$gen_health, ref = "Good")

mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_reref, data = brfss_dv)

tidy(mod_gen_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: Good)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_rerefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_rerefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_rerefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_rerefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

tribble(
  ~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
  "Intercept", round(coef(model_full)[1], 3), round(coef(mod_gen_reref)[1], 3),
  "Age coefficient", round(coef(model_full)[2], 3), round(coef(mod_gen_reref)[2], 3),
  "Sex coefficient", round(coef(model_full)[3], 3), round(coef(mod_gen_reref)[3], 3),
  "Physical health days", round(coef(model_full)[4], 3), round(coef(mod_gen_reref)[4], 3),
  "Sleep hours", round(coef(model_full)[5], 3), round(coef(mod_gen_reref)[5], 3),
  "R-squared", round(summary(model_full)$r.squared, 4), round(summary(mod_gen_reref)$r.squared, 4),
  "Residual SE", round(summary(model_full)$sigma, 3), round(summary(mod_gen_reref)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Excellent	Ref: Good
Intercept	9.5930	11.4370
Age coefficient	-0.0870	-0.0870
Sex coefficient	1.7250	1.7250
Physical health days	0.2310	0.2310
Sleep hours	-0.5870	-0.5870
R-squared	0.1694	0.1694
Residual SE	7.2080	7.2080

The education and other continuous variable coefficients between the two models are exactly the same. This is because changing the reference group only shifts the intercept without changing the underlying math of the entire model; changing the reference of one variable should not affect the relationship between mental health and the other variables. The relationship between mental health and continuous variables stay exactly the same regardless of which health category is indicated as the reference point. However, the intercept did change, as well as the coefficients for the Gen_Health variable. This was expected since the reference category changed.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

# Verify that predicted values are identical
pred_orig <- predict(model_full)
pred_reref <- predict(mod_gen_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

Since correlation between predictions is 1 and maximum absolute difference is 0, predicted values from both models are identical. Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

summary(mod_reduced)$r.squared

## [1] 0.1521948

summary(mod_reduced)$adj.r.squared

## [1] 0.1515159

summary(model_full)$r.squared

## [1] 0.1694246

summary(model_full)$adj.r.squared

## [1] 0.1680933

Reduced Model R²: 0.1521948 Reduced Model Adjusted R²: 0.1515159 Full Model R²: 0.1694246 Full Model Adjusted R²: 0.1680933

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

# Partial F-test
f_test <- anova(mod_reduced, model_full)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Status Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does General Health Status Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0

Null hypothesis: Gen_health does not significantly improve the model (i.e the coefficients for all categories of gen_health are equal to zero) Alternative hypothesis: Gen_health significantly improves the model (i.e at least one category of gen_health has a coefficient significantly different from zero) F-statistic: 25.8838 Degrees of freedom: 4 Residual degrees of freedom: 4991 P-value: 0 (< 0.001) Conclusion: Since p<0.05 we reject the null hypothesis, indicating that there is strong evidence that gen_health significantly improves the model, holding all other variables constant.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(model_full, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

Yes, they are consistent. The F-statistic, 25.8838, and the p-value, 0, are the same in both tests.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

Compared to those in excellent health, individuals who rate their health as “Poor” show the largest difference, reporting over 5 additional mentally unhealthy days per month on average. As general health status declines, people tend to report increases in mentally unhealthy days. It is important to note that since this data was collected at a single point in time, we cannot say whether declining general health causes mental health issues or if the two simply tend to occur together.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

General health is the stronger predictor because its categories show much larger differences in mental health days. For example, having Poor health is associated with 5.3 additional mentally unhealthy days compared to the reference, whereas the lowest Education level is only associated with about 1 additional day. When choosing which to include when building a final model, I would pick the one that explains a larger portion of the differences, general health status, which is confirmed by its higher R² value and F-statistic.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Natalia Small

March 23, 2026