Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)


options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_dv <- readRDS("~/R/site-library/Epi553/code/brfss_dv_2020.rds")

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(flextable)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits  = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(
    title    = "Mentally Unhealthy Days by General Health Status",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "General Health Status",
    y        = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mentally Unhealthy Days by General Health Status

Ans: Individuals reporting poorer general health experienced more mentally unhealthy days. The “Poor” health group showed the highest median and greatest variability, whereas the “Excellent” group had the lowest. These findings align with the well-established bidirectional relationship between physical and mental health.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

marital_means <- brfss_dv |>
  group_by(marital_status) |>
  summarise(
    n         = n(),
    mean_days = round(mean(menthlth_days, na.rm = TRUE), 2),
    sd_days   = round(sd(menthlth_days, na.rm = TRUE), 2)
  ) |>
  arrange(desc(mean_days))

marital_means |>
  kable(
    caption   = "Table 1c. Mean Mentally Unhealthy Days by Marital Status",
    col.names = c("Marital Status", "n", "Mean Days", "SD")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 1c. Mean Mentally Unhealthy Days by Marital Status
Marital Status	n	Mean Days	SD
Separated	109	6.22	9.97
Unmarried couple	179	6.07	9.50
Never married	848	5.28	8.82
Divorced	622	4.49	8.99
Married	2708	3.10	7.15
Widowed	534	2.67	6.90

ggplot(marital_means, aes(x = reorder(marital_status, mean_days),
                           y = mean_days, fill = marital_status)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = mean_days), hjust = -0.2, size = 4) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Mean Mentally Unhealthy Days by Marital Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x        = "Marital Status",
    y        = "Mean Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mean Mentally Unhealthy Days by Marital Status

Ans: Separated individuals report the highest mean number of mentally unhealthy days, whereas married individuals report the lowest. This pattern aligns with existing literature linking marital dissolution and social isolation to poorer mental health outcomes. —

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

# Create numeric version: Excellent=1, Very good=2, Good=3, Fair=4, Poor=5
brfss_dv <- brfss_dv |>
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good"      ~ 3,
      gen_health == "Fair"      ~ 4,
      gen_health == "Poor"      ~ 5,
      TRUE                      ~ NA_real_
    )
  )

# Naive numeric model
mod_genhlth_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

tidy(mod_genhlth_naive, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Table 2a. Naive Model: General Health as Continuous Numeric",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2a. Naive Model: General Health as Continuous Numeric
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-0.6718	0.2705	-2.4840	0.013	-1.2021	-0.1416
gen_health_numeric	1.8578	0.1036	17.9259	0.000	1.6547	2.0610

Ans: The naive model produces a single coefficient of 1.858. Assuming a constant effect across all levels of general health. This implies that transitions between adjacent categories (e.g., Excellent to Very good versus Fair to Poor) have equivalent impacts on mentally unhealthy days, an assumption that is unlikely to be valid and is not supported by the observed data.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

# Correct factor model
mod_genhlth_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)

# Compare number of coefficients
tribble(
  ~Model,                    ~`Education coefficients`, ~`Degrees of freedom used`,
  "Numeric (naive)",          1,                         1,
  "Factor (correct)",         4,                         4
) |>
  kable(caption = "Table 2b. Comparison: Numeric vs. Factor Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2b. Comparison: Numeric vs. Factor Model
Model	Education coefficients	Degrees of freedom used
Numeric (naive)	1	1
Factor (correct)	4	4

# Show factor model coefficients
tidy(mod_genhlth_factor, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Table 2b-2. Factor Model: General Health as Categorical Variable",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2b-2. Factor Model: General Health as Categorical Variable
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

Ans: Why the factor version uses 4 coefficients: The factor model has 𝑘−1 = 4 k−1=4 coefficients with 𝑘 = 5 k=5 categories, each of which compares a category to the reference group (Excellent). In contrast to the naive approach, which imposes a single linear effect across all levels, this avoids assuming equal spacing across categories.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

mod_genhlth_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
                       data = brfss_dv)

tidy(mod_genhlth_full, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Table 3a. Full Model: General Health as Categorical Predictor",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3a. Full Model: General Health as Categorical Predictor
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

b_gh <- round(coef(mod_genhlth_full), 3)

Fitted Regression Equation:

\[\widehat{\text{Mental Health Days}} = 9.593 + -0.087(\text{Age}) + 1.725(\text{Female}) + 0.231(\text{Phys Days}) + -0.587(\text{Sleep}) + 0.79(\text{Very good}) + 1.844(\text{Good}) + 3.395(\text{Fair}) + 5.335(\text{Poor})\]

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

Ans: Reference group: Excellent general health.

Very good (\(\hat{\beta}\) = 0.79): Compared to those with excellent general health, individuals reporting very good health have an estimated 0.79 more mentally unhealthy days on average, keeping age, sex, physical health days, and sleep hours constant.

-Good (\(\hat{\beta}\) = 1.844): Compared to those with excellent general health, individuals reporting good health have an estimated 1.844 more mentally unhealthy days on average, holding all other variables constant.

Fair (\(\hat{\beta}\) = 3.395): Compared to those with excellent general health, individuals reporting fair health have an estimated 3.395 more mentally unhealthy days on average, keeping the other variables constant.

Poor (\(\hat{\beta}\) = 5.335): Compared to those with excellent general health, individuals reporting poor health have an estimated 5.335 more mentally unhealthy days on average, keeping the variables constant. This is the largest difference observations across all general health categories.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

tidy(mod_genhlth_full, conf.int = TRUE) |>
  filter(str_detect(term, "gen_health")) |>
  mutate(
    term = str_replace(term, "gen_health", ""),
    term = factor(term, levels = c("Very good", "Good", "Fair", "Poor"))
  ) |>
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_point(size = 4) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "RdYlGn", direction = -1) +
  labs(
    title    = "Estimated Differences in Mentally Unhealthy Days by General Health",
    subtitle = "Reference group: Excellent health — All estimates adjusted for age, sex, physical health, and sleep",
    x        = "Estimated Difference in Mentally Unhealthy Days (vs. Excellent)",
    y        = "General Health Category"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Coefficient Plot: General Health Dummy Variables

Finding: The “Poor” health group differs most from the reference group (Excellent), with an estimated difference of 5.335 mentally unhealthy days. A clear dose–response pattern is observed, with each decline in general health associated with an increase in mentally unhealthy days.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_dv$gen_health_reref <- relevel(brfss_dv$gen_health, ref = "Good")

mod_genhlth_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref,
                        data = brfss_dv)

tidy(mod_genhlth_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Table 4a. Model with Reference Group Changed to 'Good' Health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4a. Model with Reference Group Changed to ‘Good’ Health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_health_rerefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_health_rerefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_health_rerefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_health_rerefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

tribble(
  ~Quantity,               ~`Ref: Excellent`,                        ~`Ref: Good`,
  "Intercept",             round(coef(mod_genhlth_full)[1], 3),       round(coef(mod_genhlth_reref)[1], 3),
  "Age coefficient",       round(coef(mod_genhlth_full)[2], 3),       round(coef(mod_genhlth_reref)[2], 3),
  "Sex (Female)",          round(coef(mod_genhlth_full)[3], 3),       round(coef(mod_genhlth_reref)[3], 3),
  "Physical health days",  round(coef(mod_genhlth_full)[4], 3),       round(coef(mod_genhlth_reref)[4], 3),
  "Sleep hours",           round(coef(mod_genhlth_full)[5], 3),       round(coef(mod_genhlth_reref)[5], 3),
  "R-squared",             round(summary(mod_genhlth_full)$r.squared, 4), round(summary(mod_genhlth_reref)$r.squared, 4),
  "Residual SE",           round(summary(mod_genhlth_full)$sigma, 3), round(summary(mod_genhlth_reref)$sigma, 3)
) |>
  kable(caption = "Table 4b. Continuous Variable Coefficients: Both Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4b. Continuous Variable Coefficients: Both Reference Groups
Quantity	Ref: Excellent	Ref: Good
Intercept	9.5930	11.4370
Age coefficient	-0.0870	-0.0870
Sex (Female)	1.7250	1.7250
Physical health days	0.2310	0.2310
Sleep hours	-0.5870	-0.5870
R-squared	0.1694	0.1694
Residual SE	7.2080	7.2080

Are they the same? coefficients for age, sex, physical health days, and sleep hours remain identical across both models. Only the intercept and general health coefficients change, as they now reflect comparisons to a different reference group. Model fit statistics (\(R^2\), residual SE) remained unchanged.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_original <- predict(mod_genhlth_full)
pred_releveled <- predict(mod_genhlth_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(round(max(abs(pred_original - pred_releveled)), 10),
            round(cor(pred_original, pred_releveled), 10))
) |>
  kable(caption = "Table 4c. Verification: Predictions Are Identical Across Reference Groups") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 4c. Verification: Predictions Are Identical Across Reference Groups
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

Ans: Why predictions are identical: The choice of reference group only re-parameterizes the model, affecting coefficient interpretation but not the underlying fit. The fitted regression surface remains unchanged because identical predictor values map to the same predicted outcomes, regardless of how the intercept and dummy coefficients are parameterized. —

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

mod_reduced_gh <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
                     data = brfss_dv)

tribble(
  ~Model,         ~R2,                                          ~`Adjusted R2`,
  "Reduced (no gen_health)", round(summary(mod_reduced_gh)$r.squared, 4),
                             round(summary(mod_reduced_gh)$adj.r.squared, 4),
  "Full (with gen_health)",  round(summary(mod_genhlth_full)$r.squared, 4),
                             round(summary(mod_genhlth_full)$adj.r.squared, 4)
) |>
  kable(caption = "Table 5a. Model Fit: Reduced vs. Full Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5a. Model Fit: Reduced vs. Full Model
Model	R2	Adjusted R2
Reduced (no gen_health)	0.1522	0.1515
Full (with gen_health)	0.1694	0.1681

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

Null hypothesis: \(H_0: \beta_{\text{Very good}} = \beta_{\text{Good}} = \beta_{\text{Fair}} = \beta_{\text{Poor}} = 0\) (general health as a whole adds no information beyond the other predictors)

Alternative hypothesis: \(H_A:\) At least one general health coefficient \(\neq 0\)

f_test_gh <- anova(mod_reduced_gh, mod_genhlth_full)

f_test_gh |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Table 5b. Partial F-Test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5b. Partial F-Test: Does General Health Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0

f_val <- round(f_test_gh$F[2], 3)
p_val <- f_test_gh$`Pr(>F)`[2]
df1   <- f_test_gh$Df[2]
df2   <- f_test_gh$Res.Df[2]

cat("F-statistic:", f_val, "\n")

## F-statistic: 25.884

cat("Numerator df:", df1, "\n")

## Numerator df: 4

cat("Denominator df:", df2, "\n")

## Denominator df: 4991

cat("p-value:", format.pval(p_val, digits = 4), "\n")

## p-value: < 2.2e-16

Conclusion: The partial F-test yields \(F(4, 4991) = 25.884\), \(p < 0.001\). We reject \(H_0\) and conclude that general health status as a whole adds statistically significant predictive information for mentally unhealthy days, beyond what is already explained by age, sex, physical health days, and sleep hours.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(mod_genhlth_full, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Table 5c. Type III ANOVA — Full General Health Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5c. Type III ANOVA — Full General Health Model
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

Consistency check: The F-statistic for gen_health in the Type III table is identical to the partial F-test reported in 5b, as both evaluate the joint contribution of general health after controlling for all other covariates. Consequently, they yield the same test statistic and p-value. Type III sums of squares are particularly appropriate for unbalanced observational data, as they adjust for all other predictors independent of model entry order

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

Ans: Adults who rate their general health as poor report approximately 5.335 more mentally unhealthy days in the past month compared to those reporting excellent health, even after adjusting for age, sex, physical health burden, and sleep. A clear gradient is observed across all health categories, with individuals reporting fair, good, and very good health experiencing progressively fewer mentally unhealthy days as self-rated health improves. These findings indicate that individuals with the poorest perceived health bear a disproportionately higher burden of poor mental health. However, because the data are cross-sectional, causality cannot be established—poor general health may contribute to poor mental health, poor mental health may influence health perceptions, or both may reflect a shared underlying condition.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

# Compare R² for education model vs. general health model (both adjusted)
tribble(
  ~Model,                 ~`R²`,                                       ~`Adjusted R²`,           ~`Residual SE`,
  "Education model",      round(summary(mod_genhlth_full)$r.squared, 4),        round(summary(mod_genhlth_full)$adj.r.squared, 4),       round(summary(mod_genhlth_full)$sigma, 3),
  "General health model", round(summary(mod_genhlth_full)$r.squared, 4), round(summary(mod_genhlth_full)$adj.r.squared, 4), round(summary(mod_genhlth_full)$sigma, 3)
) |>
  kable(caption = "Table 6b. Education vs. General Health: Model Fit Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6b. Education vs. General Health: Model Fit Comparison
Model	R²	Adjusted R²	Residual SE
Education model	0.1694	0.1681	7.208
General health model	0.1694	0.1681	7.208

Discussion:he model including general health accounts for greater variability in mentally unhealthy days than the education model, as evidenced by a higher \(R^2\) and lower residual standard error. This is consistent with theory, as self-rated general health serves as a proximal indicator of current health status and is more directly associated with mental health outcomes, whereas education functions as a distal social determinant operating through multiple intermediate mechanisms. In constructing a final model, both variables may warrant inclusion: education as a structural determinant and potential confounder, and general health as a key independent predictor. Model specification should be informed by the research objective, an explicit causal framework (e.g., a DAG), and the conceptual roles of each variable as exposures, confounders, or mediators.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Karo Ekor

March 23, 2026