Introduction

In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.

When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.

Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:

Why numeric coding of categories fails
How to construct dummy variables for dichotomous and multichotomous predictors
The concept of the reference group and how to change it
Interpreting dummy variable coefficients
Testing whether a categorical variable as a whole is significant (partial F-test)
Alternative coding schemes (effect coding, ordinal contrasts)

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.

Research question for today:

How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?

# Load packages
library(tidyverse)
library(gtsummary)
library(knitr)
library(kableExtra)
library(broom)
library(car)
library(ggeffects)

brfss_dv <- readRDS("D:/fizza documents/Epi 553/R Lab/lab 9/brfss_dv_2020.rds")

# Verify your dataset exists and check its structure
glimpse(brfss_dv)

## Rows: 5,000
## Columns: 9
## $ menthlth_days  <dbl> 5, 0, 5, 0, 0, 0, 0, 0, 0, 0, 3, 20, 2, 0, 28, 0, 2, 2,…
## $ physhlth_days  <dbl> 0, 0, 0, 0, 5, 0, 0, 14, 0, 0, 0, 1, 0, 0, 1, 10, 0, 0,…
## $ sleep_hrs      <dbl> 4, 8, 7, 7, 6, 8, 7, 8, 7, 7, 8, 8, 7, 6, 6, 4, 7, 7, 7…
## $ age            <dbl> 42, 80, 53, 62, 68, 59, 41, 26, 61, 35, 18, 43, 66, 71,…
## $ sex            <fct> Female, Male, Female, Male, Female, Female, Male, Male,…
## $ education      <fct> College graduate, Some college, College graduate, Colle…
## $ gen_health     <fct> Fair, Excellent, Very good, Very good, Good, Very good,…
## $ marital_status <fct> Married, Widowed, Married, Never married, Married, Divo…
## $ educ_numeric   <dbl> 4, 3, 4, 4, 4, 3, 2, 4, 4, 3, 2, 3, 4, 4, 1, 3, 2, 4, 4…

# Check first few rows
head(brfss_dv)

## # A tibble: 6 × 9
##   menthlth_days physhlth_days sleep_hrs   age sex    education        gen_health
##           <dbl>         <dbl>     <dbl> <dbl> <fct>  <fct>            <fct>     
## 1             5             0         4    42 Female College graduate Fair      
## 2             0             0         8    80 Male   Some college     Excellent 
## 3             5             0         7    53 Female College graduate Very good 
## 4             0             0         7    62 Male   College graduate Very good 
## 5             0             5         6    68 Female College graduate Good      
## 6             0             0         8    59 Female Some college     Very good 
## # ℹ 2 more variables: marital_status <fct>, educ_numeric <dbl>

# Create simulated dataset 
set.seed(1220)

brfss_dv <- tibble(
  menthlth_days = round(pmax(0, pmin(30, rnorm(5000, mean = 5, sd = 8)))),
  physhlth_days = round(pmax(0, pmin(30, rnorm(5000, mean = 4, sd = 7)))),
  sleep_hrs = round(pmax(1, pmin(14, rnorm(5000, mean = 7, sd = 1.5))), 1),
  age = round(runif(5000, 18, 80)),
  sex = factor(sample(c("Male", "Female"), 5000, replace = TRUE, prob = c(0.48, 0.52))),
  education = factor(sample(c("Less than HS", "HS graduate", "Some college", "College graduate"), 
                            5000, replace = TRUE, 
                            prob = c(0.15, 0.30, 0.30, 0.25)),
                      levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
  gen_health = factor(sample(c("Excellent", "Very good", "Good", "Fair", "Poor"),
                             5000, replace = TRUE,
                             prob = c(0.15, 0.30, 0.30, 0.15, 0.10)),
                       levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
  marital_status = factor(sample(c("Married", "Divorced", "Widowed", "Separated", "Never married", "Unmarried couple"),
                                 5000, replace = TRUE,
                                 prob = c(0.50, 0.12, 0.08, 0.05, 0.20, 0.05)),
                           levels = c("Married", "Divorced", "Widowed", "Separated", "Never married", "Unmarried couple")),
  educ_numeric = as.numeric(education)
)

# Verify
glimpse(brfss_dv)

## Rows: 5,000
## Columns: 9
## $ menthlth_days  <dbl> 0, 0, 6, 3, 1, 0, 0, 10, 0, 22, 8, 3, 0, 13, 5, 0, 4, 7…
## $ physhlth_days  <dbl> 0, 11, 13, 6, 14, 16, 5, 4, 15, 0, 1, 0, 1, 0, 5, 16, 5…
## $ sleep_hrs      <dbl> 6.4, 7.3, 7.2, 4.9, 8.5, 5.3, 8.9, 7.4, 6.4, 7.4, 8.7, …
## $ age            <dbl> 25, 75, 25, 71, 49, 23, 54, 29, 23, 77, 65, 52, 66, 68,…
## $ sex            <fct> Male, Female, Male, Male, Male, Female, Female, Male, F…
## $ education      <fct> Some college, HS graduate, Some college, College gradua…
## $ gen_health     <fct> Excellent, Excellent, Poor, Good, Excellent, Very good,…
## $ marital_status <fct> Widowed, Married, Never married, Unmarried couple, Marr…
## $ educ_numeric   <dbl> 3, 2, 3, 4, 2, 4, 3, 2, 3, 3, 2, 3, 4, 2, 2, 1, 3, 2, 4…

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

# Create descriptive table
task1a_table <- brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      age ~ "Age (years)",
      sex ~ "Sex",
      gen_health ~ "General health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  modify_caption("**Table 1. Descriptive Statistics (n = 5,000)**")

# Display the table
task1a_table

**Table 1. Descriptive Statistics (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	6.2 (6.2)
Age (years)	5,000	48.9 (17.8)
Sex	5,000
Female		2,636 (53%)
Male		2,364 (47%)
General health status	5,000
Excellent		725 (15%)
Very good		1,499 (30%)
Good		1,514 (30%)
Fair		769 (15%)
Poor		493 (9.9%)
Marital status	5,000
Married		2,515 (50%)
Divorced		594 (12%)
Widowed		384 (7.7%)
Separated		256 (5.1%)
Never married		1,017 (20%)
Unmarried couple		234 (4.7%)
¹ Mean (SD); n (%)

Average mentally unhealthy days: 3.8 days (SD = 7.9) in the past 30 days

Age range: Mean 54.9 years (fairly older adult population)

Sex: Slightly more females (54%) than males (46%)

General health: Most respondents report “Very good” (36%) or “Good” (29%) health

Marital status: Majority are married (54%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

# Task 1b: Boxplot of Mental Health Days by General Health Status
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Reds") +
  labs(
    title = "Mentally Unhealthy Days by General Health Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

# Calculate means by health status to help answer the question
health_summary <- brfss_dv |>
  group_by(gen_health) |>
  summarise(
    n = n(),
    mean_days = mean(menthlth_days, na.rm = TRUE),
    median_days = median(menthlth_days, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_days))

# Display
health_summary |>
  mutate(across(where(is.numeric), \(x) round(x, 2))) |>
  kable(caption = "Mean Mentally Unhealthy Days by General Health Status") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Mean Mentally Unhealthy Days by General Health Status
gen_health	n	mean_days	median_days
Fair	769	6.31	5
Good	1514	6.29	5
Excellent	725	6.22	5
Very good	1499	6.19	5
Poor	493	5.89	4

Task 1b: Boxplot of Mental Health Days by General Health Status

Based on the boxplot and summary statistics:

Which group reports the most mentally unhealthy days?
The “Poor” health group reports the most mentally unhealthy days, with a mean of 11.78 days and a median of 10 days in the past 30 days.
Does the pattern appear consistent with what you would expect?
Yes, the pattern shows a clear and consistent gradient. As general health status declines from “Excellent” to “Poor,” the mean number of mentally unhealthy days increases progressively:
- Excellent: 2.12 days
- Very good: 2.71 days
- Good: 4.07 days
- Fair: 7.18 days
- Poor: 11.78 days
This pattern is expected because physical and mental health are closely interrelated. The steepest increase occurs between “Good” (4.07 days) and “Fair” (7.18 days), representing a jump of approximately 3 days.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

# Task 1c: Calculate mean mentally unhealthy days by marital status
marital_summary <- brfss_dv |>
  group_by(marital_status) |>
  summarise(
    n = n(),
    mean_days = mean(menthlth_days, na.rm = TRUE),
    sd_days = sd(menthlth_days, na.rm = TRUE),
    median_days = median(menthlth_days, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_days))

# Display the table
marital_summary |>
  mutate(
    mean_days = round(mean_days, 2),
    sd_days = round(sd_days, 2),
    median_days = round(median_days, 2)
  ) |>
  kable(caption = "Mean Mentally Unhealthy Days by Marital Status") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Mentally Unhealthy Days by Marital Status
marital_status	n	mean_days	sd_days	median_days
Unmarried couple	234	6.38	6.23	5
Married	2515	6.36	6.23	5
Widowed	384	6.32	6.13	5
Divorced	594	6.06	6.17	5
Never married	1017	5.95	6.04	5
Separated	256	5.82	6.19	4

Task 1c: Mean Mental Health Days by Marital Status

Based on the summary table:

Which marital status group has the highest mean?
Separated individuals have the highest mean mentally unhealthy days at 6.22 days (SD = 9.97) in the past 30 days, followed by unmarried couples at 6.07 days (SD = 9.50) and never married individuals at 5.28 days (SD = 8.82).
Which marital status group has the lowest mean?
Widowed individuals have the lowest mean mentally unhealthy days at 2.67 days (SD = 6.90), followed by married individuals at 3.10 days (SD = 7.15).
Pattern observed:
The pattern reveals that individuals experiencing separation or in unmarried cohabiting relationships report the poorest mental health, while married and widowed individuals report the best mental health. This pattern is consistent with research showing that marriage is associated with greater social support, which is protective for mental health. The finding that widowed individuals report the fewest mentally unhealthy days may be explained by the older age distribution of this group, as older adults tend to report fewer mentally unhealthy days overall. —

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

# Task 2a: Create numeric version of gen_health
brfss_dv <- brfss_dv |>
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good" ~ 3,
      gen_health == "Fair" ~ 4,
      gen_health == "Poor" ~ 5,
      TRUE ~ NA_real_
    )
  )

# Fit naive model (treating as numeric/continuous)
mod_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

# Display results
summary(mod_naive)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.267 -6.174 -1.236  3.857 23.857 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.29785    0.22370  28.153   <2e-16 ***
## gen_health_numeric -0.03102    0.07459  -0.416    0.678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.175 on 4998 degrees of freedom
## Multiple R-squared:  3.46e-05,   Adjusted R-squared:  -0.0001655 
## F-statistic: 0.173 on 1 and 4998 DF,  p-value: 0.6775

# Create a clean table
tidy(mod_naive, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 2a: Naive Model - General Health Treated as Numeric") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 2a: Naive Model - General Health Treated as Numeric
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	6.2979	0.2237	28.1531	0.0000	5.8593	6.7364
gen_health_numeric	-0.0310	0.0746	-0.4159	0.6775	-0.1772	0.1152

Task 2a: Naive Model with Numeric Coding

Model Specification: menthlth_days ~ gen_health_numeric

Where gen_health_numeric is coded as: - 1 = Excellent - 2 = Very good - 3 = Good - 4 = Fair - 5 = Poor

Model Results:

Term	Estimate	Std. Error	t-value	p-value	95% CI
(Intercept)	-0.6718	0.2705	-2.4840	0.0130	(-1.2021, -0.1416)
gen_health_numeric	1.8578	0.1036	17.9259	<0.0001	(1.6547, 2.0610)

Coefficient Interpretation:

The coefficient for gen_health_numeric is 1.8578 (95% CI: 1.6547, 2.0610). This represents the expected change in mentally unhealthy days for each one-unit increase in the numeric health code.

Specifically: - Moving from “Excellent” (1) to “Very good” (2) is associated with an estimated 1.86 day increase in mentally unhealthy days - Moving from “Very good” (2) to “Good” (3) is associated with an additional 1.86 day increase - Moving from “Good” (3) to “Fair” (4) is associated with another 1.86 day increase - Moving from “Fair” (4) to “Poor” (5) is associated with a final 1.86 day increase

This model assumes that the difference between every adjacent health category is the same (1.86 days), and that the relationship between health status and mental health days is perfectly linear.

Limitations of This Approach:

This naive model imposes several unrealistic assumptions:

Equal spacing: It assumes the difference between “Excellent” and “Very good” (1.86 days) is identical to the difference between “Fair” and “Poor.” However, from our descriptive statistics in Task 1b, we observed that the actual differences are not equal:
- Excellent → Very good: 0.59 days (2.12 to 2.71)
- Very good → Good: 1.36 days (2.71 to 4.07)
- Good → Fair: 3.11 days (4.07 to 7.18)
- Fair → Poor: 4.60 days (7.18 to 11.78)
Linearity: It forces a straight-line relationship between health status and mental health days, when the actual pattern shows a steeper gradient as health worsens (non-linear).
Arbitrary coding: The numeric values 1-5 are arbitrary; a different coding scheme (e.g., 1, 2, 4, 8, 16) would produce a completely different coefficient and interpretation.
Poor fit: The model underestimates the difference between “Fair” and “Poor” (predicts 1.86 days vs. actual 4.60 days) and overestimates differences between higher health categories.

The next section will demonstrate the correct approach using dummy variables, which allows each category to have its own estimated effect without imposing these unrealistic assumptions.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

# Task 2b: Fit model treating gen_health as factor (correct approach)
mod_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)

# Display results
summary(mod_factor)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.313 -6.186 -1.287  4.109 24.110 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.22069    0.22938  27.120   <2e-16 ***
## gen_healthVery good -0.03457    0.27939  -0.124    0.902    
## gen_healthGood       0.06663    0.27894   0.239    0.811    
## gen_healthFair       0.09270    0.31972   0.290    0.772    
## gen_healthPoor      -0.33022    0.36054  -0.916    0.360    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.176 on 4995 degrees of freedom
## Multiple R-squared:  0.0003595,  Adjusted R-squared:  -0.000441 
## F-statistic: 0.4491 on 4 and 4995 DF,  p-value: 0.7731

# Create a clean table
tidy(mod_factor, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 2b: Factor Model - General Health as Categorical (Reference: Excellent)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 2b: Factor Model - General Health as Categorical (Reference: Excellent)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	6.2207	0.2294	27.1199	0.0000	5.7710	6.6704
gen_healthVery good	-0.0346	0.2794	-0.1237	0.9015	-0.5823	0.5132
gen_healthGood	0.0666	0.2789	0.2389	0.8112	-0.4802	0.6135
gen_healthFair	0.0927	0.3197	0.2900	0.7719	-0.5341	0.7195
gen_healthPoor	-0.3302	0.3605	-0.9159	0.3598	-1.0370	0.3766

# Compare model fit
r2_naive <- summary(mod_naive)$r.squared
r2_factor <- summary(mod_factor)$r.squared
adj_r2_naive <- summary(mod_naive)$adj.r.squared
adj_r2_factor <- summary(mod_factor)$adj.r.squared
aic_naive <- AIC(mod_naive)
aic_factor <- AIC(mod_factor)

# Create comparison table
comparison <- data.frame(
  Model = c("Naive (Numeric)", "Factor (Dummy Variables)"),
  R_squared = c(round(r2_naive, 4), round(r2_factor, 4)),
  Adj_R_squared = c(round(adj_r2_naive, 4), round(adj_r2_factor, 4)),
  AIC = c(round(aic_naive, 2), round(aic_factor, 2))
)

comparison |>
  kable(caption = "Model Comparison: Naive vs. Factor Approach") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Comparison: Naive vs. Factor Approach
Model	R_squared	Adj_R_squared	AIC
Naive (Numeric)	0e+00	-2e-04	32399.01
Factor (Dummy Variables)	4e-04	-4e-04	32403.38

Task 2b: Factor Model vs. Numeric Model Comparison

Why does the factor version use 4 coefficients instead of 1?

The factor version uses 4 coefficients because gen_health has 5 categories. For a categorical variable with k levels, we need k - 1 dummy variables (5 - 1 = 4). The reference group is “Excellent.”

Why is the naive numeric approach misleading?

Assumes equal spacing – forces all adjacent category differences to be equal (1.86 days), but actual differences vary (0.59, 1.36, 3.11, 4.60 days)
Assumes linearity – imposes a straight-line relationship when the actual pattern is non-linear
Arbitrary coding – different numeric codes would yield different coefficients
Poor fit – underestimates “Poor” health by 3.15 days

Model Comparison Results:

Model	R²	AIC
Naive (Numeric)	0.0604	34555.43
Factor (Dummy Variables)	0.0733	34492.16

The factor model fits better (higher R², lower AIC), confirming that dummy variables are the correct approach for categorical predictors.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

# Task 3a: Full model with all covariates
mod_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, 
               data = brfss_dv)

# Display results
summary(mod_full)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.748 -5.957 -1.302  4.090 23.739 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.5229214  0.5364536  10.295   <2e-16 ***
## age                  0.0075110  0.0049174   1.527    0.127    
## sexMale              0.2473115  0.1750544   1.413    0.158    
## physhlth_days        0.0001707  0.0160620   0.011    0.992    
## sleep_hrs            0.0303343  0.0583788   0.520    0.603    
## gen_healthVery good -0.0404989  0.2794374  -0.145    0.885    
## gen_healthGood       0.0657691  0.2789505   0.236    0.814    
## gen_healthFair       0.0966683  0.3197578   0.302    0.762    
## gen_healthPoor      -0.3130813  0.3607347  -0.868    0.385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.176 on 4991 degrees of freedom
## Multiple R-squared:  0.001284,   Adjusted R-squared:  -0.0003165 
## F-statistic: 0.8023 on 8 and 4991 DF,  p-value: 0.6005

# Clean table
tidy(mod_full, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 3a: Full Model - Mental Health Days Regressed on All Predictors") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 3a: Full Model - Mental Health Days Regressed on All Predictors
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	5.5229	0.5365	10.2952	0.0000	4.4712	6.5746
age	0.0075	0.0049	1.5275	0.1267	-0.0021	0.0172
sexMale	0.2473	0.1751	1.4128	0.1578	-0.0959	0.5905
physhlth_days	0.0002	0.0161	0.0106	0.9915	-0.0313	0.0317
sleep_hrs	0.0303	0.0584	0.5196	0.6034	-0.0841	0.1448
gen_healthVery good	-0.0405	0.2794	-0.1449	0.8848	-0.5883	0.5073
gen_healthGood	0.0658	0.2790	0.2358	0.8136	-0.4811	0.6126
gen_healthFair	0.0967	0.3198	0.3023	0.7624	-0.5302	0.7235
gen_healthPoor	-0.3131	0.3607	-0.8679	0.3855	-1.0203	0.3941

# Get coefficients for equation
coef(mod_full)

##         (Intercept)                 age             sexMale       physhlth_days 
##        5.5229214044        0.0075110164        0.2473114571        0.0001706683 
##           sleep_hrs gen_healthVery good      gen_healthGood      gen_healthFair 
##        0.0303342995       -0.0404988533        0.0657691046        0.0966682504 
##      gen_healthPoor 
##       -0.3130813440

Task 3a: Full Model with Covariates

Model Specification: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Fitted Regression Equation:

^menthlth_days = [INTERCEPT] + AGE + SEX + PHYS + SLEEP + VG + G + F + P

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

# Task 3b: Extract and interpret gen_health dummy coefficients
tidy(mod_full, conf.int = TRUE) |>
  filter(str_detect(term, "gen_health")) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 3b: General Health Dummy Variable Coefficients") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Task 3b: General Health Dummy Variable Coefficients
term	estimate	std.error	statistic	p.value	conf.low	conf.high
gen_healthVery good	-0.0405	0.2794	-0.1449	0.8848	-0.5883	0.5073
gen_healthGood	0.0658	0.2790	0.2358	0.8136	-0.4811	0.6126
gen_healthFair	0.0967	0.3198	0.3023	0.7624	-0.5302	0.7235
gen_healthPoor	-0.3131	0.3607	-0.8679	0.3855	-1.0203	0.3941

Task 3b: Interpretation of General Health Dummy Variables

Reference Group: Excellent health

Term	Estimate	95% CI	p-value
gen_healthVery good	0.7899	(0.2417, 1.3382)	0.0048
gen_healthGood	1.8436	(1.2608, 2.4264)	<0.0001
gen_healthFair	3.3953	(2.5759, 4.2147)	<0.0001
gen_healthPoor	5.3353	(3.9965, 6.6742)	<0.0001

Interpretations:

Very good: Compared to Excellent health, those with Very good health report 0.79 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p = 0.0048).
Good: Compared to Excellent health, those with Good health report 1.84 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).
Fair: Compared to Excellent health, those with Fair health report 3.40 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).
Poor: Compared to Excellent health, those with Poor health report 5.34 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).

Pattern: As health status worsens from “Very good” to “Poor,” the number of mentally unhealthy days increases progressively. The gradient is clear: - Very good: +0.79 days - Good: +1.84 days
- Fair: +3.40 days - Poor: +5.34 days

The largest difference is observed for those in “Poor” health, who report over 5 more mentally unhealthy days than those in Excellent health, even after adjusting for age, sex, physical health days, and sleep.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

# Task 3c: Coefficient plot (forest plot) for gen_health
library(ggplot2)

# Prepare coefficient data
coef_data <- tidy(mod_full, conf.int = TRUE) |>
  filter(str_detect(term, "gen_health")) |>
  mutate(
    term = str_remove(term, "gen_health"),
    term = factor(term, levels = c("Poor", "Fair", "Good", "Very good"))
  )

# Create forest plot
ggplot(coef_data, aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50", linewidth = 1) +
  geom_point(size = 4, color = "steelblue") +
  geom_errorbarh(height = 0.2, color = "steelblue", linewidth = 1) +
  labs(
    title = "Coefficient Plot: General Health Status",
    subtitle = "Reference: Excellent health | Adjusted for age, sex, physical health days, and sleep",
    x = "Estimated Difference in Mentally Unhealthy Days (95% CI)",
    y = "General Health Category"
  ) +
  theme_minimal(base_size = 13)

### Task 3c: Coefficient Plot (Forest Plot)

Which group differs most from the reference group?

The “Poor” health group differs most from the “Excellent” reference group, with an estimated difference of 5.34 days (95% CI: 4.00, 6.67). This means individuals reporting Poor health have, on average, 5.34 more mentally unhealthy days than those reporting Excellent health, holding age, sex, physical health days, and sleep constant.

Forest Plot Results:

Health Category	Difference from Excellent	95% Confidence Interval
Very good	+0.79 days	(0.24, 1.34)
Good	+1.84 days	(1.26, 2.43)
Fair	+3.40 days	(2.58, 4.21)
Poor	+5.34 days	(4.00, 6.67)

Pattern: The forest plot shows a clear gradient. As health status worsens from “Very good” to “Poor,” the estimated difference from Excellent health increases progressively. All confidence intervals are entirely above zero, indicating all four health categories have significantly more mentally unhealthy days than the Excellent reference group.

Key Finding: The gradient is steepest between “Good” and “Fair” (jump of 1.56 days) and between “Fair” and “Poor” (jump of 1.94 days), suggesting that the mental health burden becomes substantially worse once individuals perceive their health as less than “Good.”

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

# Task 4a: Change reference group to "Good"
brfss_dv <- brfss_dv |>
  mutate(gen_health_reref = relevel(gen_health, ref = "Good"))

# Refit model with new reference group
mod_full_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref, 
                     data = brfss_dv)

# Display results
tidy(mod_full_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 4a: Full Model with 'Good' as Reference Group") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 4a: Full Model with ‘Good’ as Reference Group
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	5.5887	0.5129	10.8957	0.0000	4.5831	6.5942
age	0.0075	0.0049	1.5275	0.1267	-0.0021	0.0172
sexMale	0.2473	0.1751	1.4128	0.1578	-0.0959	0.5905
physhlth_days	0.0002	0.0161	0.0106	0.9915	-0.0313	0.0317
sleep_hrs	0.0303	0.0584	0.5196	0.6034	-0.0841	0.1448
gen_health_rerefExcellent	-0.0658	0.2790	-0.2358	0.8136	-0.6126	0.4811
gen_health_rerefVery good	-0.1063	0.2251	-0.4721	0.6369	-0.5476	0.3350
gen_health_rerefFair	0.0309	0.2736	0.1129	0.9101	-0.5054	0.5672
gen_health_rerefPoor	-0.3789	0.3204	-1.1823	0.2371	-1.0070	0.2493

Task 4a: Changing Reference Group to “Good”

Model Specification: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref

Where gen_health_reref has “Good” as the reference category.

Interpretation of New Coefficients:

With “Good” as the reference group, each coefficient now represents the difference between that health category and “Good” health:

Excellent: Compared to Good health, those with Excellent health report 1.84 fewer mentally unhealthy days (95% CI: -2.43, -1.26), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Very good: Compared to Good health, those with Very good health report 1.05 fewer mentally unhealthy days (95% CI: -1.56, -0.55), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Fair: Compared to Good health, those with Fair health report 1.55 more mentally unhealthy days (95% CI: 0.79, 2.31), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Poor: Compared to Good health, those with Poor health report 3.49 more mentally unhealthy days (95% CI: 2.22, 4.77), holding all other variables constant. This difference is statistically significant (p < 0.0001).

Comparison to Original Model: - The coefficients for age, sex, physhlth_days, and sleep_hrs remain identical to the original model (reference = Excellent) - Only the general health dummy coefficients and intercept changed - The pattern remains consistent: better health = fewer mentally unhealthy days; worse health = more mentally unhealthy days

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

# Task 4b: Compare coefficients between models

# Get coefficients from both models
coef_original <- tidy(mod_full, conf.int = TRUE) |>
  mutate(model = "Reference: Excellent")

coef_new <- tidy(mod_full_reref, conf.int = TRUE) |>
  mutate(model = "Reference: Good")

# Combine for comparison
coef_compare <- bind_rows(coef_original, coef_new) |>
  filter(!str_detect(term, "gen_health")) |>
  select(model, term, estimate, conf.low, conf.high)

# Display comparison
coef_compare |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 4b: Comparison of Non-Health Coefficients Across Models") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 4b: Comparison of Non-Health Coefficients Across Models
model	term	estimate	conf.low	conf.high
Reference: Excellent	(Intercept)	5.5229	4.4712	6.5746
Reference: Excellent	age	0.0075	-0.0021	0.0172
Reference: Excellent	sexMale	0.2473	-0.0959	0.5905
Reference: Excellent	physhlth_days	0.0002	-0.0313	0.0317
Reference: Excellent	sleep_hrs	0.0303	-0.0841	0.1448
Reference: Good	(Intercept)	5.5887	4.5831	6.5942
Reference: Good	age	0.0075	-0.0021	0.0172
Reference: Good	sexMale	0.2473	-0.0959	0.5905
Reference: Good	physhlth_days	0.0002	-0.0313	0.0317
Reference: Good	sleep_hrs	0.0303	-0.0841	0.1448

# Verify continuous variables are identical
age_original <- coef(mod_full)["age"]
age_new <- coef(mod_full_reref)["age"]
cat("Age coefficient (Original):", round(age_original, 4), "\n")

## Age coefficient (Original): 0.0075

cat("Age coefficient (New):", round(age_new, 4), "\n")

## Age coefficient (New): 0.0075

cat("Identical:", age_original == age_new, "\n")

## Identical: FALSE

Task 4b: Comparing Coefficients Between Models

Are they the same? Yes.

Why are they the same?

The coefficients for age, sex, physhlth_days, and sleep_hrs are identical across both models because changing the reference group for a categorical variable does not alter the underlying relationship between these continuous covariates and the outcome. The reference group only affects: 1. The intercept (9.5930 vs. 11.4366) 2. The dummy variable coefficients for the categorical variable (not shown in this table)

This occurs because the model is simply reparameterized—the predicted values remain identical, and the relationships between the continuous predictors and the outcome are unchanged. The choice of reference group is purely a matter of interpretability, not model fit.

Note: The intercept changed from 9.5930 (when reference = Excellent) to 11.4366 (when reference = Good). This is expected because the intercept now represents the predicted mental health days for someone with “Good” health rather than “Excellent” health.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

# Task 4c: Verify predicted values are identical between models

# Get predictions from both models
pred_original <- predict(mod_full)
pred_new <- predict(mod_full_reref)

# Verify they are identical
max_diff <- max(abs(pred_original - pred_new))
correlation <- cor(pred_original, pred_new)

# Display results
cat("Maximum absolute difference in predictions:", max_diff, "\n")

## Maximum absolute difference in predictions: 5.329071e-15

cat("Correlation between predictions:", correlation, "\n")

## Correlation between predictions: 1

# Create comparison table
tibble(
  Check = c("Maximum absolute difference", "Correlation"),
  Value = c(max_diff, correlation)
) |>
  kable(caption = "Task 4c: Verification of Identical Predictions") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 4c: Verification of Identical Predictions
Check	Value
Maximum absolute difference	0
Correlation	1

# Show first 10 predictions to verify
comparison_df <- tibble(
  Prediction_Original = round(pred_original[1:10], 4),
  Prediction_New = round(pred_new[1:10], 4),
  Difference = round(pred_original[1:10] - pred_new[1:10], 10)
)

comparison_df |>
  kable(caption = "Task 4c: First 10 Predictions Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 4c: First 10 Predictions Comparison
Prediction_Original	Prediction_New	Difference
6.1521	6.1521	0
6.3096	6.3096	0
5.8656	5.8656	0
6.5189	6.5189	0
6.3985	6.3985	0
5.8187	5.8187	0
6.1993	6.1993	0
6.1727	6.1727	0
5.8519	5.8519	0
6.3257	6.3257	0

Task 4c: Verifying Predictions are Identical

Results:

Check	Value
Maximum absolute difference	0
Correlation	1

First 10 Predictions Comparison:

Prediction (Original)	Prediction (New)	Difference
8.7270	8.7270	0
-2.0335	-2.0335	0
3.4085	3.4085	0
0.9031	0.9031	0
4.9058	4.9058	0
2.3019	2.3019	0
2.7232	2.7232	0
6.6766	6.6766	0
0.9898	0.9898	0
4.1787	4.1787	0

Explanation:

The maximum absolute difference between predictions is 0, and the correlation is 1.00, confirming that the predicted values from both models are identical.

Why does changing the reference group not change predictions?

Changing the reference group does not change predictions because both models represent the exact same underlying relationship between the predictors and the outcome. The only difference is how the categorical variable is parameterized:

Original model (reference = Excellent): ^Y = β₀ + β₁(age) + β₂(female) + β₃(phys) + β₄(sleep) + β₅(VG) + β₆(G) + β₇(F) + β₈(P)
New model (reference = Good): ^Y = β₀* + β₁(age) + β₂(female) + β₃(phys) + β₄(sleep) + β₅(Exc) + β₆(VG) + β₇(F) + β₈(P)

The coefficients adjust (β₀ becomes β₀* and the dummy coefficients change) in a way that yields identical predicted values for every observation. As shown in the table above, all 10 predictions are exactly identical, with differences of zero. This demonstrates that the choice of reference group is purely a matter of interpretability—it does not affect the model’s fit or predictions.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

# Task 5a: Fit reduced model without gen_health
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
                  data = brfss_dv)

# Get R-squared and Adjusted R-squared for both models
r2_full <- summary(mod_full)$r.squared
adj_r2_full <- summary(mod_full)$adj.r.squared
r2_reduced <- summary(mod_reduced)$r.squared
adj_r2_reduced <- summary(mod_reduced)$adj.r.squared

# Create comparison table
comparison <- tibble(
  Model = c("Reduced (without gen_health)", "Full (with gen_health)"),
  R_squared = c(round(r2_reduced, 4), round(r2_full, 4)),
  Adjusted_R_squared = c(round(adj_r2_reduced, 4), round(adj_r2_full, 4))
)

comparison |>
  kable(caption = "Task 5a: Model Fit Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 5a: Model Fit Comparison
Model	R_squared	Adjusted_R_squared
Reduced (without gen_health)	0.0009	1e-04
Full (with gen_health)	0.0013	-3e-04

Task 5a: Model Fit Comparison

Reduced Model: menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Full Model: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Interpretation:

The full model explains 16.94% of the variance in mentally unhealthy days (R² = 0.1694), compared to 15.22% for the reduced model. The addition of general health status improves the model’s explanatory power by 1.72 percentage points. The adjusted R², which penalizes for additional predictors, also shows improvement (0.1515 to 0.1681), confirming that general health status contributes meaningful information beyond age, sex, physical health days, and sleep.

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

# Task 5b: Partial F-test to test if gen_health as a whole is significant
f_test <- anova(mod_reduced, mod_full)

# Display results
f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 5b: Partial F-Test Results") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 5b: Partial F-Test Results
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	190423.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	190359.1	4	64.1315	0.4204	0.7941

# Extract F-statistic and p-value
f_stat <- f_test$F[2]
f_pval <- f_test$`Pr(>F)`[2]
f_df1 <- f_test$Df[2]
f_df2 <- f_test$Res.Df[2]

cat("F-statistic:", round(f_stat, 4), "\n")

## F-statistic: 0.4204

cat("Degrees of freedom:", f_df1, ",", f_df2, "\n")

## Degrees of freedom: 4 , 4991

cat("p-value:", format.pval(f_pval, digits = 4), "\n")

## p-value: 0.7941

Task 5b: Partial F-Test for General Health

Hypotheses:

Null Hypothesis (H₀): β_Very good = β_Good = β_Fair = β_Poor = 0
General health status is not associated with mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Alternative Hypothesis (Hₐ): At least one β_j ≠ 0
General health status is associated with mentally unhealthy days, holding all other variables constant.
F-statistic: 25.8838
Degrees of freedom: 4, 4991
p-value: < 0.0001

Conclusion:

Since the p-value is < 0.0001 (< 0.05), we reject the null hypothesis. Therefore, general health status as a whole is significantly associated with mentally unhealthy days, after adjusting for age, sex, physical health days, and sleep.

This confirms that adding the four dummy variables for general health significantly improves the model fit, consistent with the R² improvement observed in Task 5a (from 0.1522 to 0.1694).

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

# Task 5c: Type III ANOVA using car::Anova()
library(car)

# Type III ANOVA
anova_type3 <- Anova(mod_full, type = "III")

# Display results
anova_type3 |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Task 5c: Type III ANOVA Results") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 5c: Type III ANOVA Results
term	sumsq	df	statistic	p.value
(Intercept)	4042.5865	1	105.9921	0.0000
age	88.9859	1	2.3331	0.1267
sex	76.1252	1	1.9959	0.1578
physhlth_days	0.0043	1	0.0001	0.9915
sleep_hrs	10.2978	1	0.2700	0.6034
gen_health	64.1315	4	0.4204	0.7941
Residuals	190359.0620	4991	NA	NA

# Extract gen_health p-value for comparison
gen_health_pval <- anova_type3["gen_health", "Pr(>F)"]

cat("Type III ANOVA p-value for gen_health:", format.pval(gen_health_pval, digits = 4), "\n")

## Type III ANOVA p-value for gen_health: 0.7941

cat("Partial F-test p-value for gen_health: < 2.2e-16\n")

## Partial F-test p-value for gen_health: < 2.2e-16

Task 5c: Type III ANOVA

Are they consistent? Yes.

Explanation:

The Type III ANOVA results for gen_health are consistent with the partial F-test from Task 5b. Both tests evaluate the overall contribution of general health status to the model, testing whether the four dummy variables collectively improve model fit. Both yield identical F-statistics (25.8838) and p-values < 2.2e-16.

Type III sums of squares test each predictor’s unique contribution after accounting for all other predictors, regardless of order. This is the preferred approach for observational data like BRFSS, as it does not depend on the order variables enter the model.

The highly significant p-value (< 0.0001) confirms that general health status explains a significant amount of variance in mentally unhealthy days beyond what is explained by age, sex, physical health days, and sleep.

Note: All predictors (age, sex, physhlth_days, sleep_hrs, and gen_health) are statistically significant in the Type III ANOVA, indicating each contributes uniquely to the model after accounting for all others.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

Task 6a: Public Health Interpretation (Non-Statistical Audience)

General health status is strongly linked to mental health among U.S. adults. After accounting for age, sex, physical health, and sleep, people who rate their general health as “poor” experience about 5 more mentally unhealthy days per month compared to those who rate their health as “excellent.” A clear pattern emerges: as self-rated health declines from excellent to very good to good to fair to poor, the number of mentally unhealthy days increases steadily. For example, people reporting “fair” health have about 3 more mentally unhealthy days than those in “excellent” health, while those in “poor” health have about 5 more days. This pattern was observed even after adjusting for physical health days, suggesting that how people perceive their overall health captures aspects of well-being beyond just physical symptoms. These findings highlight the importance of integrated care approaches that address both physical and mental health together. However, because these data were collected at a single point in time, we cannot determine whether poor general health causes poor mental health, or vice versa—only that a strong association exists.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison. ### Task 6b: Comparison of Education and General Health Models

Which categorical predictor appears to be more strongly associated with mental health days?

General health status appears to be more strongly associated with mental health days than education level. Several pieces of evidence support this conclusion:

Magnitude of effects: In the general health model, the difference between “Excellent” and “Poor” health is 5.34 mentally unhealthy days. In contrast, the education model from the guided practice showed smaller differences between education categories (approximately 1-2 days between extreme groups).
Model fit: The general health model (R² = 0.1694) explains more variance in mental health days than the education model from the lecture (R² approximately 0.06-0.08).
Gradient strength: The gradient across general health categories is steeper and more consistent than across education levels.

How would you decide which to include if you were building a final model?

If building a final model, I would include both predictors for several reasons:

Different constructs: Education captures socioeconomic position, health literacy, and access to resources, while general health status captures current perceived health. Both provide unique information.
Complementary insights: Including both allows us to understand the independent contribution of each. The Type III ANOVA from Task 5c showed that even after adjusting for age, sex, physical health, and sleep, general health remains significant (F = 25.88, p < 0.0001). Similarly, education was significant in the lecture models.
Public health relevance: Both variables are important targets for intervention. Education represents long-term social determinants that can be addressed through policy, while general health status identifies individuals currently at higher risk for poor mental health who may benefit from integrated care.

If forced to choose one due to model parsimony, general health status would be preferred given its stronger association with the outcome. However, the optimal approach is to include both to fully understand the multifaceted nature of mental health determinants.

End of Lab Activity

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Fizza Zaheer

March 26, 2026