In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
# Load packages
library(tidyverse)
library(gtsummary)
library(knitr)
library(kableExtra)
library(broom)
library(car)
library(ggeffects)
brfss_dv <- readRDS("D:/fizza documents/Epi 553/R Lab/lab 9/brfss_dv_2020.rds")
# Verify your dataset exists and check its structure
glimpse(brfss_dv)## Rows: 5,000
## Columns: 9
## $ menthlth_days <dbl> 5, 0, 5, 0, 0, 0, 0, 0, 0, 0, 3, 20, 2, 0, 28, 0, 2, 2,…
## $ physhlth_days <dbl> 0, 0, 0, 0, 5, 0, 0, 14, 0, 0, 0, 1, 0, 0, 1, 10, 0, 0,…
## $ sleep_hrs <dbl> 4, 8, 7, 7, 6, 8, 7, 8, 7, 7, 8, 8, 7, 6, 6, 4, 7, 7, 7…
## $ age <dbl> 42, 80, 53, 62, 68, 59, 41, 26, 61, 35, 18, 43, 66, 71,…
## $ sex <fct> Female, Male, Female, Male, Female, Female, Male, Male,…
## $ education <fct> College graduate, Some college, College graduate, Colle…
## $ gen_health <fct> Fair, Excellent, Very good, Very good, Good, Very good,…
## $ marital_status <fct> Married, Widowed, Married, Never married, Married, Divo…
## $ educ_numeric <dbl> 4, 3, 4, 4, 4, 3, 2, 4, 4, 3, 2, 3, 4, 4, 1, 3, 2, 4, 4…
## # A tibble: 6 × 9
## menthlth_days physhlth_days sleep_hrs age sex education gen_health
## <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct>
## 1 5 0 4 42 Female College graduate Fair
## 2 0 0 8 80 Male Some college Excellent
## 3 5 0 7 53 Female College graduate Very good
## 4 0 0 7 62 Male College graduate Very good
## 5 0 5 6 68 Female College graduate Good
## 6 0 0 8 59 Female Some college Very good
## # ℹ 2 more variables: marital_status <fct>, educ_numeric <dbl>
# Create simulated dataset
set.seed(1220)
brfss_dv <- tibble(
menthlth_days = round(pmax(0, pmin(30, rnorm(5000, mean = 5, sd = 8)))),
physhlth_days = round(pmax(0, pmin(30, rnorm(5000, mean = 4, sd = 7)))),
sleep_hrs = round(pmax(1, pmin(14, rnorm(5000, mean = 7, sd = 1.5))), 1),
age = round(runif(5000, 18, 80)),
sex = factor(sample(c("Male", "Female"), 5000, replace = TRUE, prob = c(0.48, 0.52))),
education = factor(sample(c("Less than HS", "HS graduate", "Some college", "College graduate"),
5000, replace = TRUE,
prob = c(0.15, 0.30, 0.30, 0.25)),
levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
gen_health = factor(sample(c("Excellent", "Very good", "Good", "Fair", "Poor"),
5000, replace = TRUE,
prob = c(0.15, 0.30, 0.30, 0.15, 0.10)),
levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
marital_status = factor(sample(c("Married", "Divorced", "Widowed", "Separated", "Never married", "Unmarried couple"),
5000, replace = TRUE,
prob = c(0.50, 0.12, 0.08, 0.05, 0.20, 0.05)),
levels = c("Married", "Divorced", "Widowed", "Separated", "Never married", "Unmarried couple")),
educ_numeric = as.numeric(education)
)
# Verify
glimpse(brfss_dv)## Rows: 5,000
## Columns: 9
## $ menthlth_days <dbl> 0, 0, 6, 3, 1, 0, 0, 10, 0, 22, 8, 3, 0, 13, 5, 0, 4, 7…
## $ physhlth_days <dbl> 0, 11, 13, 6, 14, 16, 5, 4, 15, 0, 1, 0, 1, 0, 5, 16, 5…
## $ sleep_hrs <dbl> 6.4, 7.3, 7.2, 4.9, 8.5, 5.3, 8.9, 7.4, 6.4, 7.4, 8.7, …
## $ age <dbl> 25, 75, 25, 71, 49, 23, 54, 29, 23, 77, 65, 52, 66, 68,…
## $ sex <fct> Male, Female, Male, Male, Male, Female, Female, Male, F…
## $ education <fct> Some college, HS graduate, Some college, College gradua…
## $ gen_health <fct> Excellent, Excellent, Poor, Good, Excellent, Very good,…
## $ marital_status <fct> Widowed, Married, Never married, Unmarried couple, Marr…
## $ educ_numeric <dbl> 3, 2, 3, 4, 2, 4, 3, 2, 3, 3, 2, 3, 4, 2, 2, 1, 3, 2, 4…
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
# Create descriptive table
task1a_table <- brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
modify_caption("**Table 1. Descriptive Statistics (n = 5,000)**")
# Display the table
task1a_table| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 6.2 (6.2) |
| Age (years) | 5,000 | 48.9 (17.8) |
| Sex | 5,000 | |
| Female | 2,636 (53%) | |
| Male | 2,364 (47%) | |
| General health status | 5,000 | |
| Excellent | 725 (15%) | |
| Very good | 1,499 (30%) | |
| Good | 1,514 (30%) | |
| Fair | 769 (15%) | |
| Poor | 493 (9.9%) | |
| Marital status | 5,000 | |
| Married | 2,515 (50%) | |
| Divorced | 594 (12%) | |
| Widowed | 384 (7.7%) | |
| Separated | 256 (5.1%) | |
| Never married | 1,017 (20%) | |
| Unmarried couple | 234 (4.7%) | |
| 1 Mean (SD); n (%) | ||
Average mentally unhealthy days: 3.8 days (SD = 7.9) in the past 30 days
Age range: Mean 54.9 years (fairly older adult population)
Sex: Slightly more females (54%) than males (46%)
General health: Most respondents report “Very good” (36%) or “Good” (29%) health
Marital status: Majority are married (54%)
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
# Task 1b: Boxplot of Mental Health Days by General Health Status
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Reds") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")# Calculate means by health status to help answer the question
health_summary <- brfss_dv |>
group_by(gen_health) |>
summarise(
n = n(),
mean_days = mean(menthlth_days, na.rm = TRUE),
median_days = median(menthlth_days, na.rm = TRUE),
.groups = "drop"
) |>
arrange(desc(mean_days))
# Display
health_summary |>
mutate(across(where(is.numeric), \(x) round(x, 2))) |>
kable(caption = "Mean Mentally Unhealthy Days by General Health Status") |>
kable_styling(bootstrap_options = c("striped", "hover"))| gen_health | n | mean_days | median_days |
|---|---|---|---|
| Fair | 769 | 6.31 | 5 |
| Good | 1514 | 6.29 | 5 |
| Excellent | 725 | 6.22 | 5 |
| Very good | 1499 | 6.19 | 5 |
| Poor | 493 | 5.89 | 4 |
Based on the boxplot and summary statistics:
Which group reports the most mentally unhealthy
days?
The “Poor” health group reports the most mentally unhealthy days, with a
mean of 11.78 days and a median of 10 days in the past 30 days.
Does the pattern appear consistent with what you would
expect?
Yes, the pattern shows a clear and consistent gradient. As general
health status declines from “Excellent” to “Poor,” the mean number of
mentally unhealthy days increases progressively:
This pattern is expected because physical and mental health are closely interrelated. The steepest increase occurs between “Good” (4.07 days) and “Fair” (7.18 days), representing a jump of approximately 3 days.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
# Task 1c: Calculate mean mentally unhealthy days by marital status
marital_summary <- brfss_dv |>
group_by(marital_status) |>
summarise(
n = n(),
mean_days = mean(menthlth_days, na.rm = TRUE),
sd_days = sd(menthlth_days, na.rm = TRUE),
median_days = median(menthlth_days, na.rm = TRUE),
.groups = "drop"
) |>
arrange(desc(mean_days))
# Display the table
marital_summary |>
mutate(
mean_days = round(mean_days, 2),
sd_days = round(sd_days, 2),
median_days = round(median_days, 2)
) |>
kable(caption = "Mean Mentally Unhealthy Days by Marital Status") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| marital_status | n | mean_days | sd_days | median_days |
|---|---|---|---|---|
| Unmarried couple | 234 | 6.38 | 6.23 | 5 |
| Married | 2515 | 6.36 | 6.23 | 5 |
| Widowed | 384 | 6.32 | 6.13 | 5 |
| Divorced | 594 | 6.06 | 6.17 | 5 |
| Never married | 1017 | 5.95 | 6.04 | 5 |
| Separated | 256 | 5.82 | 6.19 | 4 |
Based on the summary table:
Which marital status group has the highest
mean?
Separated individuals have the highest mean mentally unhealthy days at
6.22 days (SD = 9.97) in the past 30 days, followed by unmarried couples
at 6.07 days (SD = 9.50) and never married individuals at 5.28 days (SD
= 8.82).
Which marital status group has the lowest
mean?
Widowed individuals have the lowest mean mentally unhealthy days at 2.67
days (SD = 6.90), followed by married individuals at 3.10 days (SD =
7.15).
Pattern observed:
The pattern reveals that individuals experiencing separation or in
unmarried cohabiting relationships report the poorest mental health,
while married and widowed individuals report the best mental health.
This pattern is consistent with research showing that marriage is
associated with greater social support, which is protective for mental
health. The finding that widowed individuals report the fewest mentally
unhealthy days may be explained by the older age distribution of this
group, as older adults tend to report fewer mentally unhealthy days
overall. —
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
# Task 2a: Create numeric version of gen_health
brfss_dv <- brfss_dv |>
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_
)
)
# Fit naive model (treating as numeric/continuous)
mod_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
# Display results
summary(mod_naive)##
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.267 -6.174 -1.236 3.857 23.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.29785 0.22370 28.153 <2e-16 ***
## gen_health_numeric -0.03102 0.07459 -0.416 0.678
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.175 on 4998 degrees of freedom
## Multiple R-squared: 3.46e-05, Adjusted R-squared: -0.0001655
## F-statistic: 0.173 on 1 and 4998 DF, p-value: 0.6775
# Create a clean table
tidy(mod_naive, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 2a: Naive Model - General Health Treated as Numeric") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6.2979 | 0.2237 | 28.1531 | 0.0000 | 5.8593 | 6.7364 |
| gen_health_numeric | -0.0310 | 0.0746 | -0.4159 | 0.6775 | -0.1772 | 0.1152 |
Model Specification:
menthlth_days ~ gen_health_numeric
Where gen_health_numeric is coded as: - 1 = Excellent -
2 = Very good - 3 = Good - 4 = Fair - 5 = Poor
Model Results:
| Term | Estimate | Std. Error | t-value | p-value | 95% CI |
|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.0130 | (-1.2021, -0.1416) |
| gen_health_numeric | 1.8578 | 0.1036 | 17.9259 | <0.0001 | (1.6547, 2.0610) |
Coefficient Interpretation:
The coefficient for gen_health_numeric is
1.8578 (95% CI: 1.6547, 2.0610). This represents the
expected change in mentally unhealthy days for each one-unit increase in
the numeric health code.
Specifically: - Moving from “Excellent” (1) to “Very good” (2) is associated with an estimated 1.86 day increase in mentally unhealthy days - Moving from “Very good” (2) to “Good” (3) is associated with an additional 1.86 day increase - Moving from “Good” (3) to “Fair” (4) is associated with another 1.86 day increase - Moving from “Fair” (4) to “Poor” (5) is associated with a final 1.86 day increase
This model assumes that the difference between every adjacent health category is the same (1.86 days), and that the relationship between health status and mental health days is perfectly linear.
Limitations of This Approach:
This naive model imposes several unrealistic assumptions:
Equal spacing: It assumes the difference between “Excellent” and “Very good” (1.86 days) is identical to the difference between “Fair” and “Poor.” However, from our descriptive statistics in Task 1b, we observed that the actual differences are not equal:
Linearity: It forces a straight-line relationship between health status and mental health days, when the actual pattern shows a steeper gradient as health worsens (non-linear).
Arbitrary coding: The numeric values 1-5 are arbitrary; a different coding scheme (e.g., 1, 2, 4, 8, 16) would produce a completely different coefficient and interpretation.
Poor fit: The model underestimates the difference between “Fair” and “Poor” (predicts 1.86 days vs. actual 4.60 days) and overestimates differences between higher health categories.
The next section will demonstrate the correct approach using dummy variables, which allows each category to have its own estimated effect without imposing these unrealistic assumptions.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
# Task 2b: Fit model treating gen_health as factor (correct approach)
mod_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)
# Display results
summary(mod_factor)##
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.313 -6.186 -1.287 4.109 24.110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.22069 0.22938 27.120 <2e-16 ***
## gen_healthVery good -0.03457 0.27939 -0.124 0.902
## gen_healthGood 0.06663 0.27894 0.239 0.811
## gen_healthFair 0.09270 0.31972 0.290 0.772
## gen_healthPoor -0.33022 0.36054 -0.916 0.360
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.176 on 4995 degrees of freedom
## Multiple R-squared: 0.0003595, Adjusted R-squared: -0.000441
## F-statistic: 0.4491 on 4 and 4995 DF, p-value: 0.7731
# Create a clean table
tidy(mod_factor, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 2b: Factor Model - General Health as Categorical (Reference: Excellent)") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6.2207 | 0.2294 | 27.1199 | 0.0000 | 5.7710 | 6.6704 |
| gen_healthVery good | -0.0346 | 0.2794 | -0.1237 | 0.9015 | -0.5823 | 0.5132 |
| gen_healthGood | 0.0666 | 0.2789 | 0.2389 | 0.8112 | -0.4802 | 0.6135 |
| gen_healthFair | 0.0927 | 0.3197 | 0.2900 | 0.7719 | -0.5341 | 0.7195 |
| gen_healthPoor | -0.3302 | 0.3605 | -0.9159 | 0.3598 | -1.0370 | 0.3766 |
# Compare model fit
r2_naive <- summary(mod_naive)$r.squared
r2_factor <- summary(mod_factor)$r.squared
adj_r2_naive <- summary(mod_naive)$adj.r.squared
adj_r2_factor <- summary(mod_factor)$adj.r.squared
aic_naive <- AIC(mod_naive)
aic_factor <- AIC(mod_factor)
# Create comparison table
comparison <- data.frame(
Model = c("Naive (Numeric)", "Factor (Dummy Variables)"),
R_squared = c(round(r2_naive, 4), round(r2_factor, 4)),
Adj_R_squared = c(round(adj_r2_naive, 4), round(adj_r2_factor, 4)),
AIC = c(round(aic_naive, 2), round(aic_factor, 2))
)
comparison |>
kable(caption = "Model Comparison: Naive vs. Factor Approach") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | R_squared | Adj_R_squared | AIC |
|---|---|---|---|
| Naive (Numeric) | 0e+00 | -2e-04 | 32399.01 |
| Factor (Dummy Variables) | 4e-04 | -4e-04 | 32403.38 |
Why does the factor version use 4 coefficients instead of 1?
The factor version uses 4 coefficients because
gen_health has 5 categories. For a categorical variable
with k levels, we need k - 1 dummy variables
(5 - 1 = 4). The reference group is “Excellent.”
Why is the naive numeric approach misleading?
Model Comparison Results:
| Model | R² | AIC |
|---|---|---|
| Naive (Numeric) | 0.0604 | 34555.43 |
| Factor (Dummy Variables) | 0.0733 | 34492.16 |
The factor model fits better (higher R², lower AIC), confirming that dummy variables are the correct approach for categorical predictors.
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
# Task 3a: Full model with all covariates
mod_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
# Display results
summary(mod_full)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.748 -5.957 -1.302 4.090 23.739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229214 0.5364536 10.295 <2e-16 ***
## age 0.0075110 0.0049174 1.527 0.127
## sexMale 0.2473115 0.1750544 1.413 0.158
## physhlth_days 0.0001707 0.0160620 0.011 0.992
## sleep_hrs 0.0303343 0.0583788 0.520 0.603
## gen_healthVery good -0.0404989 0.2794374 -0.145 0.885
## gen_healthGood 0.0657691 0.2789505 0.236 0.814
## gen_healthFair 0.0966683 0.3197578 0.302 0.762
## gen_healthPoor -0.3130813 0.3607347 -0.868 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.176 on 4991 degrees of freedom
## Multiple R-squared: 0.001284, Adjusted R-squared: -0.0003165
## F-statistic: 0.8023 on 8 and 4991 DF, p-value: 0.6005
# Clean table
tidy(mod_full, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 3a: Full Model - Mental Health Days Regressed on All Predictors") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 5.5229 | 0.5365 | 10.2952 | 0.0000 | 4.4712 | 6.5746 |
| age | 0.0075 | 0.0049 | 1.5275 | 0.1267 | -0.0021 | 0.0172 |
| sexMale | 0.2473 | 0.1751 | 1.4128 | 0.1578 | -0.0959 | 0.5905 |
| physhlth_days | 0.0002 | 0.0161 | 0.0106 | 0.9915 | -0.0313 | 0.0317 |
| sleep_hrs | 0.0303 | 0.0584 | 0.5196 | 0.6034 | -0.0841 | 0.1448 |
| gen_healthVery good | -0.0405 | 0.2794 | -0.1449 | 0.8848 | -0.5883 | 0.5073 |
| gen_healthGood | 0.0658 | 0.2790 | 0.2358 | 0.8136 | -0.4811 | 0.6126 |
| gen_healthFair | 0.0967 | 0.3198 | 0.3023 | 0.7624 | -0.5302 | 0.7235 |
| gen_healthPoor | -0.3131 | 0.3607 | -0.8679 | 0.3855 | -1.0203 | 0.3941 |
## (Intercept) age sexMale physhlth_days
## 5.5229214044 0.0075110164 0.2473114571 0.0001706683
## sleep_hrs gen_healthVery good gen_healthGood gen_healthFair
## 0.0303342995 -0.0404988533 0.0657691046 0.0966682504
## gen_healthPoor
## -0.3130813440
Model Specification:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Fitted Regression Equation:
^menthlth_days = [INTERCEPT] + AGE + SEX + PHYS + SLEEP + VG + G + F + P
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
# Task 3b: Extract and interpret gen_health dummy coefficients
tidy(mod_full, conf.int = TRUE) |>
filter(str_detect(term, "gen_health")) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 3b: General Health Dummy Variable Coefficients") |>
kable_styling(bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| gen_healthVery good | -0.0405 | 0.2794 | -0.1449 | 0.8848 | -0.5883 | 0.5073 |
| gen_healthGood | 0.0658 | 0.2790 | 0.2358 | 0.8136 | -0.4811 | 0.6126 |
| gen_healthFair | 0.0967 | 0.3198 | 0.3023 | 0.7624 | -0.5302 | 0.7235 |
| gen_healthPoor | -0.3131 | 0.3607 | -0.8679 | 0.3855 | -1.0203 | 0.3941 |
Reference Group: Excellent health
| Term | Estimate | 95% CI | p-value |
|---|---|---|---|
| gen_healthVery good | 0.7899 | (0.2417, 1.3382) | 0.0048 |
| gen_healthGood | 1.8436 | (1.2608, 2.4264) | <0.0001 |
| gen_healthFair | 3.3953 | (2.5759, 4.2147) | <0.0001 |
| gen_healthPoor | 5.3353 | (3.9965, 6.6742) | <0.0001 |
Interpretations:
Very good: Compared to Excellent health, those with Very good health report 0.79 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p = 0.0048).
Good: Compared to Excellent health, those with Good health report 1.84 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).
Fair: Compared to Excellent health, those with Fair health report 3.40 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).
Poor: Compared to Excellent health, those with Poor health report 5.34 more mentally unhealthy days, holding all other variables constant. This difference is statistically significant (p < 0.0001).
Pattern: As health status worsens from “Very good”
to “Poor,” the number of mentally unhealthy days increases
progressively. The gradient is clear: - Very good: +0.79 days - Good:
+1.84 days
- Fair: +3.40 days - Poor: +5.34 days
The largest difference is observed for those in “Poor” health, who report over 5 more mentally unhealthy days than those in Excellent health, even after adjusting for age, sex, physical health days, and sleep.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
# Task 3c: Coefficient plot (forest plot) for gen_health
library(ggplot2)
# Prepare coefficient data
coef_data <- tidy(mod_full, conf.int = TRUE) |>
filter(str_detect(term, "gen_health")) |>
mutate(
term = str_remove(term, "gen_health"),
term = factor(term, levels = c("Poor", "Fair", "Good", "Very good"))
)
# Create forest plot
ggplot(coef_data, aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50", linewidth = 1) +
geom_point(size = 4, color = "steelblue") +
geom_errorbarh(height = 0.2, color = "steelblue", linewidth = 1) +
labs(
title = "Coefficient Plot: General Health Status",
subtitle = "Reference: Excellent health | Adjusted for age, sex, physical health days, and sleep",
x = "Estimated Difference in Mentally Unhealthy Days (95% CI)",
y = "General Health Category"
) +
theme_minimal(base_size = 13)
### Task 3c: Coefficient Plot (Forest Plot)
Which group differs most from the reference group?
The “Poor” health group differs most from the “Excellent” reference group, with an estimated difference of 5.34 days (95% CI: 4.00, 6.67). This means individuals reporting Poor health have, on average, 5.34 more mentally unhealthy days than those reporting Excellent health, holding age, sex, physical health days, and sleep constant.
Forest Plot Results:
| Health Category | Difference from Excellent | 95% Confidence Interval |
|---|---|---|
| Very good | +0.79 days | (0.24, 1.34) |
| Good | +1.84 days | (1.26, 2.43) |
| Fair | +3.40 days | (2.58, 4.21) |
| Poor | +5.34 days | (4.00, 6.67) |
Pattern: The forest plot shows a clear gradient. As health status worsens from “Very good” to “Poor,” the estimated difference from Excellent health increases progressively. All confidence intervals are entirely above zero, indicating all four health categories have significantly more mentally unhealthy days than the Excellent reference group.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
# Task 4a: Change reference group to "Good"
brfss_dv <- brfss_dv |>
mutate(gen_health_reref = relevel(gen_health, ref = "Good"))
# Refit model with new reference group
mod_full_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref,
data = brfss_dv)
# Display results
tidy(mod_full_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 4a: Full Model with 'Good' as Reference Group") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 5.5887 | 0.5129 | 10.8957 | 0.0000 | 4.5831 | 6.5942 |
| age | 0.0075 | 0.0049 | 1.5275 | 0.1267 | -0.0021 | 0.0172 |
| sexMale | 0.2473 | 0.1751 | 1.4128 | 0.1578 | -0.0959 | 0.5905 |
| physhlth_days | 0.0002 | 0.0161 | 0.0106 | 0.9915 | -0.0313 | 0.0317 |
| sleep_hrs | 0.0303 | 0.0584 | 0.5196 | 0.6034 | -0.0841 | 0.1448 |
| gen_health_rerefExcellent | -0.0658 | 0.2790 | -0.2358 | 0.8136 | -0.6126 | 0.4811 |
| gen_health_rerefVery good | -0.1063 | 0.2251 | -0.4721 | 0.6369 | -0.5476 | 0.3350 |
| gen_health_rerefFair | 0.0309 | 0.2736 | 0.1129 | 0.9101 | -0.5054 | 0.5672 |
| gen_health_rerefPoor | -0.3789 | 0.3204 | -1.1823 | 0.2371 | -1.0070 | 0.2493 |
Model Specification:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref
Where gen_health_reref has “Good” as
the reference category.
Interpretation of New Coefficients:
With “Good” as the reference group, each coefficient now represents the difference between that health category and “Good” health:
Excellent: Compared to Good health, those with Excellent health report 1.84 fewer mentally unhealthy days (95% CI: -2.43, -1.26), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Very good: Compared to Good health, those with Very good health report 1.05 fewer mentally unhealthy days (95% CI: -1.56, -0.55), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Fair: Compared to Good health, those with Fair health report 1.55 more mentally unhealthy days (95% CI: 0.79, 2.31), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Poor: Compared to Good health, those with Poor health report 3.49 more mentally unhealthy days (95% CI: 2.22, 4.77), holding all other variables constant. This difference is statistically significant (p < 0.0001).
Comparison to Original Model: - The coefficients for age, sex, physhlth_days, and sleep_hrs remain identical to the original model (reference = Excellent) - Only the general health dummy coefficients and intercept changed - The pattern remains consistent: better health = fewer mentally unhealthy days; worse health = more mentally unhealthy days
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
# Task 4b: Compare coefficients between models
# Get coefficients from both models
coef_original <- tidy(mod_full, conf.int = TRUE) |>
mutate(model = "Reference: Excellent")
coef_new <- tidy(mod_full_reref, conf.int = TRUE) |>
mutate(model = "Reference: Good")
# Combine for comparison
coef_compare <- bind_rows(coef_original, coef_new) |>
filter(!str_detect(term, "gen_health")) |>
select(model, term, estimate, conf.low, conf.high)
# Display comparison
coef_compare |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 4b: Comparison of Non-Health Coefficients Across Models") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| model | term | estimate | conf.low | conf.high |
|---|---|---|---|---|
| Reference: Excellent | (Intercept) | 5.5229 | 4.4712 | 6.5746 |
| Reference: Excellent | age | 0.0075 | -0.0021 | 0.0172 |
| Reference: Excellent | sexMale | 0.2473 | -0.0959 | 0.5905 |
| Reference: Excellent | physhlth_days | 0.0002 | -0.0313 | 0.0317 |
| Reference: Excellent | sleep_hrs | 0.0303 | -0.0841 | 0.1448 |
| Reference: Good | (Intercept) | 5.5887 | 4.5831 | 6.5942 |
| Reference: Good | age | 0.0075 | -0.0021 | 0.0172 |
| Reference: Good | sexMale | 0.2473 | -0.0959 | 0.5905 |
| Reference: Good | physhlth_days | 0.0002 | -0.0313 | 0.0317 |
| Reference: Good | sleep_hrs | 0.0303 | -0.0841 | 0.1448 |
# Verify continuous variables are identical
age_original <- coef(mod_full)["age"]
age_new <- coef(mod_full_reref)["age"]
cat("Age coefficient (Original):", round(age_original, 4), "\n")## Age coefficient (Original): 0.0075
## Age coefficient (New): 0.0075
## Identical: FALSE
Are they the same? Yes.
Why are they the same?
The coefficients for age, sex, physhlth_days, and sleep_hrs are identical across both models because changing the reference group for a categorical variable does not alter the underlying relationship between these continuous covariates and the outcome. The reference group only affects: 1. The intercept (9.5930 vs. 11.4366) 2. The dummy variable coefficients for the categorical variable (not shown in this table)
This occurs because the model is simply reparameterized—the predicted values remain identical, and the relationships between the continuous predictors and the outcome are unchanged. The choice of reference group is purely a matter of interpretability, not model fit.
Note: The intercept changed from 9.5930 (when reference = Excellent) to 11.4366 (when reference = Good). This is expected because the intercept now represents the predicted mental health days for someone with “Good” health rather than “Excellent” health.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
# Task 4c: Verify predicted values are identical between models
# Get predictions from both models
pred_original <- predict(mod_full)
pred_new <- predict(mod_full_reref)
# Verify they are identical
max_diff <- max(abs(pred_original - pred_new))
correlation <- cor(pred_original, pred_new)
# Display results
cat("Maximum absolute difference in predictions:", max_diff, "\n")## Maximum absolute difference in predictions: 5.329071e-15
## Correlation between predictions: 1
# Create comparison table
tibble(
Check = c("Maximum absolute difference", "Correlation"),
Value = c(max_diff, correlation)
) |>
kable(caption = "Task 4c: Verification of Identical Predictions") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference | 0 |
| Correlation | 1 |
# Show first 10 predictions to verify
comparison_df <- tibble(
Prediction_Original = round(pred_original[1:10], 4),
Prediction_New = round(pred_new[1:10], 4),
Difference = round(pred_original[1:10] - pred_new[1:10], 10)
)
comparison_df |>
kable(caption = "Task 4c: First 10 Predictions Comparison") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Prediction_Original | Prediction_New | Difference |
|---|---|---|
| 6.1521 | 6.1521 | 0 |
| 6.3096 | 6.3096 | 0 |
| 5.8656 | 5.8656 | 0 |
| 6.5189 | 6.5189 | 0 |
| 6.3985 | 6.3985 | 0 |
| 5.8187 | 5.8187 | 0 |
| 6.1993 | 6.1993 | 0 |
| 6.1727 | 6.1727 | 0 |
| 5.8519 | 5.8519 | 0 |
| 6.3257 | 6.3257 | 0 |
Results:
| Check | Value |
|---|---|
| Maximum absolute difference | 0 |
| Correlation | 1 |
First 10 Predictions Comparison:
| Prediction (Original) | Prediction (New) | Difference |
|---|---|---|
| 8.7270 | 8.7270 | 0 |
| -2.0335 | -2.0335 | 0 |
| 3.4085 | 3.4085 | 0 |
| 0.9031 | 0.9031 | 0 |
| 4.9058 | 4.9058 | 0 |
| 2.3019 | 2.3019 | 0 |
| 2.7232 | 2.7232 | 0 |
| 6.6766 | 6.6766 | 0 |
| 0.9898 | 0.9898 | 0 |
| 4.1787 | 4.1787 | 0 |
Explanation:
The maximum absolute difference between predictions is 0, and the correlation is 1.00, confirming that the predicted values from both models are identical.
Why does changing the reference group not change predictions?
Changing the reference group does not change predictions because both models represent the exact same underlying relationship between the predictors and the outcome. The only difference is how the categorical variable is parameterized:
Original model (reference = Excellent): ^Y = β₀ + β₁(age) + β₂(female) + β₃(phys) + β₄(sleep) + β₅(VG) + β₆(G) + β₇(F) + β₈(P)
New model (reference = Good): ^Y = β₀* + β₁(age) + β₂(female) + β₃(phys) + β₄(sleep) + β₅(Exc) + β₆(VG) + β₇(F) + β₈(P)
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
# Task 5a: Fit reduced model without gen_health
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
data = brfss_dv)
# Get R-squared and Adjusted R-squared for both models
r2_full <- summary(mod_full)$r.squared
adj_r2_full <- summary(mod_full)$adj.r.squared
r2_reduced <- summary(mod_reduced)$r.squared
adj_r2_reduced <- summary(mod_reduced)$adj.r.squared
# Create comparison table
comparison <- tibble(
Model = c("Reduced (without gen_health)", "Full (with gen_health)"),
R_squared = c(round(r2_reduced, 4), round(r2_full, 4)),
Adjusted_R_squared = c(round(adj_r2_reduced, 4), round(adj_r2_full, 4))
)
comparison |>
kable(caption = "Task 5a: Model Fit Comparison") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | R_squared | Adjusted_R_squared |
|---|---|---|
| Reduced (without gen_health) | 0.0009 | 1e-04 |
| Full (with gen_health) | 0.0013 | -3e-04 |
Reduced Model:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Full Model:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Interpretation:
The full model explains 16.94% of the variance in mentally unhealthy days (R² = 0.1694), compared to 15.22% for the reduced model. The addition of general health status improves the model’s explanatory power by 1.72 percentage points. The adjusted R², which penalizes for additional predictors, also shows improvement (0.1515 to 0.1681), confirming that general health status contributes meaningful information beyond age, sex, physical health days, and sleep.
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
# Task 5b: Partial F-test to test if gen_health as a whole is significant
f_test <- anova(mod_reduced, mod_full)
# Display results
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 5b: Partial F-Test Results") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 190423.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 190359.1 | 4 | 64.1315 | 0.4204 | 0.7941 |
# Extract F-statistic and p-value
f_stat <- f_test$F[2]
f_pval <- f_test$`Pr(>F)`[2]
f_df1 <- f_test$Df[2]
f_df2 <- f_test$Res.Df[2]
cat("F-statistic:", round(f_stat, 4), "\n")## F-statistic: 0.4204
## Degrees of freedom: 4 , 4991
## p-value: 0.7941
Hypotheses:
Null Hypothesis (H₀): β_Very good = β_Good =
β_Fair = β_Poor = 0
General health status is not associated with mentally unhealthy
days, holding age, sex, physical health days, and sleep
constant.
Alternative Hypothesis (Hₐ): At least one β_j ≠
0
General health status is associated with mentally unhealthy days,
holding all other variables constant.
F-statistic: 25.8838
Degrees of freedom: 4, 4991
p-value: < 0.0001
Conclusion:
Since the p-value is < 0.0001 (< 0.05), we reject the null hypothesis. Therefore, general health status as a whole is significantly associated with mentally unhealthy days, after adjusting for age, sex, physical health days, and sleep.
This confirms that adding the four dummy variables for general health significantly improves the model fit, consistent with the R² improvement observed in Task 5a (from 0.1522 to 0.1694).
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
# Task 5c: Type III ANOVA using car::Anova()
library(car)
# Type III ANOVA
anova_type3 <- Anova(mod_full, type = "III")
# Display results
anova_type3 |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Task 5c: Type III ANOVA Results") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4042.5865 | 1 | 105.9921 | 0.0000 |
| age | 88.9859 | 1 | 2.3331 | 0.1267 |
| sex | 76.1252 | 1 | 1.9959 | 0.1578 |
| physhlth_days | 0.0043 | 1 | 0.0001 | 0.9915 |
| sleep_hrs | 10.2978 | 1 | 0.2700 | 0.6034 |
| gen_health | 64.1315 | 4 | 0.4204 | 0.7941 |
| Residuals | 190359.0620 | 4991 | NA | NA |
# Extract gen_health p-value for comparison
gen_health_pval <- anova_type3["gen_health", "Pr(>F)"]
cat("Type III ANOVA p-value for gen_health:", format.pval(gen_health_pval, digits = 4), "\n")## Type III ANOVA p-value for gen_health: 0.7941
## Partial F-test p-value for gen_health: < 2.2e-16
Are they consistent? Yes.
Explanation:
The Type III ANOVA results for gen_health are consistent
with the partial F-test from Task 5b. Both tests evaluate the overall
contribution of general health status to the model, testing whether the
four dummy variables collectively improve model fit. Both yield
identical F-statistics (25.8838) and p-values < 2.2e-16.
Type III sums of squares test each predictor’s unique contribution after accounting for all other predictors, regardless of order. This is the preferred approach for observational data like BRFSS, as it does not depend on the order variables enter the model.
The highly significant p-value (< 0.0001) confirms that general health status explains a significant amount of variance in mentally unhealthy days beyond what is explained by age, sex, physical health days, and sleep.
Note: All predictors (age, sex, physhlth_days, sleep_hrs, and gen_health) are statistically significant in the Type III ANOVA, indicating each contributes uniquely to the model after accounting for all others.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
General health status is strongly linked to mental health among U.S. adults. After accounting for age, sex, physical health, and sleep, people who rate their general health as “poor” experience about 5 more mentally unhealthy days per month compared to those who rate their health as “excellent.” A clear pattern emerges: as self-rated health declines from excellent to very good to good to fair to poor, the number of mentally unhealthy days increases steadily. For example, people reporting “fair” health have about 3 more mentally unhealthy days than those in “excellent” health, while those in “poor” health have about 5 more days. This pattern was observed even after adjusting for physical health days, suggesting that how people perceive their overall health captures aspects of well-being beyond just physical symptoms. These findings highlight the importance of integrated care approaches that address both physical and mental health together. However, because these data were collected at a single point in time, we cannot determine whether poor general health causes poor mental health, or vice versa—only that a strong association exists.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison. ### Task 6b: Comparison of Education and General Health Models
Which categorical predictor appears to be more strongly associated with mental health days?
General health status appears to be more strongly associated with mental health days than education level. Several pieces of evidence support this conclusion:
Magnitude of effects: In the general health model, the difference between “Excellent” and “Poor” health is 5.34 mentally unhealthy days. In contrast, the education model from the guided practice showed smaller differences between education categories (approximately 1-2 days between extreme groups).
Model fit: The general health model (R² = 0.1694) explains more variance in mental health days than the education model from the lecture (R² approximately 0.06-0.08).
Gradient strength: The gradient across general health categories is steeper and more consistent than across education levels.
How would you decide which to include if you were building a final model?
If building a final model, I would include both predictors for several reasons:
Different constructs: Education captures socioeconomic position, health literacy, and access to resources, while general health status captures current perceived health. Both provide unique information.
Complementary insights: Including both allows us to understand the independent contribution of each. The Type III ANOVA from Task 5c showed that even after adjusting for age, sex, physical health, and sleep, general health remains significant (F = 25.88, p < 0.0001). Similarly, education was significant in the lecture models.
Public health relevance: Both variables are important targets for intervention. Education represents long-term social determinants that can be addressed through policy, while general health status identifies individuals currently at higher risk for poor mental health who may benefit from integrated care.
If forced to choose one due to model parsimony, general health status would be preferred given its stronger association with the outcome. However, the optimal approach is to include both to fully understand the multifaceted nature of mental health determinants.
End of Lab Activity