In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
brfss_dv <- brfss_full |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# General health status (5-level)
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Marital status
marital_status = factor(case_when(
marital == 1 ~ "Married",
marital == 2 ~ "Divorced",
marital == 3 ~ "Widowed",
marital == 4 ~ "Separated",
marital == 5 ~ "Never married",
marital == 6 ~ "Unmarried couple",
TRUE ~ NA_character_
), levels = c("Married", "Divorced", "Widowed", "Separated",
"Never married", "Unmarried couple")),
# Store the raw education numeric code for the "naive approach" demonstration
educ_numeric = case_when(
educa %in% c(1, 2, 3) ~ 1,
educa == 4 ~ 2,
educa == 5 ~ 3,
educa == 6 ~ 4,
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(gen_health),
!is.na(marital_status)
)
# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health, marital_status, educ_numeric) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_dv,
"~/Downloads/brfss_dv_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dv), ncol(brfss_dv))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)

| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()

| Characteristic | N | N = 5,000¹ |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
| Sleep (hours/night) | 5,000 | 7.0 (1.4) |
| Age (years) | 5,000 | 54.9 (17.5) |
| Sex | 5,000 | |
| Male | | 2,303 (46%) |
| Female | | 2,697 (54%) |
| Education level | 5,000 | |
| Less than HS | | 290 (5.8%) |
| HS graduate | | 1,348 (27%) |
| Some college | | 1,340 (27%) |
| College graduate | | 2,022 (40%) |
| General health status | 5,000 | |
| Excellent | | 1,065 (21%) |
| Very good | | 1,803 (36%) |
| Good | | 1,426 (29%) |
| Fair | | 523 (10%) |
| Poor | | 183 (3.7%) |

¹ Mean (SD); n (%)
ggplot(brfss_dv, aes(x = education, fill = education)) +
geom_bar(alpha = 0.85) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")

[Figure: Distribution of Education Level in Analytic Sample]
ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Education Level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Education Level",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")

[Figure: Mental Health Days by Education Level]
Categorical predictor variables come in two forms:
| Type | Definition | Examples |
|---|---|---|
| Nominal | Categories with no natural ordering | Sex, race/ethnicity, marital status, blood type |
| Ordinal | Categories with a meaningful order | Education level, income bracket, disease stage, Likert scale |
A further distinction is:
Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.
Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.
If we include this numeric code directly in a regression model, we are assuming:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]
This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Education Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5601 | 0.5039 | 18.9727 | 0 | 8.5723 | 10.5479 |
| age | -0.0661 | 0.0063 | -10.5135 | 0 | -0.0784 | -0.0538 |
| educ_numeric | -0.7168 | 0.1158 | -6.1917 | 0 | -0.9437 | -0.4898 |
This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:
Let’s visualize why this matters:
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
educ_numeric = 1:4,
predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)
ggplot() +
geom_point(data = group_means,
aes(x = educ_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = educ_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = educ_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:4,
labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "Education Level",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)

[Figure: Naive Linear Fit vs. Actual Group Means by Education]
Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
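The equal-spacing assumption can also be checked formally, because the numeric-code model is nested inside a model with one coefficient per category. The sketch below uses simulated data rather than BRFSS: the `toy` data frame, its group means, and the object names are made up for illustration.

```r
# Simulated data: four ordered groups whose means are NOT equally spaced
set.seed(553)
toy <- data.frame(educ_numeric = rep(1:4, each = 50))
true_means <- c(6, 3, 5, 2)  # hypothetical group means
toy$y <- true_means[toy$educ_numeric] + rnorm(nrow(toy), sd = 2)

# Numeric coding: a single slope, so equal spacing is imposed
fit_linear <- lm(y ~ educ_numeric, data = toy)

# Categorical coding: one coefficient per non-reference level
fit_factor <- lm(y ~ factor(educ_numeric), data = toy)

# The linear model is nested in the factor model, so an F-test asks
# whether the extra k - 2 = 2 parameters improve the fit
lin_vs_factor <- anova(fit_linear, fit_factor)
print(lin_vs_factor)
```

A small p-value in this comparison suggests that a single linear slope does not describe the group differences adequately.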
A dummy variable (also called an indicator variable) is a variable that takes on only two possible values: 1 if the observation belongs to the category of interest, and 0 otherwise.
If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).
Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.
The simplest example is a variable with two categories, such as sex.
With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:
\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]
The regression model becomes:
\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]
For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]
For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]
Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.
# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)
tidy(mod_sex, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: Sex",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.6262 | 0.3730 | 17.7666 | 0 | 5.8951 | 7.3574 |
| age | -0.0698 | 0.0063 | -11.1011 | 0 | -0.0821 | -0.0575 |
| sexFemale | 1.8031 | 0.2210 | 8.1585 | 0 | 1.3698 | 2.2364 |
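Interpretation:
Intercept (\(\hat{\beta}_0\) = 6.626): The estimated mean number of mentally unhealthy days for a male at age 0. This is an extrapolation, since the sample includes only adults.
age (\(\hat{\beta}\) = -0.070): Each additional year of age is associated with an estimated 0.070 fewer mentally unhealthy days, holding sex constant.
sexFemale (\(\hat{\beta}\) = 1.803): Females report an estimated 1.803 more mentally unhealthy days than males of the same age.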
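Note that R automatically creates dummy variables when a factor is included in lm(). The first level of the factor becomes the reference group. Levels are ordered alphabetically unless they are set explicitly, as they were here with levels = c(1, 2), which is why Male (the first level) is the reference. The worked equations above used Female as the reference instead, so the sign of the sex coefficient is reversed relative to that setup.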
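As a worked example using the fitted coefficients in the table above, here are the predicted values for a 40-year-old (an age chosen only for illustration):

```latex
\[
\begin{aligned}
\text{Male: } \hat{Y} &= 6.6262 - 0.0698(40) = 3.834 \\
\text{Female: } \hat{Y} &= 6.6262 - 0.0698(40) + 1.8031 = 5.637
\end{aligned}
\]
```

The gap of 1.803 days is the same at any age, because the model has no age-by-sex interaction.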
pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))
ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Sex",
subtitle = "Parallel lines: same slope, different intercepts",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")

[Figure: Parallel Regression Lines: Males vs. Females]
Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.
Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:
\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]
The data matrix looks like this:
| Observation | Education | HS_graduate | Some_college | College_graduate |
|---|---|---|---|---|
| 1 | Less than HS | 0 | 0 | 0 |
| 2 | HS graduate | 1 | 0 | 0 |
| 3 | Some college | 0 | 1 | 0 |
| 4 | College graduate | 0 | 0 | 1 |
| 5 | Less than HS | 0 | 0 | 0 |
Notice that the reference group is identified by having all dummy variables equal to zero.
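The same layout can be reproduced with base R's model.matrix(), which builds the design matrix that lm() would use. This is a minimal sketch on a five-row toy data frame mirroring the table above. The object names `toy` and `X` are illustrative.

```r
# Five observations with the same four education levels used in the lecture
educ_levels <- c("Less than HS", "HS graduate", "Some college", "College graduate")
toy <- data.frame(
  education = factor(
    c("Less than HS", "HS graduate", "Some college",
      "College graduate", "Less than HS"),
    levels = educ_levels
  )
)

# Design matrix: an intercept column plus k - 1 = 3 dummy columns.
# Rows for the reference level ("Less than HS") are zero in every dummy column.
X <- model.matrix(~ education, data = toy)
print(X)
```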
The reference group is the category against which all others are compared. Key points:
When we include a factor variable in lm(), R automatically creates the dummy variables. The first level of the factor is used as the reference group by default.
# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
data = brfss_dv)
tidy(mod_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Education Dummy Variables (Reference: Less than HS)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
The model is:
\[\widehat{\text{Mental Health Days}} = 11.138 - 0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) - 0.628(\text{Sleep}) - 0.587(\text{HS grad}) - 0.129(\text{Some college}) - 1.143(\text{College grad})\]
Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:
HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.
Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
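Using the estimates from the table above, the HS graduate versus college graduate comparison works out as:

```latex
\[
\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}
  = -0.587 - (-1.143) = 0.556
\]
```

That is, HS graduates report an estimated 0.556 more mentally unhealthy days than college graduates, holding the other predictors constant. The point estimate follows directly from the two coefficients, but its standard error does not, because the two estimates are correlated. A formal test needs the coefficient covariance or a refit with a different reference group.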
pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Education",
subtitle = "Parallel lines: same slopes for age, different intercepts by education",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")

[Figure: Predicted Mental Health Days by Age and Education Level]
These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.
relevel() in R

We may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.
# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")
mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
data = brfss_dv)
tidy(mod_educ_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: College graduate)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9948 | 0.6272 | 15.9349 | 0.0000 | 8.7652 | 11.2245 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| education_rerefLess than HS | 1.1429 | 0.4607 | 2.4805 | 0.0132 | 0.2396 | 2.0461 |
| education_rerefHS graduate | 0.5556 | 0.2574 | 2.1586 | 0.0309 | 0.0510 | 1.0601 |
| education_rerefSome college | 1.0139 | 0.2566 | 3.9507 | 0.0001 | 0.5108 | 1.5171 |
tribble(
~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
"Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
"Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
"Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
"Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
"Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
"R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
"Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Quantity | Ref: Less than HS | Ref: College graduate |
|---|---|---|
| Intercept | 11.1380 | 9.9950 |
| Age coefficient | -0.0770 | -0.0770 |
| Sex coefficient | 1.6810 | 1.6810 |
| Physical health days | 0.3110 | 0.3110 |
| Sleep hours | -0.6280 | -0.6280 |
| R-squared | 0.1553 | 0.1553 |
| Residual SE | 7.2690 | 7.2690 |
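What changes: the intercept (it is now the expected value for the new reference group when all other predictors equal zero) and the education dummy coefficients (each is now a difference from College graduate rather than from Less than HS).
What stays the same: the coefficients for age, sex, physical health days, and sleep hours; R-squared; the residual standard error; and the predicted values.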
This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
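The two coefficient tables are linked by simple arithmetic. With the estimates reported above, the new intercept and the new education coefficients can be recovered from the original fit (small differences in the last digit reflect rounding):

```latex
\[
\begin{aligned}
\hat{\beta}_0^{\,\text{(ref: College)}} &= 11.1377 + (-1.1429) = 9.9948 \\
\hat{\beta}_{\text{Less than HS}}^{\,\text{(ref: College)}} &= 0 - (-1.1429) = 1.1429 \\
\hat{\beta}_{\text{HS grad}}^{\,\text{(ref: College)}} &= -0.5873 - (-1.1429) = 0.5556 \\
\hat{\beta}_{\text{Some college}}^{\,\text{(ref: College)}} &= -0.1289 - (-1.1429) = 1.0140
\end{aligned}
\]
```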
# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)

| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:
\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]
where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.
This is called the dummy variable trap.
| Obs | Intercept | A | B | C | A + B + C |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 1 |
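The linear dependence in this small table can be confirmed numerically. The sketch below rebuilds the four observations in base R and checks the rank of the design matrix. The object names are illustrative.

```r
# Four observations in three categories, matching the small table above
grp <- factor(c("A", "B", "C", "A"))

# Full set of k = 3 dummies (no reference level dropped)
D <- model.matrix(~ grp - 1)

# Design matrix with an intercept AND all k dummies
X_trap <- cbind(Intercept = 1, D)

# The intercept column equals the row sums of the dummies,
# so the matrix has 4 columns but fewer than 4 independent ones
rank_trap <- qr(X_trap)$rank
print(rank_trap)

# Dropping one dummy (reference coding) restores full column rank
X_ref <- model.matrix(~ grp)
rank_ref <- qr(X_ref)$rank
print(rank_ref)
```

In practice lm() does not stop with an error in this situation. It reports NA for the redundant coefficient, which is easy to overlook.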
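Solutions: (1) Use \(k - 1\) dummies together with an intercept (reference cell coding, the default in R). (2) Remove the intercept with - 1 in the formula and include all \(k\) dummies. Then each coefficient is that group's expected outcome when all other predictors equal zero (a covariate-adjusted group intercept) rather than a difference from a reference.

# Model without intercept
# Note: with the intercept removed, R gives a full set of dummies only to the
# first factor in the formula (here sex); education still enters as k - 1 dummies
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
data = brfss_dv)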
tidy(mod_no_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model Without Intercept",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexMale | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| sexFemale | 12.8190 | 0.7524 | 17.0365 | 0.0000 | 11.3439 | 14.2941 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
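Note on this output: because sex is the first factor in the formula, removing the intercept expanded sex into both sexMale and sexFemale, while education still appears as three dummies relative to Less than HS. The sexMale estimate (11.1377) is simply the intercept from the original model, and sexFemale (12.8190) is that intercept plus the sexFemale coefficient. To obtain one coefficient for each of the four education levels, education would need to be the first factor listed in the formula.
Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.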
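In a no-intercept formula, R gives the full set of \(k\) dummies only to the first factor term, and later factors still receive \(k - 1\). The sketch below uses simulated data (the `toy` data frame and its random outcome are made up) to show how term order decides which factor is fully expanded.

```r
set.seed(553)
# Balanced toy data; expand.grid() turns the character vectors into factors
# with levels in the order given
toy <- expand.grid(
  sex = c("Male", "Female"),
  education = c("Less than HS", "HS graduate", "Some college", "College graduate"),
  rep = 1:15
)
toy$y <- rnorm(nrow(toy))  # hypothetical outcome

# sex listed first: sex gets both dummies, education keeps k - 1 = 3
fit_sex_first <- lm(y ~ sex + education - 1, data = toy)

# education listed first: education gets all k = 4 dummies
fit_educ_first <- lm(y ~ education + sex - 1, data = toy)

print(names(coef(fit_sex_first)))
print(names(coef(fit_educ_first)))
```

Both fits are the same model in different parameterizations, so their fitted values agree; only the meaning of the coefficients differs.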
When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.
To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):
\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]
This compares the full model (with education) to a reduced model (without education):
# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
# Partial F-test
f_test <- anova(mod_reduced, mod_educ)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education | 4992 | 263744.4 | 3 | 970.7509 | 6.1246 | 4e-04 |
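The F statistic in this table can be reproduced by hand from the two residual sums of squares and their degrees of freedom:

```latex
\[
F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}}) / (df_{\text{reduced}} - df_{\text{full}})}
         {\text{RSS}_{\text{full}} / df_{\text{full}}}
  = \frac{(264715.2 - 263744.4) / (4995 - 4992)}{263744.4 / 4992}
  = \frac{970.75 / 3}{52.83}
  \approx 6.12
\]
```

The numerator has 3 degrees of freedom because education contributes three dummy variables.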
car::Anova() for Type III Tests

The car::Anova() function with type = "III" provides a convenient way to test the overall significance of each predictor, including categorical variables:
Anova(mod_educ, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12000.1867 | 1 | 227.1325 | 0e+00 |
| age | 8863.3522 | 1 | 167.7603 | 0e+00 |
| sex | 3469.6448 | 1 | 65.6714 | 0e+00 |
| physhlth_days | 28765.1139 | 1 | 544.4492 | 0e+00 |
| sleep_hrs | 3506.1243 | 1 | 66.3619 | 0e+00 |
| education | 970.7509 | 3 | 6.1246 | 4e-04 |
| Residuals | 263744.4348 | 4992 | NA | NA |
Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
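The order dependence of sequential sums of squares is easy to see with two correlated predictors. The sketch below uses simulated data and base R only. It relies on drop1() for the "each term after all others" test, which in a main-effects model like this one should agree with the partial tests described above. The variable names are made up.

```r
set.seed(553)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)  # correlated with x1
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

fit_a <- lm(y ~ x1 + x2)  # x1 enters first
fit_b <- lm(y ~ x2 + x1)  # x1 enters last

# Sequential (Type I) sums of squares for x1 depend on entry order
ss_x1_first <- anova(fit_a)["x1", "Sum Sq"]
ss_x1_last  <- anova(fit_b)["x1", "Sum Sq"]

# drop1() tests each term after all the others, so order does not matter
ss_x1_partial_a <- drop1(fit_a, test = "F")["x1", "Sum of Sq"]
ss_x1_partial_b <- drop1(fit_b, test = "F")["x1", "Sum of Sq"]

print(c(first = ss_x1_first, last = ss_x1_last,
        partial_a = ss_x1_partial_a, partial_b = ss_x1_partial_b))
```

The sequential sum of squares for x1 matches the partial one only when x1 is entered last.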
This is what R uses by default (contr.treatment). Each coefficient represents the difference between a group and the reference group.
## HS graduate Some college College graduate
## Less than HS 0 0 0
## HS graduate 1 0 0
## Some college 0 1 0
## College graduate 0 0 1
In effect coding (contr.sum), each coefficient represents the difference between a group’s mean and the grand mean (the unweighted average of all group means). This is common in ANOVA contexts.
# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)
mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
data = brfss_dv)
tidy(mod_effect, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Less than HS vs. Grand Mean | 0.4648 | 0.3323 | 1.3988 | 0.1619 | -0.1866 | 1.1162 |
| Education: HS graduate vs. Grand Mean | -0.1225 | 0.1939 | -0.6319 | 0.5275 | -0.5026 | 0.2576 |
| Education: Some college vs. Grand Mean | 0.3358 | 0.1946 | 1.7257 | 0.0845 | -0.0457 | 0.7174 |
With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.
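For the omitted group (College graduate), the deviation can be recovered from the three reported coefficients:

```latex
\[
\hat{\beta}_{\text{College grad}} = -\left(0.4648 + (-0.1225) + 0.3358\right) = -0.6781
\]
\[
10.6729 + (-0.6781) = 9.9948
\]
```

The second line adds this deviation to the effect-coded intercept and reproduces the intercept of the model that used College graduate as the reference group, as expected for two parameterizations of the same model.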
When a categorical variable is truly ordinal (like education), we can test for specific patterns using orthogonal polynomial contrasts (contr.poly). These decompose the group differences into linear, quadratic, and cubic trends.
# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)
mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
data = brfss_dv)
tidy(mod_ord, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "\\.L$") ~ "Education: Linear trend",
str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
str_detect(term, "\\.C$") ~ "Education: Cubic trend",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Linear trend | -0.6642 | 0.3158 | -2.1028 | 0.0355 | -1.2833 | -0.0450 |
| Education: Quadratic trend | -0.2133 | 0.2682 | -0.7954 | 0.4264 | -0.7391 | 0.3125 |
| Education: Cubic trend | -0.5630 | 0.2142 | -2.6282 | 0.0086 | -0.9830 | -0.1431 |
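Interpretation:
Linear trend (\(\hat{\beta}\) = -0.664, p = 0.036): Mentally unhealthy days tend to decrease as education level increases.
Quadratic trend (\(\hat{\beta}\) = -0.213, p = 0.426): There is no evidence of a single-bend (U-shaped or inverted-U) pattern.
Cubic trend (\(\hat{\beta}\) = -0.563, p = 0.009): The decrease is not a straight line. This matches the treatment-coded estimates, where Some college sits closer to Less than HS than HS graduate does.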
Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
| Coding Scheme | R Function | Intercept | Each β represents | Best for |
|---|---|---|---|---|
| Treatment (Reference) | contr.treatment (default) | Reference group mean | Difference from reference group | Group comparisons to baseline |
| Effect (Deviation) | contr.sum | Grand mean | Deviation from grand mean | ANOVA-style analyses |
| Polynomial (Ordinal) | contr.poly | Grand mean | Linear/quadratic/cubic trend | Ordinal variables with ordered levels |
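The three schemes in this table differ only in the contrast matrix attached to the factor. The base R sketch below prints each matrix for \(k = 4\) levels, with rows as factor levels and columns as the variables entered in the model. The object names are illustrative.

```r
k <- 4  # number of education levels

treat     <- contr.treatment(k)  # reference coding (default for unordered factors)
eff       <- contr.sum(k)        # effect (deviation) coding
poly_ctr  <- contr.poly(k)       # orthogonal polynomial coding

print(treat)               # first level is all zeros: the reference group
print(eff)                 # last level is all -1: the omitted group
print(round(poly_ctr, 3))  # columns .L, .Q, .C: linear, quadratic, cubic
```

The columns of the effect and polynomial matrices each sum to zero, which is why the intercept under those schemes is an unweighted average across groups rather than a single group's value.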
Guidelines for choosing the reference group:
as.factor() Is Required

If a categorical variable is stored as numeric in your data (e.g., coded 0, 1, 2, 3), R will treat it as continuous by default. You must use as.factor() or factor() to tell R it is categorical:
# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)
# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)
# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
"Numeric (wrong)", 1, 1,
"Factor (correct)", 3, 3
) |>
kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)

| Model | Number of education coefficients | Degrees of freedom used |
|---|---|---|
| Numeric (wrong) | 1 | 1 |
| Factor (correct) | 3 | 3 |
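When converting numeric codes, it often helps to attach labels and fix the level order in the same step, so the reference group is explicit. This is a minimal sketch. The vector `educ_code` is a made-up example of raw codes.

```r
# Hypothetical numeric codes as they might arrive in a raw data file
educ_code <- c(1, 4, 2, 3, 4, 2)

# Attach labels and set the level order explicitly
educ_fct <- factor(
  educ_code,
  levels = 1:4,
  labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
)

# Quick checks before modelling
print(is.numeric(educ_code))  # raw codes would enter lm() as a single slope
print(is.factor(educ_fct))    # the factor would enter lm() as k - 1 dummies
print(levels(educ_fct))       # the first level listed is the reference group
print(table(educ_fct))
```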
What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:
Option 1: Change the reference group with relevel().
Option 2: Compute the difference manually from the model output.
# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]
# Use linearHypothesis() for a formal F-test and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")
cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")
cat("F-statistic:", round(lin_test$F[2], 2), "\n")
cat("p-value:", round(lin_test$`Pr(>F)`[2], 4), "\n")

## Estimated difference (HS grad - College grad): 0.556 days
## F-statistic: 4.66
## p-value: 0.0309
car::linearHypothesis() is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
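For a single contrast between two non-reference groups, the same test can be sketched in base R from the coefficient covariance matrix. The example below uses simulated data with three groups (the `toy` data frame and its group means are made up) and cross-checks the result against a refit with a different reference group.

```r
set.seed(553)
toy <- data.frame(g = factor(rep(c("A", "B", "C"), each = 40)))
toy$y <- c(0, 1, 1.5)[as.integer(toy$g)] + rnorm(nrow(toy))

fit <- lm(y ~ g, data = toy)  # reference group is "A"
b <- coef(fit)
V <- vcov(fit)

# Contrast: B minus C, neither of which is the reference group
est <- unname(b["gB"] - b["gC"])
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 Cov(b1, b2)
se  <- sqrt(V["gB", "gB"] + V["gC", "gC"] - 2 * V["gB", "gC"])
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Cross-check: refit with "C" as the reference; the third coefficient is B vs. C
fit_c <- lm(y ~ relevel(g, ref = "C"), data = toy)
t_relevel <- summary(fit_c)$coefficients[3, "t value"]

print(c(estimate = est, se = se, t = t_stat, p = p_val))
```

For a one-degree-of-freedom hypothesis like this, the F statistic is the square of the t statistic, so both approaches give the same p-value.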
| Concept | Key Point |
|---|---|
| Categorical predictors | Cannot be included as raw numeric codes in regression |
| Dummy variables | Binary (0/1) indicators; need \(k - 1\) for \(k\) categories |
| Reference group | The omitted category; all comparisons are relative to it |
| Changing reference | Use relevel(); predictions unchanged, interpretation changes |
| Partial F-test | Tests whether the categorical variable as a whole is significant |
| Dummy variable trap | Including \(k\) dummies + intercept = perfect multicollinearity |
| as.factor() | Required when categorical variable is stored as numeric |
| Coding schemes | Treatment (default), effect, polynomial; each answers a different question |
| Type III ANOVA | Preferred for unbalanced observational data |
| Linear hypothesis | linearHypothesis() tests comparisons between non-reference groups |
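The dummy variable trap listed in the table can be seen directly in the design matrix. The following sketch uses a hypothetical 3-level factor `g` (an assumption for illustration) and compares the rank of the design matrix with all \(k\) indicators plus an intercept against R's default treatment coding.

```r
# Hypothetical 3-level factor, balanced for simplicity
g <- factor(rep(c("A", "B", "C"), times = 4))

# All k = 3 indicators plus an intercept column: 4 columns in total
X_trap <- cbind(intercept = 1, model.matrix(~ g - 1))

# Default treatment coding: intercept plus k - 1 = 2 indicators
X_ok <- model.matrix(~ g)

# Rank below the number of columns means perfect multicollinearity
rank_trap <- qr(X_trap)$rank
rank_ok <- qr(X_ok)$rank
c(cols_trap = ncol(X_trap), rank_trap = rank_trap,
  cols_ok = ncol(X_ok), rank_ok = rank_ok)
```

In `X_trap` the three indicator columns sum to the intercept column, so one column is redundant; `lm()` handles this by reporting `NA` for an aliased coefficient.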
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
| menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) |
| physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) |
| sleep_hrs | Sleep hours per night | Continuous (1–14) |
| age | Age in years (capped at 80) | Continuous |
| sex | Sex (Male/Female) | Factor |
| education | Education level (4 categories) | Factor |
| gen_health | General health status (5 categories) | Factor |
| marital_status | Marital status (6 categories) | Factor |
| educ_numeric | Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv_2020 <- readRDS("~/Downloads/epi552/brfss_dv_2020.rds")
1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.
brfss_dv_2020 |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "Health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1: Descriptive Statistics for Lab Variables**")
| Characteristic | N | N = 5,000¹ |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
| Age (years) | 5,000 | 54.9 (17.5) |
| Sex | 5,000 | |
| Male | | 2,303 (46%) |
| Female | | 2,697 (54%) |
| Health status | 5,000 | |
| Excellent | | 1,065 (21%) |
| Very good | | 1,803 (36%) |
| Good | | 1,426 (29%) |
| Fair | | 523 (10%) |
| Poor | | 183 (3.7%) |
| Marital status | 5,000 | |
| Married | | 2,708 (54%) |
| Divorced | | 622 (12%) |
| Widowed | | 534 (11%) |
| Separated | | 109 (2.2%) |
| Never married | | 848 (17%) |
| Unmarried couple | | 179 (3.6%) |
¹ Mean (SD); n (%)
The sample included 5,000 individuals with a mean age of 54.9 years (SD = 17.5). On average, participants reported 3.8 mentally unhealthy days (SD = 7.9) in the past 30 days. The sample was 54% female and 46% male. Most participants reported very good (36%) or good (29%) general health, while fewer reported fair (10%) or poor (3.7%) health. Over half of the sample (54%) were married.
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
ggplot(brfss_dv_2020, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
labs(
title = "Mentally Unhealthy Days by Health Status",
x = "Health Status",
y = "Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
On average, individuals who reported poor general health had the highest number of mentally unhealthy days. The median and spread of mentally unhealthy days increase as general health worsens from excellent to poor. This pattern is consistent with what we would expect, as individuals with worse overall health tend to experience more mental health challenges.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
marital_means <- brfss_dv_2020 |>
group_by(marital_status) |>
summarise(
mean_menthlth = mean(menthlth_days, na.rm = TRUE),
.groups = "drop"
)
marital_means |>
kable(
caption = "Mean Mentally Unhealthy Days by Marital Status",
digits = 2
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| marital_status | mean_menthlth |
|---|---|
| Married | 3.10 |
| Divorced | 4.49 |
| Widowed | 2.67 |
| Separated | 6.22 |
| Never married | 5.28 |
| Unmarried couple | 6.07 |
The Separated group has the highest mean (6.22 days), followed closely by the Unmarried couple group (6.07 days). The Widowed group has the lowest mean (2.67 days), with the Married group next lowest (3.10 days).
ggplot(marital_means, aes(x = marital_status, y = mean_menthlth, fill = marital_status)) +
geom_col(alpha = 0.85) +
labs(
title = "Mean Mentally Unhealthy Days by Marital Status",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.
brfss_dv_2020 <- brfss_dv_2020 |>
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_
)
)
mod_gen_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv_2020)
tidy(mod_gen_naive, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Health Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| gen_health_numeric | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
The coefficient for gen_health_numeric is 1.8578 (95% CI: 1.65, 2.06). Each one-unit increase in the numeric health code (a one-category step toward worse health) is associated with 1.86 more mentally unhealthy days, on average. Because health enters the model as a single numeric term, the model assumes this same 1.86-day difference for every adjacent pair of categories (e.g., Excellent to Very good, or Good to Fair).
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
mod_gen_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv_2020)
tidy(mod_gen_factor, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Correct Model: Health Treated as a Factor",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
The factor model uses 4 coefficients instead of 1 because gen_health has 5 categories and one of them, Excellent, serves as the reference group. Each of the remaining categories is compared to this reference group, which requires 4 dummy variables. Compared to the naive model, the factor model allows each category of general health to have its own association with mentally unhealthy days. In contrast, the numeric model assumes that every one-step change in general health has the same effect. The naive numeric approach may be misleading because it assumes equal spacing between categories. The factor model shows that the steps are not equal: the estimated differences grow much larger in the worse health categories.
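To make the unequal spacing concrete, the step between each pair of adjacent categories can be computed from the factor-model coefficients. The sketch below hardcodes the rounded estimates from the two coefficient tables rather than refitting the models, so the results carry that rounding; the object names are chosen for illustration.

```r
# Factor-model coefficients copied from the table (reference: Excellent = 0)
coefs <- c(Excellent = 0, "Very good" = 0.5903, Good = 1.9535,
           Fair = 5.0624, Poor = 9.6640)

# Estimated change in mean days for each one-category step toward worse health
steps <- diff(coefs)
round(steps, 2)

# The naive numeric model assigns every one of these steps the same slope
naive_slope <- 1.8578
round(steps - naive_slope, 2)
```

The steps run from roughly 0.6 days (Excellent to Very good) to roughly 4.6 days (Fair to Poor), so the single slope of 1.86 overstates the first two steps and understates the last two.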
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
mod_gen_full <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv_2020
)
tidy(mod_gen_full, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Full Model with Health Dummy Variables",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
## (Intercept) age sexFemale physhlth_days
## 9.5929825 -0.0866721 1.7253789 0.2314200
## sleep_hrs gen_healthVery good gen_healthGood gen_healthFair
## -0.5865953 0.7899467 1.8436011 3.3952833
## gen_healthPoor
## 5.3353468
\[\widehat{\text{menthlth\_days}} = 9.5930 - 0.0867(\text{age}) + 1.7254(\text{Female}) + 0.2314(\text{physhlth\_days}) - 0.5866(\text{sleep\_hrs}) + 0.7899(\text{Very good}) + 1.8436(\text{Good}) + 3.3953(\text{Fair}) + 5.3353(\text{Poor})\]
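As a worked example of how the dummy variables enter the fitted equation, the sketch below computes a prediction for one hypothetical respondent. The coefficients are the rounded values from the model output; the respondent's characteristics and the vector names are assumptions chosen for illustration.

```r
# Coefficients copied from the fitted model output (rounded to 4 decimals)
b <- c(intercept = 9.5930, age = -0.0867, female = 1.7254,
       physhlth_days = 0.2314, sleep_hrs = -0.5866,
       very_good = 0.7899, good = 1.8436, fair = 3.3953, poor = 5.3353)

# Hypothetical respondent: 50-year-old woman, 5 physically unhealthy days,
# 7 hours of sleep, "Fair" general health (only the Fair dummy equals 1)
x <- c(intercept = 1, age = 50, female = 1, physhlth_days = 5, sleep_hrs = 7,
       very_good = 0, good = 0, fair = 1, poor = 0)

# Predicted mentally unhealthy days = sum of coefficient * value
pred_days <- sum(b * x)
round(pred_days, 2)
```

For a respondent in Excellent health all four gen_health dummies are 0, so the same calculation would omit the 3.3953 term entirely.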
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
Very good: On average, individuals with very good health report 0.79 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Good: On average, individuals with good health report 1.84 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Fair: On average, individuals with fair health report 3.40 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
Poor: On average, individuals with poor health report 5.34 more mentally unhealthy days compared to those with excellent health, holding all other variables constant.
The group that differs the most from the reference group is the poor health group. This group has the largest estimated coefficient, indicating the greatest increase in mentally unhealthy days compared to those with excellent health.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
gen_coef_plot <- tidy(mod_gen_full, conf.int = TRUE) |>
filter(str_detect(term, "gen_health")) |>
mutate(
term = case_when(
term == "gen_healthVery good" ~ "Very good vs Excellent",
term == "gen_healthGood" ~ "Good vs Excellent",
term == "gen_healthFair" ~ "Fair vs Excellent",
term == "gen_healthPoor" ~ "Poor vs Excellent",
TRUE ~ term
)
)
ggplot(gen_coef_plot, aes(x = estimate, y = fct_reorder(term, estimate))) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed") +
labs(
title = "Health Dummy Variable Coefficients",
x = "Estimated Coefficient",
y = ""
) +
theme_minimal(base_size = 13)
4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.
brfss_dv_2020 <- brfss_dv_2020 |>
mutate(gen_health_goodref = relevel(gen_health, ref = "Good"))
mod_gen_goodref <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_goodref,
data = brfss_dv_2020
)
tidy(mod_gen_goodref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Full Model with Good as the Reference Group",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_goodrefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_goodrefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_goodrefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_goodrefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
tribble(
~Term, ~Original_Reference, ~Good_Reference,
"Age", coef(mod_gen_full)["age"], coef(mod_gen_goodref)["age"],
"SexFemale", coef(mod_gen_full)["sexFemale"], coef(mod_gen_goodref)["sexFemale"],
"Physical health days", coef(mod_gen_full)["physhlth_days"], coef(mod_gen_goodref)["physhlth_days"],
"Sleep hours", coef(mod_gen_full)["sleep_hrs"], coef(mod_gen_goodref)["sleep_hrs"]
) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Comparison of Non-General Health Coefficients Across Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Original_Reference | Good_Reference |
|---|---|---|
| Age | -0.0867 | -0.0867 |
| SexFemale | 1.7254 | 1.7254 |
| Physical health days | 0.2314 | 0.2314 |
| Sleep hours | -0.5866 | -0.5866 |
The coefficients for age, sex, physical health days, and sleep hours are the same in both models. Changing the reference group for a categorical variable changes only the intercept and that variable's own dummy coefficients. The two sets of dummy variables carry exactly the same information about group membership, so the model is fit to the same data in a different parameterization. As a result, the adjusted associations for the other predictors and the overall model fit do not change; only the way the general health categories are compared to one another is expressed differently.
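The Good-reference coefficients can be derived from the Excellent-reference ones without refitting. This is a minimal sketch using the rounded estimates reported in Task 3a; the object names are assumptions for illustration.

```r
# gen_health coefficients with Excellent as the reference (Task 3a output)
b_excellent_ref <- c(Excellent = 0, "Very good" = 0.7899, Good = 1.8436,
                     Fair = 3.3953, Poor = 5.3353)
intercept_excellent_ref <- 9.5930

# Re-express each level relative to Good by subtracting Good's coefficient
b_good_ref <- b_excellent_ref - b_excellent_ref["Good"]

# The intercept absorbs the shift in baseline
intercept_good_ref <- intercept_excellent_ref + unname(b_excellent_ref["Good"])

round(b_good_ref, 4)
round(intercept_good_ref, 4)
```

The results reproduce the gen_health_goodref rows and the intercept of the refitted model in Task 4a, while the coefficients for age, sex, physical health days, and sleep hours never enter the calculation.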
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
pred_orig <- predict(mod_gen_full)
pred_goodref <- predict(mod_gen_goodref)
cor(pred_orig, pred_goodref)
## [1] 1
## [1] 3.552714e-15
The correlation between the predicted values from the two models is 1, and the difference between them is essentially 0 (on the order of 10^-15, which is floating-point rounding error), indicating that the predictions are identical. Changing the reference group does not change the predicted values because the model is mathematically the same; only the way the coefficients for the categorical variable are expressed changes. Each person's prediction is built from a different baseline category plus a correspondingly shifted dummy coefficient, and the two always add up to the same number.
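The reason can be written out in a few lines. In the notation below (introduced here for illustration), \(\hat\beta_0\) and \(\hat\beta_j\) are the intercept and the coefficient for category \(j\) under the original reference, \(r\) is the new reference category, and primes mark the releveled model. The contribution of the other predictors is identical in both models and is omitted.

\[
\hat\beta_0' = \hat\beta_0 + \hat\beta_r, \qquad
\hat\beta_j' = \hat\beta_j - \hat\beta_r
\quad\Longrightarrow\quad
\hat\beta_0' + \hat\beta_j' = \hat\beta_0 + \hat\beta_j .
\]

As a numerical check with the rounded estimates for a respondent in Fair health: \(9.5930 + 3.3953 = 12.9883\) under the Excellent reference and \(11.4366 + 1.5517 = 12.9883\) under the Good reference.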
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
mod_gen_reduced <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
data = brfss_dv_2020
)
tibble(
Model = c("Reduced model", "Full model"),
R_squared = c(summary(mod_gen_reduced)$r.squared, summary(mod_gen_full)$r.squared),
Adjusted_R_squared = c(summary(mod_gen_reduced)$adj.r.squared, summary(mod_gen_full)$adj.r.squared)
) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Model Fit Statistics: Reduced vs Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Model | R_squared | Adjusted_R_squared |
|---|---|---|
| Reduced model | 0.1522 | 0.1515 |
| Full model | 0.1694 | 0.1681 |
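The reduced model has an \(R^2\) of 0.1522 and an adjusted \(R^2\) of 0.1515. The full model has an \(R^2\) of 0.1694 and an adjusted \(R^2\) of 0.1681. The full model explains more of the variation in mentally unhealthy days than the reduced model. Because adjusted \(R^2\) also increases despite its penalty for the four additional parameters, adding general health improves the model fit beyond what extra parameters alone would produce.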
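The adjusted values can be reproduced from \(R^2\), the sample size, and the number of predictors using the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). The sketch below assumes \(n = 5{,}000\) complete observations (consistent with the residual degrees of freedom reported for these models) and uses a helper function `adj_r2()` defined here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and p predictors
# (p counts slope coefficients and excludes the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 5000

# Reduced model: age, sex, physhlth_days, sleep_hrs (p = 4)
adj_reduced <- adj_r2(0.1522, n, p = 4)

# Full model: the same four plus 4 gen_health dummies (p = 8)
adj_full <- adj_r2(0.1694, n, p = 8)

round(c(reduced = adj_reduced, full = adj_full), 4)
```

With a sample this large the penalty is small, so adjusted \(R^2\) sits only slightly below \(R^2\) in both models.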
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
anova(mod_gen_reduced, mod_gen_full)
## Analysis of Variance Table
##
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4995 264715
## 2 4991 259335 4 5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(H_0: \beta_{\text{Very good}} = \beta_{\text{Good}} = \beta_{\text{Fair}} = \beta_{\text{Poor}} = 0\)
\(H_A\): At least one of these coefficients is not equal to 0.
F-statistic = 25.88, with 4 and 4,991 degrees of freedom; p-value < 2.2 × 10^-16.
Since the p-value is far below 0.05, we reject the null hypothesis. This indicates that general health, as a set of dummy variables, significantly improves the model. General health is significantly associated with mentally unhealthy days, holding all other variables constant.
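The F-statistic can be reproduced by hand from the ANOVA table. The sketch below hardcodes the rounded values printed in the output, so the result agrees with the reported statistic only up to that rounding; the object names are assumptions for illustration.

```r
# Values copied from the ANOVA table
rss_full <- 259335        # residual sum of squares, full model
ss_gen_health <- 5379.8   # "Sum of Sq": drop in RSS from adding gen_health
df_num <- 4               # number of gen_health dummy variables tested
df_den <- 4991            # residual degrees of freedom, full model

# Partial F = (drop in RSS per tested parameter) / (full-model residual mean square)
f_stat <- (ss_gen_health / df_num) / (rss_full / df_den)

# Upper-tail probability from the F distribution
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

round(f_stat, 2)
p_value < 2.2e-16
```

The numerator degrees of freedom equal the number of dummy variables being tested jointly, which is why this test evaluates gen_health as a whole rather than any single category.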
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(mod_gen_full, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA for Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
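The Type III ANOVA result for gen_health shows an F-statistic of 25.88 on 4 and 4,991 degrees of freedom, with a p-value that rounds to 0 in the table (p < 0.001). This matches the partial F-test: the sum of squares (5,379.75) and the F-statistic are the same, because the Type III test for a term compares the full model to the model with that term removed, which is the comparison made in Task 5b. Both tests indicate that general health, as a set of dummy variables, significantly improves the model and is significantly associated with mentally unhealthy days, holding all other variables constant.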
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
People who reported fair or poor health had noticeably more mentally unhealthy days compared to those with excellent health, with those in poor health having the largest difference. On average, individuals in poorer health reported about 3 to 5 more mentally unhealthy days in the past month. This shows that worse overall health is linked to worse mental health. However, because the data were collected at one point in time, we cannot say that general health causes changes in mental health.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.