In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
brfss_dv <- brfss_full |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# General health status (5-level)
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Marital status
marital_status = factor(case_when(
marital == 1 ~ "Married",
marital == 2 ~ "Divorced",
marital == 3 ~ "Widowed",
marital == 4 ~ "Separated",
marital == 5 ~ "Never married",
marital == 6 ~ "Unmarried couple",
TRUE ~ NA_character_
), levels = c("Married", "Divorced", "Widowed", "Separated",
"Never married", "Unmarried couple")),
# Store the raw education numeric code for the "naive approach" demonstration
educ_numeric = case_when(
educa %in% c(1, 2, 3) ~ 1,
educa == 4 ~ 2,
educa == 5 ~ 3,
educa == 6 ~ 4,
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(gen_health),
!is.na(marital_status)
)
# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health, marital_status, educ_numeric) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_dv,
"brfss_dv_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dv), ncol(brfss_dv))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Sleep (hours/night) | 5,000 | 7.0 (1.4) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
ggplot(brfss_dv, aes(x = education, fill = education)) +
geom_bar(alpha = 0.85) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Distribution of Education Level in Analytic Sample
ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Education Level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Education Level",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by Education Level
Categorical predictor variables come in two forms:
| Type | Definition | Examples |
|---|---|---|
| Nominal | Categories with no natural ordering | Sex, race/ethnicity, marital status, blood type |
| Ordinal | Categories with a meaningful order | Education level, income bracket, disease stage, Likert scale |
A further distinction is:
Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.
Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.
If we include this numeric code directly in a regression model, we are assuming:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]
This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Education Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5601 | 0.5039 | 18.9727 | 0 | 8.5723 | 10.5479 |
| age | -0.0661 | 0.0063 | -10.5135 | 0 | -0.0784 | -0.0538 |
| educ_numeric | -0.7168 | 0.1158 | -6.1917 | 0 | -0.9437 | -0.4898 |
This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:
Let’s visualize why this matters:
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
educ_numeric = 1:4,
predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)
ggplot() +
geom_point(data = group_means,
aes(x = educ_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = educ_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = educ_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:4,
labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "Education Level",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)Naive Linear Fit vs. Actual Group Means by Education
Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:
If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).
Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.
The simplest example is a variable with two categories, such as sex.
With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:
\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]
The regression model becomes:
\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]
For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]
For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]
Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.
# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)
tidy(mod_sex, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: Sex",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.6262 | 0.3730 | 17.7666 | 0 | 5.8951 | 7.3574 |
| age | -0.0698 | 0.0063 | -11.1011 | 0 | -0.0821 | -0.0575 |
| sexFemale | 1.8031 | 0.2210 | 8.1585 | 0 | 1.3698 | 2.2364 |
Interpretation:
Note that R automatically creates dummy variables when a factor is included in
lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.
pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))
ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Sex",
subtitle = "Parallel lines: same slope, different intercepts",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Parallel Regression Lines: Males vs. Females
Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.
Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:
\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]
The data matrix looks like this:
| Observation | Education | HS_graduate | Some_college | College_graduate |
|---|---|---|---|---|
| 1 | Less than HS | 0 | 0 | 0 |
| 2 | HS graduate | 1 | 0 | 0 |
| 3 | Some college | 0 | 1 | 0 |
| 4 | College graduate | 0 | 0 | 1 |
| 5 | Less than HS | 0 | 0 | 0 |
Notice that the reference group is identified by having all dummy variables equal to zero.
The reference group is the category against which all others are compared. Key points:
When we include a factor variable in lm(), R
automatically creates the dummy variables. The first level of the factor
is used as the reference group by default.
# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
data = brfss_dv)
tidy(mod_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Education Dummy Variables (Reference: Less than HS)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
The model is:
\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]
Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:
HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.
Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Education",
subtitle = "Parallel lines: same slopes for age, different intercepts by education",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Predicted Mental Health Days by Age and Education Level
These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.
relevel() in RWe may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.
# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")
mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
data = brfss_dv)
tidy(mod_educ_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: College graduate)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9948 | 0.6272 | 15.9349 | 0.0000 | 8.7652 | 11.2245 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| education_rerefLess than HS | 1.1429 | 0.4607 | 2.4805 | 0.0132 | 0.2396 | 2.0461 |
| education_rerefHS graduate | 0.5556 | 0.2574 | 2.1586 | 0.0309 | 0.0510 | 1.0601 |
| education_rerefSome college | 1.0139 | 0.2566 | 3.9507 | 0.0001 | 0.5108 | 1.5171 |
tribble(
~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
"Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
"Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
"Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
"Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
"Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
"R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
"Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Less than HS | Ref: College graduate |
|---|---|---|
| Intercept | 11.1380 | 9.9950 |
| Age coefficient | -0.0770 | -0.0770 |
| Sex coefficient | 1.6810 | 1.6810 |
| Physical health days | 0.3110 | 0.3110 |
| Sleep hours | -0.6280 | -0.6280 |
| R-squared | 0.1553 | 0.1553 |
| Residual SE | 7.2690 | 7.2690 |
What changes:
What stays the same:
This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:
\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]
where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.
This is called the dummy variable trap.
| Obs | Intercept | A | B | C | A + B + C |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 1 |
Solutions:
- 1 in the formula and include all \(k\) dummies. Then each coefficient is the
group mean (adjusted for other predictors) rather than a difference from
a reference.# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
data = brfss_dv)
tidy(mod_no_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model Without Intercept: All k Education Dummies Included",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexMale | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| sexFemale | 12.8190 | 0.7524 | 17.0365 | 0.0000 | 11.3439 | 14.2941 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.
When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.
To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):
\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]
This compares the full model (with education) to a reduced model (without education):
# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
# Partial F-test
f_test <- anova(mod_reduced, mod_educ)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education | 4992 | 263744.4 | 3 | 970.7509 | 6.1246 | 4e-04 |
car::Anova() for Type III TestsThe car::Anova() function with type = "III"
provides a convenient way to test the overall significance of each
predictor, including categorical variables:
Anova(mod_educ, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12000.1867 | 1 | 227.1325 | 0e+00 |
| age | 8863.3522 | 1 | 167.7603 | 0e+00 |
| sex | 3469.6448 | 1 | 65.6714 | 0e+00 |
| physhlth_days | 28765.1139 | 1 | 544.4492 | 0e+00 |
| sleep_hrs | 3506.1243 | 1 | 66.3619 | 0e+00 |
| education | 970.7509 | 3 | 6.1246 | 4e-04 |
| Residuals | 263744.4348 | 4992 | NA | NA |
Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
This is what R uses by default (contr.treatment). Each
coefficient represents the difference between a group and the reference
group.
## HS graduate Some college College graduate
## Less than HS 0 0 0
## HS graduate 1 0 0
## Some college 0 1 0
## College graduate 0 0 1
In effect coding (contr.sum), each
coefficient represents the difference between a group’s mean and the
grand mean (the unweighted average of all group means).
This is common in ANOVA contexts.
# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)
mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
data = brfss_dv)
tidy(mod_effect, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Less than HS vs. Grand Mean | 0.4648 | 0.3323 | 1.3988 | 0.1619 | -0.1866 | 1.1162 |
| Education: HS graduate vs. Grand Mean | -0.1225 | 0.1939 | -0.6319 | 0.5275 | -0.5026 | 0.2576 |
| Education: Some college vs. Grand Mean | 0.3358 | 0.1946 | 1.7257 | 0.0845 | -0.0457 | 0.7174 |
With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.
When a categorical variable is truly ordinal (like
education), we can test for specific patterns using orthogonal
polynomial contrasts (contr.poly). These decompose the
group differences into linear, quadratic, and cubic trends.
# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)
mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
data = brfss_dv)
tidy(mod_ord, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "\\.L$") ~ "Education: Linear trend",
str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
str_detect(term, "\\.C$") ~ "Education: Cubic trend",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Linear trend | -0.6642 | 0.3158 | -2.1028 | 0.0355 | -1.2833 | -0.0450 |
| Education: Quadratic trend | -0.2133 | 0.2682 | -0.7954 | 0.4264 | -0.7391 | 0.3125 |
| Education: Cubic trend | -0.5630 | 0.2142 | -2.6282 | 0.0086 | -0.9830 | -0.1431 |
Interpretation:
Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
| Coding Scheme | R Function | Intercept | Each β represents | Best for |
|---|---|---|---|---|
| Treatment (Reference) | contr.treatment (default) | Reference group mean | Difference from reference group | Group comparisons to baseline |
| Effect (Deviation) | contr.sum | Grand mean | Deviation from grand mean | ANOVA-style analyses |
| Polynomial (Ordinal) | contr.poly | Grand mean | Linear/quadratic/cubic trend | Ordinal variables with ordered levels |
Guidelines for choosing the reference group:
as.factor() Is RequiredIf a categorical variable is stored as numeric in your data (e.g.,
coded 0, 1, 2, 3), R will treat it as continuous by default. You
must use as.factor() or
factor() to tell R it is categorical:
# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)
# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)
# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
"Numeric (wrong)", 1, 1,
"Factor (correct)", 3, 3
) |>
kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | Number of education coefficients | Degrees of freedom used |
|---|---|---|
| Numeric (wrong) | 1 | 1 |
| Factor (correct) | 3 | 3 |
What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:
Option 1: Change the reference group with
relevel().
Option 2: Compute the difference manually from the model output.
# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]
# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")
cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")## Estimated difference (HS grad - College grad): 0.556 days
## F-statistic: 4.66
## p-value: 0.0309
car::linearHypothesis()is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
| Concept | Key Point |
|---|---|
| Categorical predictors | Cannot be included as raw numeric codes in regression |
| Dummy variables | Binary (0/1) indicators; need \(k - 1\) for \(k\) categories |
| Reference group | The omitted category; all comparisons are relative to it |
| Changing reference | Use relevel(); predictions unchanged, interpretation
changes |
| Partial F-test | Tests whether the categorical variable as a whole is significant |
| Dummy variable trap | Including \(k\) dummies + intercept = perfect multicollinearity |
as.factor() |
Required when categorical variable is stored as numeric |
| Coding schemes | Treatment (default), effect, polynomial — each answers a different question |
| Type III ANOVA | Preferred for unbalanced observational data |
| Linear hypothesis | linearHypothesis() tests comparisons between
non-reference groups |
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| SVariable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv <- readRDS(
"brfss_dv_2020.rds"
)
#Task 1a
library(tidyverse)
library(gtsummary)
# Descriptive statistics table
brfss_dv %>%
select(menthlth_days, age, sex, gen_health, marital_status) %>%
tbl_summary(
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 2
)| Characteristic | N = 5,0001 |
|---|---|
| menthlth_days | 3.77 (7.90) |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 54.86 (17.53) |
| sex | |
| Male | 2,303 (46%) |
| Female | 2,697 (54%) |
| gen_health | |
| Excellent | 1,065 (21%) |
| Very good | 1,803 (36%) |
| Good | 1,426 (29%) |
| Fair | 523 (10%) |
| Poor | 183 (3.7%) |
| marital_status | |
| Married | 2,708 (54%) |
| Divorced | 622 (12%) |
| Widowed | 534 (11%) |
| Separated | 109 (2.2%) |
| Never married | 848 (17%) |
| Unmarried couple | 179 (3.6%) |
| 1 Mean (SD); n (%) | |
#Task 1b
ggplot(brfss_dv, aes(x = factor(gen_health,
levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
y = menthlth_days)) +
geom_boxplot() +
labs(
title = "Mentally Unhealthy Days by General Health",
x = "General Health",
y = "Mentally Unhealthy Days (past 30 days)"
) +
theme_minimal()#task 1c
brfss_dv %>%
group_by(marital_status) %>%
summarise(
mean_menthlth_days = mean(menthlth_days, na.rm = TRUE)
) %>%
arrange(desc(mean_menthlth_days))## # A tibble: 6 × 2
## marital_status mean_menthlth_days
## <fct> <dbl>
## 1 Separated 6.22
## 2 Unmarried couple 6.07
## 3 Never married 5.28
## 4 Divorced 4.49
## 5 Married 3.10
## 6 Widowed 2.67
#task 2a
brfss_dv <- brfss_dv %>%
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5
)
)
model1 <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
summary(model1)##
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6718 0.2705 -2.484 0.013 *
## gen_health_numeric 1.8578 0.1036 17.926 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared: 0.06041, Adjusted R-squared: 0.06022
## F-statistic: 321.3 on 1 and 4998 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7814 -4.0708 -2.7077 -0.1174 27.8826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1174 0.2332 9.079 < 2e-16 ***
## gen_healthVery good 0.5903 0.2941 2.007 0.0448 *
## gen_healthGood 1.9535 0.3082 6.337 2.54e-10 ***
## gen_healthFair 5.0624 0.4064 12.457 < 2e-16 ***
## gen_healthPoor 9.6640 0.6090 15.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared: 0.07334, Adjusted R-squared: 0.07259
## F-statistic: 98.83 on 4 and 4995 DF, p-value: < 2.2e-16
#Task 3a
model3 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
summary(model3)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.592982 0.630441 15.216 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthVery good 0.789947 0.279661 2.825 0.00475 **
## gen_healthGood 1.843601 0.297260 6.202 6.03e-10 ***
## gen_healthFair 3.395283 0.417964 8.123 5.66e-16 ***
## gen_healthPoor 5.335347 0.682949 7.812 6.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
#Task 3c
library(broom)
library(ggplot2)
library(dplyr)
# Extract coefficients + CIs
coef_data <- tidy(model3, conf.int = TRUE) %>%
filter(grepl("gen_health", term))
# Clean labels
coef_data <- coef_data %>%
mutate(term = case_when(
term == "gen_healthVery good" ~ "Very good",
term == "gen_healthGood" ~ "Good",
term == "gen_healthFair" ~ "Fair",
term == "gen_healthPoor" ~ "Poor",
TRUE ~ term
))
# Plot
ggplot(coef_data, aes(x = estimate, y = term)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed") +
labs(
title = "Effect of General Health on Mentally Unhealthy Days",
x = "Coefficient (Difference from Excellent)",
y = "General Health Category"
) +
theme_minimal()#Task 4a
brfss_dv <- brfss_dv %>%
mutate(gen_health = relevel(gen_health, ref = "Good"))
model4 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
summary(model4)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.436584 0.629825 18.158 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthExcellent -1.843601 0.297260 -6.202 6.03e-10 ***
## gen_healthVery good -1.053654 0.258126 -4.082 4.54e-05 ***
## gen_healthFair 1.551682 0.386128 4.019 5.94e-05 ***
## gen_healthPoor 3.491746 0.650560 5.367 8.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
#Task 4c
#Get predictions from both models
pred1 <- predict(model3)
pred2 <- predict(model4)
#Compute correlation
cor(pred1, pred2)## [1] 1
#Task 5a
model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
data = brfss_dv)
summary(model_reduced)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
## data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.314 -3.520 -1.805 0.159 32.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.504082 0.612374 17.153 < 2e-16 ***
## age -0.077335 0.005962 -12.972 < 2e-16 ***
## sexFemale 1.686413 0.207440 8.130 5.38e-16 ***
## physhlth_days 0.317247 0.013210 24.015 < 2e-16 ***
## sleep_hrs -0.633026 0.077171 -8.203 2.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1515
## F-statistic: 224.2 on 4 and 4995 DF, p-value: < 2.2e-16
## [1] 0.1521948
## [1] 0.1515159
## [1] 0.1694246
## [1] 0.1680933
## Analysis of Variance Table
##
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4995 264715
## 2 4991 259335 4 5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type III tests)
##
## Response: menthlth_days
## Sum Sq Df F value Pr(>F)
## (Intercept) 12031 1 231.536 < 2.2e-16 ***
## age 10908 1 209.926 < 2.2e-16 ***
## sex 3664 1 70.512 < 2.2e-16 ***
## physhlth_days 10634 1 204.654 < 2.2e-16 ***
## sleep_hrs 3049 1 58.687 2.207e-14 ***
## gen_health 5380 4 25.884 < 2.2e-16 ***
## Residuals 259335 4991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
The analytic sample included 5,000 respondents. On average, participants reported 3.77 mentally unhealthy days in the past 30 days (SD = 7.90). The mean age of respondents was 54.86 years (SD = 17.53).
In terms of sex, the sample was slightly majority female (54%), with 46% male participants.
Self-reported general health was most commonly rated as “very good” (36%), followed by “good” (29%) and “excellent” (21%). Smaller proportions reported “fair” (10%) or “poor” (3.7%) health.
Regarding marital status, over half of respondents were married (54%). The remaining participants were never married (17%), divorced (12%), widowed (11%), unmarried couples (3.6%), and separated (2.2%).
Overall, the sample is predominantly middle-aged to older adults, slightly more female, with most individuals reporting relatively good general health and being married.
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
The group reporting the most mentally unhealthy days is the “Poor” general health group, which has the highest median and overall distribution of mentally unhealthy days.
The pattern shows a clear gradient: as general health worsens (from Excellent → Very good → Good → Fair → Poor), the number of mentally unhealthy days increases. Individuals in the “Excellent” category report very few mentally unhealthy days, while those in “Fair” and especially “Poor” report substantially more.
This pattern is consistent with expectations. Poorer self-rated general health is typically associated with greater physical and mental health burdens, which would lead to more mentally unhealthy days. The increasing spread in the worse health categories also suggests greater variability in mental health among those groups.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
The marital status group with the highest mean number of mentally unhealthy days is Separated (6.22 days), while the group with the lowest mean is Widowed (2.67 days).
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
The coefficient for gen_health_numeric is 1.8578.
This means that for each one-unit increase in general health (i.e., moving from a better to a worse health category, such as from “Excellent” to “Very good”), the number of mentally unhealthy days increases by approximately 1.86 days on average.
This suggests a positive association between poorer self-rated general health and an increase in mentally unhealthy days.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
Why does the factor version use 4 coefficients instead of 1?
The factor model uses 4 coefficients because gen_health
has 5 categories, and one category (in this case, “Excellent”) is used
as the reference group. The remaining four categories (Very good, Good,
Fair, Poor) each get their own coefficient, representing the difference
in mean mentally unhealthy days compared to the reference group.
Comparison of the two models
The numeric model estimates a single coefficient (1.8578), assuming that each step from one health category to the next increases mentally unhealthy days by the same amount.
In contrast, the factor model estimates separate effects:
Very good: +0.59 days
Good: +1.95 days
Fair: +5.06 days
Poor: +9.66 days
This shows that the increase in mentally unhealthy days is not linear, the jump from “Fair” to “Poor” is much larger than from “Excellent” to “Very good.”
Additionally, the factor model has a slightly higher R² (0.073 vs. 0.060), indicating a better fit to the data.
Why the numeric approach may be misleading
The numeric approach assumes that the categories of general health are equally spaced, meaning each step has the same effect on mentally unhealthy days. However, the factor model shows this is not true, the increases vary substantially across categories.
Because of this, the numeric model oversimplifies the relationship and may underestimate or overestimate effects for certain groups. The factor model is more appropriate because it allows for nonlinear differences between categories.
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
menthlth_days_hat = 9.5930
- 0.0867(age)
+ 1.7254(Female)
+ 0.2314(physhlth_days)
- 0.5866(sleep_hrs)
+ 0.7899(Very good)
+ 1.8436(Good)
+ 3.3953(Fair)
+ 5.3353(Poor)
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
In this model, “Excellent” general health is the reference group, so all coefficients are interpreted relative to individuals reporting excellent health.
Very good: Individuals reporting very good health have, on average, 0.79 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.
Good: Individuals reporting good health have, on average, 1.84 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.
Fair: Individuals reporting fair health have, on average, 3.40 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.
Poor: Individuals reporting poor health have, on average, 5.34 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
The Poor general health group differs the most from the reference group (Excellent health), as it has the largest positive coefficient. This indicates that individuals reporting poor health have the greatest increase in mentally unhealthy days compared to those in excellent health, holding all other variables constant.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
The coefficients for the continuous variables (such as age, physical health days, and sleep hours) are the same in both models (original reference = Excellent vs. new reference = Good). This is because changing the reference group for a categorical variable (gen_health) only affects the intercept and the dummy variable coefficients for that categorical variable. It does not affect the slopes of the continuous predictors, which represent the independent relationship between each continuous variable and the outcome. Therefore, the continuous variable coefficients remain unchanged because their estimates are not dependent on the choice of reference category for gen_health.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
The correlation between the predicted values from the two models is approximately 1, indicating that the predictions are identical.
Changing the reference group does not change the predicted values because it only reparameterizes the model, it shifts how the coefficients are expressed (i.e., the intercept and dummy variable coefficients), but the underlying fitted values remain the same. In other words, both models represent the same relationships in the data, just using a different baseline category, so they produce identical predictions.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
Reduced Model
R^2 = 0.1522 Adjusted R^2 = 0.1515
Full Model
R^2 = 0.1694
Adjusted R^2 = 0.1681
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
Null hypothesis (H₀): All coefficients for
gen_health = 0
(gen_health does not improve the model)
Alternative hypothesis (H₁): At least one
gen_health coefficient ≠ 0
(gen_health improves the model)
Test Results
F-statistic: 25.88
Degrees of freedom: (4, 4991)
p-value: < 2.2 × 10⁻¹⁶
A partial F-test comparing the reduced and full models yielded an F-statistic of 25.88 with 4 and 4991 degrees of freedom, and a p-value \< 2.2 × 10⁻¹⁶. Since the p-value is extremely small, we reject the null hypothesis and conclude that gen_health as a whole significantly improves the model, even after adjusting for age, sex, physical health days, and sleep hours.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Using a Type III ANOVA, the overall effect of gen_health is statistically significant (F = 25.884, df = 4, p \< 2.2 × 10⁻¹⁶). This result is consistent with the partial F-test conducted earlier, which also yielded an F-statistic of 25.88 with 4 and 4991 degrees of freedom and a p-value \< 2.2 × 10⁻¹⁶. Both methods test whether all gen_health coefficients are jointly equal to zero, and both lead to the same conclusion that gen_health significantly improves the model.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
People who report worse general health tend to experience more mentally unhealthy days. Compared to those in excellent health, individuals in poor health report about 5 more mentally unhealthy days per month, while those in fair health report about 3 more days. Smaller differences are seen for those in good or very good health. Because this data was collected at one point in time, we cannot determine whether poorer health causes worse mental health—only that they are related.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
General health appears to be more strongly associated with mentally unhealthy days than education, as it shows larger differences across categories and contributes more to explaining variation in the outcome. In particular, the gap between excellent and poor health is substantial compared to typical differences seen across education levels. If building a final model, I would prioritize including general health, while considering education if it adds meaningful explanatory value or is important for the research question. Ultimately, the decision would depend on both the strength of association and the conceptual relevance of each variable. In contrast, the education model showed smaller differences in mentally unhealthy days across categories.
End of Lab Activity