In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
brfss_dv <- brfss_full |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# General health status (5-level)
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Marital status
marital_status = factor(case_when(
marital == 1 ~ "Married",
marital == 2 ~ "Divorced",
marital == 3 ~ "Widowed",
marital == 4 ~ "Separated",
marital == 5 ~ "Never married",
marital == 6 ~ "Unmarried couple",
TRUE ~ NA_character_
), levels = c("Married", "Divorced", "Widowed", "Separated",
"Never married", "Unmarried couple")),
# Store the raw education numeric code for the "naive approach" demonstration
educ_numeric = case_when(
educa %in% c(1, 2, 3) ~ 1,
educa == 4 ~ 2,
educa == 5 ~ 3,
educa == 6 ~ 4,
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(gen_health),
!is.na(marital_status)
)
# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health, marital_status, educ_numeric) |>
slice_sample(n = 5000)
setwd("C:/Users/Alexandra/Desktop/Epi 553/Labs")
# Save for lab activity
saveRDS(brfss_dv,
"brfss_dv_2020.rds")
brfss_dv <- readRDS("brfss_dv_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dv), ncol(brfss_dv))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Sleep (hours/night) | 5,000 | 7.0 (1.4) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
ggplot(brfss_dv, aes(x = education, fill = education)) +
geom_bar(alpha = 0.85) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Distribution of Education Level in Analytic Sample
ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Education Level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Education Level",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by Education Level
Categorical predictor variables come in two forms:
| Type | Definition | Examples |
|---|---|---|
| Nominal | Categories with no natural ordering | Sex, race/ethnicity, marital status, blood type |
| Ordinal | Categories with a meaningful order | Education level, income bracket, disease stage, Likert scale |
A further distinction is:
Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.
Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.
If we include this numeric code directly in a regression model, we are assuming:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]
This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Education Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5601 | 0.5039 | 18.9727 | 0 | 8.5723 | 10.5479 |
| age | -0.0661 | 0.0063 | -10.5135 | 0 | -0.0784 | -0.0538 |
| educ_numeric | -0.7168 | 0.1158 | -6.1917 | 0 | -0.9437 | -0.4898 |
This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:
Let’s visualize why this matters:
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
educ_numeric = 1:4,
predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)
ggplot() +
geom_point(data = group_means,
aes(x = educ_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = educ_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = educ_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:4,
labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "Education Level",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)Naive Linear Fit vs. Actual Group Means by Education
Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:
If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).
Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.
The simplest example is a variable with two categories, such as sex.
With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:
\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]
The regression model becomes:
\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]
For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]
For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]
Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.
# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)
tidy(mod_sex, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: Sex",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.6262 | 0.3730 | 17.7666 | 0 | 5.8951 | 7.3574 |
| age | -0.0698 | 0.0063 | -11.1011 | 0 | -0.0821 | -0.0575 |
| sexFemale | 1.8031 | 0.2210 | 8.1585 | 0 | 1.3698 | 2.2364 |
Interpretation:
Note that R automatically creates dummy variables when a factor is included in
lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.
pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))
ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Sex",
subtitle = "Parallel lines: same slope, different intercepts",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Parallel Regression Lines: Males vs. Females
Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.
Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:
\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]
The data matrix looks like this:
| Observation | Education | HS_graduate | Some_college | College_graduate |
|---|---|---|---|---|
| 1 | Less than HS | 0 | 0 | 0 |
| 2 | HS graduate | 1 | 0 | 0 |
| 3 | Some college | 0 | 1 | 0 |
| 4 | College graduate | 0 | 0 | 1 |
| 5 | Less than HS | 0 | 0 | 0 |
Notice that the reference group is identified by having all dummy variables equal to zero.
The reference group is the category against which all others are compared. Key points:
When we include a factor variable in lm(), R
automatically creates the dummy variables. The first level of the factor
is used as the reference group by default.
# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
data = brfss_dv)
tidy(mod_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Education Dummy Variables (Reference: Less than HS)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
The model is:
\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]
Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:
HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.
Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Education",
subtitle = "Parallel lines: same slopes for age, different intercepts by education",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Predicted Mental Health Days by Age and Education Level
These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.
relevel() in RWe may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.
# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")
mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
data = brfss_dv)
tidy(mod_educ_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: College graduate)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9948 | 0.6272 | 15.9349 | 0.0000 | 8.7652 | 11.2245 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| education_rerefLess than HS | 1.1429 | 0.4607 | 2.4805 | 0.0132 | 0.2396 | 2.0461 |
| education_rerefHS graduate | 0.5556 | 0.2574 | 2.1586 | 0.0309 | 0.0510 | 1.0601 |
| education_rerefSome college | 1.0139 | 0.2566 | 3.9507 | 0.0001 | 0.5108 | 1.5171 |
tribble(
~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
"Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
"Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
"Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
"Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
"Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
"R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
"Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Less than HS | Ref: College graduate |
|---|---|---|
| Intercept | 11.1380 | 9.9950 |
| Age coefficient | -0.0770 | -0.0770 |
| Sex coefficient | 1.6810 | 1.6810 |
| Physical health days | 0.3110 | 0.3110 |
| Sleep hours | -0.6280 | -0.6280 |
| R-squared | 0.1553 | 0.1553 |
| Residual SE | 7.2690 | 7.2690 |
What changes:
What stays the same:
This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:
\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]
where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.
This is called the dummy variable trap.
| Obs | Intercept | A | B | C | A + B + C |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 1 |
Solutions:
- 1 in the formula and include all \(k\) dummies. Then each coefficient is the
group mean (adjusted for other predictors) rather than a difference from
a reference.# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
data = brfss_dv)
tidy(mod_no_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model Without Intercept: All k Education Dummies Included",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexMale | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| sexFemale | 12.8190 | 0.7524 | 17.0365 | 0.0000 | 11.3439 | 14.2941 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.
When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.
To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):
\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]
This compares the full model (with education) to a reduced model (without education):
# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
# Partial F-test
f_test <- anova(mod_reduced, mod_educ)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education | 4992 | 263744.4 | 3 | 970.7509 | 6.1246 | 4e-04 |
car::Anova() for Type III TestsThe car::Anova() function with type = "III"
provides a convenient way to test the overall significance of each
predictor, including categorical variables:
Anova(mod_educ, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12000.1867 | 1 | 227.1325 | 0e+00 |
| age | 8863.3522 | 1 | 167.7603 | 0e+00 |
| sex | 3469.6448 | 1 | 65.6714 | 0e+00 |
| physhlth_days | 28765.1139 | 1 | 544.4492 | 0e+00 |
| sleep_hrs | 3506.1243 | 1 | 66.3619 | 0e+00 |
| education | 970.7509 | 3 | 6.1246 | 4e-04 |
| Residuals | 263744.4348 | 4992 | NA | NA |
Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
This is what R uses by default (contr.treatment). Each
coefficient represents the difference between a group and the reference
group.
## HS graduate Some college College graduate
## Less than HS 0 0 0
## HS graduate 1 0 0
## Some college 0 1 0
## College graduate 0 0 1
In effect coding (contr.sum), each
coefficient represents the difference between a group’s mean and the
grand mean (the unweighted average of all group means).
This is common in ANOVA contexts.
# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)
mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
data = brfss_dv)
tidy(mod_effect, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Less than HS vs. Grand Mean | 0.4648 | 0.3323 | 1.3988 | 0.1619 | -0.1866 | 1.1162 |
| Education: HS graduate vs. Grand Mean | -0.1225 | 0.1939 | -0.6319 | 0.5275 | -0.5026 | 0.2576 |
| Education: Some college vs. Grand Mean | 0.3358 | 0.1946 | 1.7257 | 0.0845 | -0.0457 | 0.7174 |
With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.
When a categorical variable is truly ordinal (like
education), we can test for specific patterns using orthogonal
polynomial contrasts (contr.poly). These decompose the
group differences into linear, quadratic, and cubic trends.
# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)
mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
data = brfss_dv)
tidy(mod_ord, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "\\.L$") ~ "Education: Linear trend",
str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
str_detect(term, "\\.C$") ~ "Education: Cubic trend",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Linear trend | -0.6642 | 0.3158 | -2.1028 | 0.0355 | -1.2833 | -0.0450 |
| Education: Quadratic trend | -0.2133 | 0.2682 | -0.7954 | 0.4264 | -0.7391 | 0.3125 |
| Education: Cubic trend | -0.5630 | 0.2142 | -2.6282 | 0.0086 | -0.9830 | -0.1431 |
Interpretation:
Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
| Coding Scheme | R Function | Intercept | Each β represents | Best for |
|---|---|---|---|---|
| Treatment (Reference) | contr.treatment (default) | Reference group mean | Difference from reference group | Group comparisons to baseline |
| Effect (Deviation) | contr.sum | Grand mean | Deviation from grand mean | ANOVA-style analyses |
| Polynomial (Ordinal) | contr.poly | Grand mean | Linear/quadratic/cubic trend | Ordinal variables with ordered levels |
Guidelines for choosing the reference group:
as.factor() Is RequiredIf a categorical variable is stored as numeric in your data (e.g.,
coded 0, 1, 2, 3), R will treat it as continuous by default. You
must use as.factor() or
factor() to tell R it is categorical:
# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)
# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)
# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
"Numeric (wrong)", 1, 1,
"Factor (correct)", 3, 3
) |>
kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | Number of education coefficients | Degrees of freedom used |
|---|---|---|
| Numeric (wrong) | 1 | 1 |
| Factor (correct) | 3 | 3 |
What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:
Option 1: Change the reference group with
relevel().
Option 2: Compute the difference manually from the model output.
# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]
# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")
cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")## Estimated difference (HS grad - College grad): 0.556 days
## F-statistic: 4.66
## p-value: 0.0309
car::linearHypothesis()is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
| Concept | Key Point |
|---|---|
| Categorical predictors | Cannot be included as raw numeric codes in regression |
| Dummy variables | Binary (0/1) indicators; need \(k - 1\) for \(k\) categories |
| Reference group | The omitted category; all comparisons are relative to it |
| Changing reference | Use relevel(); predictions unchanged, interpretation
changes |
| Partial F-test | Tests whether the categorical variable as a whole is significant |
| Dummy variable trap | Including \(k\) dummies + intercept = perfect multicollinearity |
as.factor() |
Required when categorical variable is stored as numeric |
| Coding schemes | Treatment (default), effect, polynomial — each answers a different question |
| Type III ANOVA | Preferred for unbalanced observational data |
| Linear hypothesis | linearHypothesis() tests comparisons between
non-reference groups |
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
###1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by General Health
marital_status. Which marital status group has the highest
mean? The lowest?# Compute observed group means
groups_lab <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(marital_status))
ggplot(groups_lab, aes(x = marital_status, y = mean_days, fill = marital_status)) +
geom_col(alpha = 0.85) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Marital Status",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_classic(base_size = 13) +
theme(legend.position = "none")Mean Mentally Unhealthy Days by Marital Status in Analytic Sample
Separated people have the highest mean number of mentally unhealthy days and widowed people have the lowest mean number.
gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ as.numeric(gen_health), data=brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: General Health Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| as.numeric(gen_health) | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
In the naive version of the model, for every categorical increase by 1 unit (ie from Excellent down to Very Good) of general health quality, the number of mentally healthy days increases by 1.8.
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.mod <- lm(menthlth_days ~ gen_health, data=brfss_dv)
tidy(mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Categorical Model: General Health Treated as Categorical",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
The factor version reports 4 discrete categories because each category of a factor is split out as its own covariate. This is less misleading than the continuous version because the difference between a single numeric category (ie. very good health to good health) is unlikely to be uniform across all categories, but the naive model interprets them as if they are.
gen_health as a factor:menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
mod_genhealth <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
tidy(mod_genhealth, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dummy Variable: General Health",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
Write out the fitted regression equation.
The model is:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
\[\widehat{\text{Mental Health Days}} = 9.593 + -0.087(\text{Age}) + 1.725(\text{Female}) + 0.231(\text{Phys Days}) + -0.587(\text{Sleep}) + 0.79(\text{Very Good Health}) + -0.129(\text{Good Health}) + -1.143(\text{Fair Health}) NA(\text{Poor Health})\]
gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.The reference group for general health reflects responses reporting “Excellent” general health, and all other dummy variables for general health reflects a comparison with that group. Based on our model, holding all other variables constant, participants reporting “Very good” general health report an increase of 0.78 poor mental health days over a 30-day period. Those reporting “Good” health report an increase of 1.8 poor mental health days over those reporting “excellent” health, holding all other variables constant. Those reporting “Fair” general health report an increase of 3.4 poor mental health days over those who report “excellent health. And those reporting”Poor” general health report 5.33 additional poor mental health days over those who report “Excellent” general health, holding all other variables constant.
gen_health dummy variables only. Which group
differs most from the reference group?tidy(mod_genhealth, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
filter( grepl("gen", term)) %>%
mutate(
term = dplyr::recode(term,
# "menthlth_days" = "Mentally unhealthy days (past 30)",
# "age" = "Age (years)",
# "sexFemale" = "Sex: Female",
# "physhlth_days" = "Physical health days",
"gen_healthVery good"= "General health: Very good",
"gen_healthGood" = "General health: Good",
"gen_healthFair" = "General health: Fair",
"gen_healthPoor" = "General health: Poor"
),
term = fct_reorder(term, estimate),
sig = ifelse(p.value < 0.05, "Significant (p < 0.05)", "Non-significant")
) %>%
ggplot(aes(x = estimate, y = term, color = sig)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray60") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 0.9) +
geom_point(size = 3.5) +
scale_color_manual(values = c("Significant (p < 0.05)" = "steelblue",
"Non-significant" = "tomato")) +
labs(
title = "Partial Regression Coefficients with 95% Confidence Intervals",
subtitle = "Outcome: Mentally Unhealthy Days (BRFSS 2020, n = 5,000)",
x = "Estimated Change in Mental Health Days (β̂)",
y = NULL,
color = NULL
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")People reporting “Poor” general health have the largest difference from the reference group.
relevel() to change
the reference group for gen_health to “Good.” Refit the
model from Task 3a.# Change reference group to "College graduate"Good"
brfss_dv$gen_reref <- relevel(brfss_dv$gen_health, ref = "Good")
mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_reref, data = brfss_dv)
tidy(mod_gen_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
Yes - the continuous variable coefficients are the same between the two models. This is because the difference between the two models is just what the reference group, and only impacts the variables that are relative to the reference group - the dummy variables for general health. The continuous variables are unafffected.
tribble(
~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
"R-squared", round(summary(mod_genhealth)$r.squared, 4), round(summary(mod_gen_reref)$r.squared, 4),
"Residual SE", round(summary(mod_genhealth)$sigma, 3), round(summary(mod_gen_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Excellent | Ref: Good |
|---|---|---|
| R-squared | 0.1694 | 0.1694 |
| Residual SE | 7.2080 | 7.2080 |
Changing the reference group does not change the prediction because the only thing that has changes is the interpretation of the dummy variables, not the model itself.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
tribble(
~Quantity, ~`Full Model`, ~`Reduced Model`,
"R-squared", round(summary(mod_genhealth)$r.squared, 4), round(summary(mod_reduced)$r.squared, 4),
"Residual SE", round(summary(mod_genhealth)$sigma, 3), round(summary(mod_reduced)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Full Model | Reduced Model |
|---|---|---|
| R-squared | 0.1694 | 0.1522 |
| Residual SE | 7.2080 | 7.2800 |
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
# Reduced model (no education)
# Partial F-test
f_test <- anova(mod_reduced, mod_genhealth)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(mod_genhealth, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
Yes - the results of the effect of the Statistic of gen_health as a component to the partial f-test comparing the full model and the reduced model are the same - 25.883.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
This analysis indicates that in the BRFSS 2020 data, participants who reported their general health as “poor”, reported the largest number of additional mentally unhealthy days in a 30-day time period compared to those reporting their general health as “Excellent”. Respondents indicating their general health was poor reported over 5 mentally unhealthy days compared to those who described their general health as excellent. Because the BRFSS dataset is collected all at a single point in time, there is no way to confirm causation or which of these effects occurred first.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
The education model indicated that categories representing increased education corresponded to a decreased number of mentally unhealthy days, and our lab indicated that increasing reporting levels of poor general health corresponded to an increased number of reported mentally unhealthy days. Both models accounted for Sleep + Age + Sex + Physically Unhealthy Days reported, and then either Education category or General Health category - they had very similar R2s, at 0.155 for Education vs. 0.169 for General Health. However - the Type III ANOVA analysis on each model indicated that General Health had a larger effect on the model (25 vs 6), and the effect of the General Health categories on the model - with an expected additional number of 5 days reporting poor mental health for participants reporting poor mental health. Because these measures indicate greater effect on mentally unhealthy days for the general health response compared to the participant’s level of education, I would recommend using reported general health in the final model.
End of Lab Activity