In the previous lectures on Multiple Linear Regression, all predictors we used were either continuous (sleep hours, age, physical health days) or binary (sex, exercise). But many variables in epidemiology are categorical with more than two levels, including race/ethnicity, education, marital status, and disease staging.
When a categorical predictor has \(k\) levels, we cannot simply plug in the numeric codes (1, 2, 3, …) as if the variable were continuous. Doing so imposes an assumption that the categories are equally spaced and linearly related to the outcome, which is rarely appropriate for nominal variables and often inappropriate even for ordinal ones.
Dummy variables (also called indicator variables) provide the correct way to include categorical predictors in regression models. This lecture covers:
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 dataset. In this lecture, we focus on how categorical predictors, particularly education level, relate to mental health outcomes.
Research question for today:
How is educational attainment associated with the number of mentally unhealthy days in the past 30 days, after adjusting for age, sex, physical health, and sleep?
brfss_full <- read_xpt(
"/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_2020"
) |>
clean_names()brfss_dv <- brfss_full |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# General health status (5-level)
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Marital status
marital_status = factor(case_when(
marital == 1 ~ "Married",
marital == 2 ~ "Divorced",
marital == 3 ~ "Widowed",
marital == 4 ~ "Separated",
marital == 5 ~ "Never married",
marital == 6 ~ "Unmarried couple",
TRUE ~ NA_character_
), levels = c("Married", "Divorced", "Widowed", "Separated",
"Never married", "Unmarried couple")),
# Store the raw education numeric code for the "naive approach" demonstration
educ_numeric = case_when(
educa %in% c(1, 2, 3) ~ 1,
educa == 4 ~ 2,
educa == 5 ~ 3,
educa == 6 ~ 4,
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(gen_health),
!is.na(marital_status)
)
# Reproducible random sample
set.seed(1220)
brfss_dv <- brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health, marital_status, educ_numeric) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_dv,
"/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_dv_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dv), ncol(brfss_dv))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_dv |>
select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Sleep (hours/night) | 5,000 | 7.0 (1.4) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
ggplot(brfss_dv, aes(x = education, fill = education)) +
geom_bar(alpha = 0.85) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Distribution of Education Level in Analytic Sample
ggplot(brfss_dv, aes(x = education, y = menthlth_days, fill = education)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by Education Level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Education Level",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by Education Level
Categorical predictor variables come in two forms:
| Type | Definition | Examples |
|---|---|---|
| Nominal | Categories with no natural ordering | Sex, race/ethnicity, marital status, blood type |
| Ordinal | Categories with a meaningful order | Education level, income bracket, disease stage, Likert scale |
A further distinction is:
Note that categorical variables can also be created by grouping continuous variables (e.g., age groups from continuous age), though this generally results in a loss of information.
Suppose education has been coded as: 1 = Less than HS, 2 = HS graduate, 3 = Some college, 4 = College graduate.
If we include this numeric code directly in a regression model, we are assuming:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 \cdot \text{educ_numeric} + \varepsilon\]
This forces the model to assume that the difference in expected \(Y\) between “Less than HS” and “HS graduate” is the same as the difference between “HS graduate” and “Some college,” and the same again between “Some college” and “College graduate.” In other words, we are assuming equally spaced, linear increments.
# The WRONG way: treating education as a continuous numeric variable
naive_mod <- lm(menthlth_days ~ age + educ_numeric, data = brfss_dv)
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Education Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5601 | 0.5039 | 18.9727 | 0 | 8.5723 | 10.5479 |
| age | -0.0661 | 0.0063 | -10.5135 | 0 | -0.0784 | -0.0538 |
| educ_numeric | -0.7168 | 0.1158 | -6.1917 | 0 | -0.9437 | -0.4898 |
This model estimates a single coefficient for education, meaning each step up the education ladder is associated with the same change in mentally unhealthy days. This constraint is problematic for two reasons:
Let’s visualize why this matters:
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(education, educ_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
educ_numeric = 1:4,
predicted = predict(naive_mod, newdata = tibble(age = mean(brfss_dv$age), educ_numeric = 1:4))
)
ggplot() +
geom_point(data = group_means,
aes(x = educ_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = educ_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = educ_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:4,
labels = c("Less than HS", "HS graduate", "Some college", "College graduate")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "Education Level",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)Naive Linear Fit vs. Actual Group Means by Education
Key takeaway: The blue dots (observed means) do not fall along a straight line. The naive linear model (red) misrepresents the actual pattern. We need a more flexible approach.
A dummy variable (also called an indicator variable) is a variable that takes on only two possible values:
If a categorical predictor has \(k\) categories, we need exactly \(k - 1\) dummy variables when the model includes an intercept. The omitted category becomes the reference group (also called the control group or baseline group).
Why \(k - 1\) and not \(k\)? Because the intercept already captures the mean for the reference group. Including all \(k\) dummies plus the intercept would create perfect multicollinearity (the dummy variables would sum to equal the intercept column), and the model could not be estimated.
The simplest example is a variable with two categories, such as sex.
With \(k = 2\), we need \(2 - 1 = 1\) dummy variable. If we choose Female as the reference group:
\[\text{male} = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}\]
The regression model becomes:
\[Y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{male} + \varepsilon\]
For males (\(\text{male} = 1\)): \[E(Y | \text{age}, \text{male}) = (\beta_0 + \beta_2) + \beta_1 \cdot \text{age}\]
For females (\(\text{male} = 0\)): \[E(Y | \text{age}, \text{female}) = \beta_0 + \beta_1 \cdot \text{age}\]
Both groups share the same slope for age but have different intercepts. The coefficient \(\beta_2\) is the expected difference in \(Y\) between males and females, holding age constant.
# Fit model with sex as a dummy variable
mod_sex <- lm(menthlth_days ~ age + sex, data = brfss_dv)
tidy(mod_sex, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: Sex",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.6262 | 0.3730 | 17.7666 | 0 | 5.8951 | 7.3574 |
| age | -0.0698 | 0.0063 | -11.1011 | 0 | -0.0821 | -0.0575 |
| sexFemale | 1.8031 | 0.2210 | 8.1585 | 0 | 1.3698 | 2.2364 |
Interpretation:
Note that R automatically creates dummy variables when a factor is included in
lm(). It uses alphabetical or level order to set the reference group, which is why Male (the first level) is the reference here.
pred_sex <- ggpredict(mod_sex, terms = c("age [20:80]", "sex"))
ggplot(pred_sex, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Sex",
subtitle = "Parallel lines: same slope, different intercepts",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Parallel Regression Lines: Males vs. Females
Geometrically: Dummy variables produce parallel regression lines. The intercept shifts by \(\beta_2\) for the non-reference group, but the slope remains the same.
Education has \(k = 4\) categories, so we need \(4 - 1 = 3\) dummy variables. If we choose “Less than HS” as the reference group:
\[\text{HS_graduate} = \begin{cases} 1 & \text{if HS graduate} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{Some_college} = \begin{cases} 1 & \text{if Some college} \\ 0 & \text{otherwise} \end{cases}\]
\[\text{College_graduate} = \begin{cases} 1 & \text{if College graduate} \\ 0 & \text{otherwise} \end{cases}\]
The data matrix looks like this:
| Observation | Education | HS_graduate | Some_college | College_graduate |
|---|---|---|---|---|
| 1 | Less than HS | 0 | 0 | 0 |
| 2 | HS graduate | 1 | 0 | 0 |
| 3 | Some college | 0 | 1 | 0 |
| 4 | College graduate | 0 | 0 | 1 |
| 5 | Less than HS | 0 | 0 | 0 |
Notice that the reference group is identified by having all dummy variables equal to zero.
The reference group is the category against which all others are compared. Key points:
When we include a factor variable in lm(), R
automatically creates the dummy variables. The first level of the factor
is used as the reference group by default.
# Fit model with education as a factor (R creates dummies automatically)
mod_educ <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education,
data = brfss_dv)
tidy(mod_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Education Dummy Variables (Reference: Less than HS)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
The model is:
\[\widehat{\text{Mental Health Days}} = 11.138 + -0.077(\text{Age}) + 1.681(\text{Female}) + 0.311(\text{Phys Days}) + -0.628(\text{Sleep}) + -0.587(\text{HS grad}) + -0.129(\text{Some college}) + -1.143(\text{College grad})\]
Each education coefficient represents the estimated difference in mentally unhealthy days between that group and the reference group (Less than HS), holding all other variables constant:
HS graduate (\(\hat{\beta}\) = -0.587): Compared to those with less than a high school education, HS graduates report an estimated 0.587 fewer mentally unhealthy days, holding age, sex, physical health days, and sleep constant.
Some college (\(\hat{\beta}\) = -0.129): Compared to those with less than a high school education, those with some college report an estimated 0.129 fewer mentally unhealthy days, holding all else constant.
College graduate (\(\hat{\beta}\) = -1.143): Compared to those with less than a high school education, college graduates report an estimated 1.143 fewer mentally unhealthy days, holding all else constant.
Key pattern: All comparisons are made relative to the reference group. The coefficients do NOT directly tell us the difference between, say, HS graduates and college graduates. We would need to compute \(\hat{\beta}_{\text{HS grad}} - \hat{\beta}_{\text{College grad}}\) for that comparison (or change the reference group).
pred_educ <- ggpredict(mod_educ, terms = c("age [20:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and Education",
subtitle = "Parallel lines: same slopes for age, different intercepts by education",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Predicted Mental Health Days by Age and Education Level
These are a series of parallel lines, one for each education level. The slope for age is the same across all groups; only the intercept differs. Each education dummy shifts the intercept up or down relative to the reference group.
relevel() in RWe may want to change the reference group to a category that is more epidemiologically meaningful. For instance, “College graduate” is the largest group and could serve as a natural comparison.
# Change reference group to College graduate
brfss_dv$education_reref <- relevel(brfss_dv$education, ref = "College graduate")
mod_educ_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_reref,
data = brfss_dv)
tidy(mod_educ_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: College graduate)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9948 | 0.6272 | 15.9349 | 0.0000 | 8.7652 | 11.2245 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| education_rerefLess than HS | 1.1429 | 0.4607 | 2.4805 | 0.0132 | 0.2396 | 2.0461 |
| education_rerefHS graduate | 0.5556 | 0.2574 | 2.1586 | 0.0309 | 0.0510 | 1.0601 |
| education_rerefSome college | 1.0139 | 0.2566 | 3.9507 | 0.0001 | 0.5108 | 1.5171 |
tribble(
~Quantity, ~`Ref: Less than HS`, ~`Ref: College graduate`,
"Intercept", round(coef(mod_educ)[1], 3), round(coef(mod_educ_reref)[1], 3),
"Age coefficient", round(coef(mod_educ)[2], 3), round(coef(mod_educ_reref)[2], 3),
"Sex coefficient", round(coef(mod_educ)[3], 3), round(coef(mod_educ_reref)[3], 3),
"Physical health days", round(coef(mod_educ)[4], 3), round(coef(mod_educ_reref)[4], 3),
"Sleep hours", round(coef(mod_educ)[5], 3), round(coef(mod_educ_reref)[5], 3),
"R-squared", round(summary(mod_educ)$r.squared, 4), round(summary(mod_educ_reref)$r.squared, 4),
"Residual SE", round(summary(mod_educ)$sigma, 3), round(summary(mod_educ_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Less than HS | Ref: College graduate |
|---|---|---|
| Intercept | 11.1380 | 9.9950 |
| Age coefficient | -0.0770 | -0.0770 |
| Sex coefficient | 1.6810 | 1.6810 |
| Physical health days | 0.3110 | 0.3110 |
| Sleep hours | -0.6280 | -0.6280 |
| R-squared | 0.1553 | 0.1553 |
| Residual SE | 7.2690 | 7.2690 |
What changes:
What stays the same:
This is a critical point: Changing the reference group does not change the model’s fit or predictions. It only changes the interpretation of the dummy variable coefficients.
# Verify that predicted values are identical
pred_orig <- predict(mod_educ)
pred_reref <- predict(mod_educ_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
If we include \(k\) dummy variables and an intercept for a variable with \(k\) categories, the columns of the design matrix \(X\) are linearly dependent. Specifically:
\[\text{Intercept} = D_1 + D_2 + \cdots + D_k\]
where \(D_1, \ldots, D_k\) are the \(k\) dummy variables (one for each category). This means the matrix \(X^TX\) is singular and cannot be inverted, so the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^TY\) does not exist.
This is called the dummy variable trap.
| Obs | Intercept | A | B | C | A + B + C |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 1 |
Solutions:
- 1 in the formula and include all \(k\) dummies. Then each coefficient is the
group mean (adjusted for other predictors) rather than a difference from
a reference.# Model without intercept: all k dummies included
mod_no_int <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education - 1,
data = brfss_dv)
tidy(mod_no_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model Without Intercept: All k Education Dummies Included",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexMale | 11.1377 | 0.7390 | 15.0709 | 0.0000 | 9.6889 | 12.5865 |
| sexFemale | 12.8190 | 0.7524 | 17.0365 | 0.0000 | 11.3439 | 14.2941 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| educationHS graduate | -0.5873 | 0.4719 | -1.2445 | 0.2134 | -1.5125 | 0.3379 |
| educationSome college | -0.1289 | 0.4735 | -0.2723 | 0.7854 | -1.0572 | 0.7993 |
| educationCollege graduate | -1.1429 | 0.4607 | -2.4805 | 0.0132 | -2.0461 | -0.2396 |
Caution: Removing the intercept changes the interpretation of \(R^2\) and should only be done when there is a substantive reason. In most epidemiological applications, reference cell coding (the default) is preferred.
When a categorical variable with \(k\) levels enters the model as \(k - 1\) dummies, we cannot assess its overall significance by looking at individual t-tests for each dummy. A single dummy might not be statistically significant on its own, yet the variable as a whole might be.
To test whether education as a whole is associated with the outcome, we use a partial F-test (also called an extra sum of squares F-test):
\[H_0: \beta_{\text{HS grad}} = \beta_{\text{Some college}} = \beta_{\text{College grad}} = 0\] \[H_A: \text{At least one } \beta_j \neq 0\]
This compares the full model (with education) to a reduced model (without education):
# Reduced model (no education)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
# Partial F-test
f_test <- anova(mod_reduced, mod_educ)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does Education Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education | 4992 | 263744.4 | 3 | 970.7509 | 6.1246 | 4e-04 |
car::Anova() for Type III TestsThe car::Anova() function with type = "III"
provides a convenient way to test the overall significance of each
predictor, including categorical variables:
Anova(mod_educ, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12000.1867 | 1 | 227.1325 | 0e+00 |
| age | 8863.3522 | 1 | 167.7603 | 0e+00 |
| sex | 3469.6448 | 1 | 65.6714 | 0e+00 |
| physhlth_days | 28765.1139 | 1 | 544.4492 | 0e+00 |
| sleep_hrs | 3506.1243 | 1 | 66.3619 | 0e+00 |
| education | 970.7509 | 3 | 6.1246 | 4e-04 |
| Residuals | 263744.4348 | 4992 | NA | NA |
Type I vs. Type III: Type I (sequential) sums of squares depend on the order variables enter the model. Type III (partial) sums of squares test each variable after all others, regardless of order. For unbalanced observational data (the norm in epidemiology), Type III is preferred.
This is what R uses by default (contr.treatment). Each
coefficient represents the difference between a group and the reference
group.
## HS graduate Some college College graduate
## Less than HS 0 0 0
## HS graduate 1 0 0
## Some college 0 1 0
## College graduate 0 0 1
In effect coding (contr.sum), each
coefficient represents the difference between a group’s mean and the
grand mean (the unweighted average of all group means).
This is common in ANOVA contexts.
# Set effect coding
brfss_dv$education_effect <- brfss_dv$education
contrasts(brfss_dv$education_effect) <- contr.sum(4)
mod_effect <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_effect,
data = brfss_dv)
tidy(mod_effect, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "education_effect1") ~ "Education: Less than HS vs. Grand Mean",
str_detect(term, "education_effect2") ~ "Education: HS graduate vs. Grand Mean",
str_detect(term, "education_effect3") ~ "Education: Some college vs. Grand Mean",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Effect Coding: Each Education Coefficient vs. Grand Mean",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Less than HS vs. Grand Mean | 0.4648 | 0.3323 | 1.3988 | 0.1619 | -0.1866 | 1.1162 |
| Education: HS graduate vs. Grand Mean | -0.1225 | 0.1939 | -0.6319 | 0.5275 | -0.5026 | 0.2576 |
| Education: Some college vs. Grand Mean | 0.3358 | 0.1946 | 1.7257 | 0.0845 | -0.0457 | 0.7174 |
With effect coding, the intercept is the grand mean (adjusted for covariates), and each education coefficient shows how far that group deviates from the grand mean. The omitted group’s deviation is the negative sum of the others.
When a categorical variable is truly ordinal (like
education), we can test for specific patterns using orthogonal
polynomial contrasts (contr.poly). These decompose the
group differences into linear, quadratic, and cubic trends.
# Ordinal polynomial contrasts
brfss_dv$education_ord <- brfss_dv$education
contrasts(brfss_dv$education_ord) <- contr.poly(4)
mod_ord <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + education_ord,
data = brfss_dv)
tidy(mod_ord, conf.int = TRUE) |>
mutate(
term = case_when(
str_detect(term, "\\.L$") ~ "Education: Linear trend",
str_detect(term, "\\.Q$") ~ "Education: Quadratic trend",
str_detect(term, "\\.C$") ~ "Education: Cubic trend",
TRUE ~ term
),
across(where(is.numeric), \(x) round(x, 4))
) |>
kable(
caption = "Polynomial Contrasts: Testing Linear, Quadratic, and Cubic Trends",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6729 | 0.6172 | 17.2911 | 0.0000 | 9.4628 | 11.8830 |
| age | -0.0772 | 0.0060 | -12.9522 | 0.0000 | -0.0888 | -0.0655 |
| sexFemale | 1.6813 | 0.2075 | 8.1038 | 0.0000 | 1.2745 | 2.0880 |
| physhlth_days | 0.3112 | 0.0133 | 23.3334 | 0.0000 | 0.2850 | 0.3373 |
| sleep_hrs | -0.6281 | 0.0771 | -8.1463 | 0.0000 | -0.7793 | -0.4770 |
| Education: Linear trend | -0.6642 | 0.3158 | -2.1028 | 0.0355 | -1.2833 | -0.0450 |
| Education: Quadratic trend | -0.2133 | 0.2682 | -0.7954 | 0.4264 | -0.7391 | 0.3125 |
| Education: Cubic trend | -0.5630 | 0.2142 | -2.6282 | 0.0086 | -0.9830 | -0.1431 |
Interpretation:
Polynomial contrasts are most useful when the categories have a clear, meaningful order and you want to characterize the shape of the trend rather than compare individual groups to a reference.
| Coding Scheme | R Function | Intercept | Each β represents | Best for |
|---|---|---|---|---|
| Treatment (Reference) | contr.treatment (default) | Reference group mean | Difference from reference group | Group comparisons to baseline |
| Effect (Deviation) | contr.sum | Grand mean | Deviation from grand mean | ANOVA-style analyses |
| Polynomial (Ordinal) | contr.poly | Grand mean | Linear/quadratic/cubic trend | Ordinal variables with ordered levels |
Guidelines for choosing the reference group:
as.factor() Is RequiredIf a categorical variable is stored as numeric in your data (e.g.,
coded 0, 1, 2, 3), R will treat it as continuous by default. You
must use as.factor() or
factor() to tell R it is categorical:
# WRONG: R treats educ_numeric as continuous
mod_wrong <- lm(menthlth_days ~ educ_numeric, data = brfss_dv)
# RIGHT: Convert to factor first
mod_right <- lm(menthlth_days ~ factor(educ_numeric), data = brfss_dv)
# Compare: 1 coefficient (wrong) vs. 3 coefficients (right)
tribble(
~Model, ~`Number of education coefficients`, ~`Degrees of freedom used`,
"Numeric (wrong)", 1, 1,
"Factor (correct)", 3, 3
) |>
kable(caption = "Numeric vs. Factor Treatment of Categorical Variables") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | Number of education coefficients | Degrees of freedom used |
|---|---|---|
| Numeric (wrong) | 1 | 1 |
| Factor (correct) | 3 | 3 |
What if we want to compare HS graduates to college graduates, but neither is the reference group? We have two options:
Option 1: Change the reference group with
relevel().
Option 2: Compute the difference manually from the model output.
# Difference between HS graduate and College graduate
# = β_HS_grad - β_College_grad
diff_est <- coef(mod_educ)["educationHS graduate"] - coef(mod_educ)["educationCollege graduate"]
# Use linearHypothesis() for a formal test with SE and p-value
lin_test <- linearHypothesis(mod_educ, "educationHS graduate - educationCollege graduate = 0")
cat("Estimated difference (HS grad - College grad):", round(diff_est, 3), "days\n")## Estimated difference (HS grad - College grad): 0.556 days
## F-statistic: 4.66
## p-value: 0.0309
car::linearHypothesis()is a powerful function for testing any linear combination of coefficients, not just comparisons to the reference group.
| Concept | Key Point |
|---|---|
| Categorical predictors | Cannot be included as raw numeric codes in regression |
| Dummy variables | Binary (0/1) indicators; need \(k - 1\) for \(k\) categories |
| Reference group | The omitted category; all comparisons are relative to it |
| Changing reference | Use relevel(); predictions unchanged, interpretation
changes |
| Partial F-test | Tests whether the categorical variable as a whole is significant |
| Dummy variable trap | Including \(k\) dummies + intercept = perfect multicollinearity |
as.factor() |
Required when categorical variable is stored as numeric |
| Coding schemes | Treatment (default), effect, polynomial — each answers a different question |
| Type III ANOVA | Preferred for unbalanced observational data |
| Linear hypothesis | linearHypothesis() tests comparisons between
non-reference groups |
EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
brfss_dv <- readRDS(
"/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_dv_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
# Create a descriptive statistics table with means and n
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital Status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
# Create boxplot for mentlth_days by gen_health
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")The group that reports the most mentally unhealthy days are those who have a poor general health status. The group that reports the least mentally unhealthy days are those who have an excellent general health status. This pattern is consistent with what I would expect because poor health for me includes mental, physical, and social health meaning that they would likely have more mentally unhealthy days since they have a lower overall health status.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
# Calculate mean mental health days
mean_mtl_hlt <- brfss_dv |>
group_by(marital_status) |>
summarize(mean_mtl_hlt = mean(menthlth_days, na.rm = TRUE))
# Create grouped bar chart
ggplot(mean_mtl_hlt, aes(x = marital_status, y = mean_mtl_hlt, fill = marital_status)) +
geom_col(alpha = 0.85) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(mean_mtl_hlt, 2)), vjust = -0.3) +
scale_fill_brewer(palette = "Greens") +
labs(
title = "Mean Mentally Unhealthy Days by Marital Status",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days Over Past 30 Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")The marital status group with the highest mean were separated couples with 6.22 mean mentally unhealthy days. The marital status group with the lowest mean were widowed individuals with 2.67 mean mentally unhealthy days. —
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
# Create a new dataset with gen_health as a numeric code
brfss_dv_naive <- brfss_dv |>
mutate(
gen_health_num = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
))
# Run a model with this new numeric version
naive_mod <- lm(menthlth_days ~ gen_health_num, data = brfss_dv_naive)
# Extract and create a table with values from the model
tidy(naive_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Naive Model: Gen Health Treated as Continuous",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| gen_health_num | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
According to the naive version, the coefficient using gen_health_num is 1.8578. Each additional unit of general health is associated with an increase of 1.8578 mentally unhealthy days.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
# Fit linear model
factor_mod <- lm(menthlth_days ~ gen_health, data = brfss_dv_naive)
# Create tidy table
tidy(factor_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Gen Health Treated as Factor",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
For the naive model, there is only one coefficient because the model is assessing general health as a numeric variable. This makes the model believe that the difference in expected Y is the same from one category to another (fair to poor or excellent to very good). This is an issue because the distance between code 1 and code 2 has no true meaning and is unjustified. The factor model instead provides four coefficients (each one compared to a reference group of excellent). The factor model creates dummy categories with one reference group. Which in this case, as we look at the coefficient, we see that very good has the lowest increase in mentally unhealthy days and poor has the highest increase. The numeric or naive way is misleading because it lumps all of these levels together because it assumes the distance between them is the same causing a confusing and invalid explanation of the variable. —
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
# Fit full linear model
full_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv_naive)
# Create tidy table
tidy(full_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Regression with Age, Sex, Physical Health, Sleep Hours, and General Health (Reference: Excellent)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
The model is:
\[\widehat{\text{Mental Health Days}} = 9.5930 + -0.0867(\text{Age}) + 1.7254 (\text{Female}) + 0.2314 (\text{Phys Days}) + -0.5866 (\text{Sleep}) + 0.7899 (\text{Very good}) + 1.8436(\text{Good}) + 3.3953(\text{Fair}) + 5.3353(\text{Poor})\]
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.12 0.233 9.08 1.55e-19 1.66 2.57
## 2 gen_healthVery good 0.590 0.294 2.01 4.48e- 2 0.0137 1.17
## 3 gen_healthGood 1.95 0.308 6.34 2.54e-10 1.35 2.56
## 4 gen_healthFair 5.06 0.406 12.5 4.23e-35 4.27 5.86
## 5 gen_healthPoor 9.66 0.609 15.9 2.34e-55 8.47 10.9
# Remove the intercept so only dummy variables are displayed
model_data_no_intercept <- model_data |>
filter(term != "(Intercept)")
# Create a coefficient plot with confidence intervals
ggplot(model_data_no_intercept, aes(x = estimate, y = term)) +
geom_point(size = 3) +
geom_errorbar(
aes(xmin = conf.low, xmax = conf.high),
width = 0.2,
orientation = "y"
) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
labs(
title = "Estimated Coefficients and 95% Confidence Intervals",
x = "Estimate",
y = "Dummy Variable"
) +
theme_classic(base_size = 14)The group that differs most from the reference group is those who are in the gen_health poor group. —
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
# Change reference group to Good
brfss_dv_naive$gen_health_reref <- relevel(brfss_dv_naive$gen_health, ref = "Good")
# Refit the model with the relevel
full_mod_relevel <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref, data = brfss_dv_naive)
# Create tidy table
tidy(full_mod_relevel, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
# Make table to show education and other continuous variables coefficients
tribble(
~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
"Intercept", round(coef(full_mod)[1], 3), round(coef(full_mod_relevel)[1],3),
"Age coefficient", round(coef(full_mod)[2], 3), round(coef(full_mod_relevel)[2],3),
"Sex coefficient", round(coef(full_mod)[3], 3), round(coef(full_mod_relevel)[3],3),
"Physical health days",round(coef(full_mod)[4], 3), round(coef(full_mod_relevel)[4],3),
"Sleep hours", round(coef(full_mod)[5], 3), round(coef(full_mod_relevel)[5],3),
"R-squared", round(summary(full_mod)$r.squared, 4), round(summary(full_mod_relevel)$r.squared, 4),
"Residual SE", round(summary(full_mod)$sigma, 3), round(summary(full_mod_relevel)$sigma, 3),
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Excellent | Ref: Good |
|---|---|---|
| Intercept | 9.5930 | 11.4370 |
| Age coefficient | -0.0870 | -0.0870 |
| Sex coefficient | 1.7250 | 1.7250 |
| Physical health days | 0.2310 | 0.2310 |
| Sleep hours | -0.5870 | -0.5870 |
| R-squared | 0.1694 | 0.1694 |
| Residual SE | 7.2080 | 7.2080 |
The other variable coefficients between the two models are the same. The only differences are within the general health and intercept because we are using a different reference group. By using a different reference group, these coefficients now represent comparisons to the new reference group. The other coefficients for age, sex, physical health, and sleep hours will stay the same.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
# Verify that predicted values are identical
pred_orig <- predict(full_mod)
pred_reref <- predict(full_mod_relevel)
# Create tibble table
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
The predicted values from both models are identical as demonstrated by a 0 maximum difference and a correlation of 1 between the predictions. This is because changing the reference group will not change the model’s fit or predictions. It does change the interpretation of the dummy variable coefficients. The model predictions are not any different because the releveling only changes the baseline instead of altering the relationship that was already defined. —
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
# Create reduced model
reduced_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv_naive)
# Create tidy table for reduced model
tidy(reduced_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Reduced Regression with Age, Sex, Physical Health, Sleep Hours",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.5041 | 0.6124 | 17.1531 | 0 | 9.3036 | 11.7046 |
| age | -0.0773 | 0.0060 | -12.9721 | 0 | -0.0890 | -0.0656 |
| sexFemale | 1.6864 | 0.2074 | 8.1296 | 0 | 1.2797 | 2.0931 |
| physhlth_days | 0.3172 | 0.0132 | 24.0152 | 0 | 0.2913 | 0.3431 |
| sleep_hrs | -0.6330 | 0.0772 | -8.2029 | 0 | -0.7843 | -0.4817 |
# Create table with R² and adjusted R² for full model
glance(full_mod) |>
select(r.squared, adj.r.squared) |>
pivot_longer(everything(), names_to = "Metric", values_to = "Value") |>
mutate(
Value = round(Value, 4),
Metric = dplyr::recode(Metric,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²"
)
) |>
kable(caption = "Full Model Summary") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.1694 |
| Adjusted R² | 0.1681 |
# Create table with R² and adjusted R² for reduced model
glance(reduced_mod) |>
select(r.squared, adj.r.squared) |>
pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
mutate(
Value = round(Value, 4),
Metric = dplyr::recode(Metric,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²"
)
) |>
kable(caption = "Reduced Model Summary") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.1522 |
| Adjusted R² | 0.1515 |
From the full model, R² is 0.1694 and adjusted R² is 0.1681. From the reduced model, R² is 0.1522 and adjusted R² is 0.1515.
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
# Compare using anova() and perform the partial F-test
anova(reduced_mod, full_mod) |>
as.data.frame() |>
rownames_to_column("Model") |>
mutate(
Model = c("Reduced (no general health)", "Full (with general health)"),
across(where(is.numeric), ~ round(., 4))
) |>
kable(
caption = "Table 6. Partial F-Test: Does gen_health add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced (no general health) | 4995 | 264715.2 | NA | NA | NA | NA |
| Full (with general health) | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
\[H_0: \beta_{\text{General Health}} = 0\] \[H_A: \beta_{\text{General Health}}\neq 0\] The f-statistic value is 25.8838, with 4991 degrees of freedom, and the p-value is less than .001. Based on this f-statistic and p-value, we would reject the null hypothesis that general health is not associated with mentally unhealthy days. We would turn to the alternative hypothesis that there is an association between general health and mentally unhealthy days and determine what evidence there is.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
# Type III using car::Anova()
anova_type3 <- Anova(full_mod, type = "III")
# Partial sum of squares Type III
anova_type3 |>
as.data.frame() |>
rownames_to_column("Source") |>
mutate(across(where(is.numeric), ~ round(., 4))) |>
kable(
caption = "Type III (Partial) Sums of Squares — car::Anova()",
col.names = c("Source", "Sum of Sq", "df", "F value", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Sum of Sq | df | F value | p-value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
# Partial F-Tests using Type III
Anova(full_mod, type = "III") |>
as.data.frame() |>
rownames_to_column("Source") |>
mutate(
Significant = ifelse(`Pr(>F)` < 0.05, "Yes", "No"),
across(where(is.numeric), ~ round(., 4))
) |>
kable(
caption = "Type III Partial F-Tests for Each Predictor",
col.names = c("Source", "Sum of Sq", "df", "F value", "p-value", "Significant (α = 0.05)")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(6, bold = TRUE)| Source | Sum of Sq | df | F value | p-value | Significant (α = 0.05) |
|---|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 | Yes |
| age | 10907.874 | 1 | 209.9258 | 0 | Yes |
| sex | 3663.847 | 1 | 70.5120 | 0 | Yes |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 | Yes |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 | Yes |
| gen_health | 5379.751 | 4 | 25.8838 | 0 | Yes |
| Residuals | 259335.435 | 4991 | NA | NA | NA |
Yes, the result from type III partial F-test is consistent with my partial F-test. This is because they are both determining if the predictor (gen_health) is significant with the same nested models. Both the f-statistic (25.8839) and p-value (<.001) are the same in both tables. —
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
The general health groups that differ most form the reference are those who identify with poor and fair general health status. Compared to excellent health, poor health status individuals have an estimated 5.3353 more mentally unhealthy days, holding age, sex, physical health days, and sleep constant while compared to excellent health, fair health status individuals have an estimated 3.3953 more mentally unhealthy days, holding age, sex, physical health days, and sleep constant. It is important to note that the data employed in this study is cross-sectional so the data may not be fully independent and does not include weights meaning it may be unrepresentative.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison. —
I would say say that after completing this lab, it is suggested that the general health model appears to be more strongly associated with mental health days than the education model. College graduates with a reference of less than HS only suggested a negative association with mentally unhealthy days of 1.14 when holding age, sex, physical health days, and sleep constant whereas poor general health with a reference of excellent health suggested a positive association of 5.34 more mentally unhealthy days when holding age, sex, physical health days, and sleep constant. I would decide which one to include in my model by employing the calculation of R² and adjusted R² because the adjusted R² will inform whether or not a predictor adds to the model since R² simply increase as more predictors are added. I would also consult p-values to determine statistical significant of certain predictors.
End of Lab Activity