In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library(dplyr)
brfss_dv <- readRDS(
"C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/brfss_dv_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, marital_status, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
marital_status ~ "Marital Status",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Marital Status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
The group that reports the most mentally unhealthy days is the “poor” group. The pattern does appear what I would expect.
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
The marital status group with the highest mean is the “separated” group with a mean of 6.2. The marital status group with the lowest mean is the “widowed” group with a mean of 2.7.
mean_mental <- brfss_dv |>
group_by(marital_status) |>
summarise(mean_menthlth = mean(menthlth_days, na.rm = TRUE))
ggplot(mean_mental, aes(x = marital_status, y = mean_menthlth, fill = marital_status)) +
geom_col(alpha = 0.85) +
geom_text(aes(label = round(mean_menthlth, 1)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mean Mentally Unhealthy Days by Marital Status",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
Interpretation: The coefficient for gen_health_num is 1.8578. For each worsening level of general health, the number of mentally unhealthy days increases by about 1.86 days on average.
brfss_new <- brfss_dv |>
mutate(gen_health_num = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5
))
model <- lm(menthlth_days ~ gen_health_num, data = brfss_new)
summary(model)##
## Call:
## lm(formula = menthlth_days ~ gen_health_num, data = brfss_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6718 0.2705 -2.484 0.013 *
## gen_health_num 1.8578 0.1036 17.926 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared: 0.06041, Adjusted R-squared: 0.06022
## F-statistic: 321.3 on 1 and 4998 DF, p-value: < 2.2e-16
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
Gen_health has 5 levels but when treated as a factor, dummy variables are created for all levels expect the reference group (excellent health). Each level represents the difference in mentally unhealthy days between that group and the reference group, holding all other variables constant. The numeric model assumes that moving from excellent to poor represents equal linear steps but in reality the differences between the categories may not be evenly spaced and forcing linear assumption can distort the true relationship.
mod_gen_health <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
tidy(mod_gen_health, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with gen_health Dummy Variables (Reference: Excellent)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
Menthlth_days = 9.592982-0.086672(age)+1.725379(sexFemale)+0.231420(physhlth_days)-0.586595(sleep_hrs)+0.789947(gen_health = Very good)+1.843601(gen_health = Good)+3.395283(gen+health= fair)+5.335347(gen_health=poor)
model_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
summary(model_full)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.592982 0.630441 15.216 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthVery good 0.789947 0.279661 2.825 0.00475 **
## gen_healthGood 1.843601 0.297260 6.202 6.03e-10 ***
## gen_healthFair 3.395283 0.417964 8.123 5.66e-16 ***
## gen_healthPoor 5.335347 0.682949 7.812 6.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
The reference group for gen_health is “Excellent” health so all coefficients are interpreted relative to individuals in excellent health, holding all other variables constant.
gen_healthVery good 0.789947 = Individuals who reported very good health have on average 0.79 more mentally unhealthy days compared to those in excellent health, holding all other variables constant. gen_healthGood 1.843601 = Individuals who reported good health have on average 1.84 more mentally unhealthy days compared to those in excellent health, holding all other variables constant. gen_healthFair 3.395283 = Individuals who reported fair health have on average 3.395 more mentally unhealthy days compared to those in excellent health, holding all other variables constant. gen_healthPoor 5.335347 = Individuals who reported poor health have on average 5.335 more mentally unhealthy days compared to those in excellent health, holding all other variables constant.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
The gen_health poor group differs the most from the reference group.
mod_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)
coef_table <- tidy(mod_full, conf.int = TRUE) |>
filter(grepl("gen_health", term))
ggplot(coef_table, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_pointrange(color = "steelblue", size = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
coord_flip() + # Horizontal layout
labs(
title = "Estimated Coefficients for gen_health Dummy Variables",
x = "gen_health Level (vs. Reference: Excellent)",
y = "Estimated Effect on Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
brfss_dv$gen_health_reref <- relevel(brfss_dv$gen_health, ref = "Good")
mod_gen <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv
)
mod_gen_reref <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref,
data = brfss_dv
)
tidy(mod_gen_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with gen_health Releveled (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
The continuous variable coefficients remain the same because changing the reference group only affects the intercept and the dummy variable for gen_health. The intercept changes and the dummy variable is now relative to the new reference group “good.”
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
Changing the reference group does not change how it predicts it only changes how the model is written. The relationship between the groups stays the same. The model is just recalculating the coefficients so that differences are now measured relative to the new reference group.
pred_orig <- predict(mod_gen)
pred_reref <- predict(mod_gen_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical Across Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
mod_full <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
tribble(
~Model, ~Predictors, ~R2, ~Adj_R2,
"Reduced (no gen_health)", 4, round(summary(mod_reduced)$r.squared, 4), round(summary(mod_reduced)$adj.r.squared, 4),
"Full (+ gen_health)", 5, round(summary(mod_full)$r.squared, 4), round(summary(mod_full)$adj.r.squared, 4)
) |>
kable(caption = "R2 and Adjusted R2 for Reduced and Full Models") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
row_spec(2, bold = TRUE, background = "#EBF5FB")| Model | Predictors | R2 | Adj_R2 |
|---|---|---|---|
| Reduced (no gen_health) | 4 | 0.1522 | 0.1515 |
| Full (+ gen_health) | 5 | 0.1694 | 0.1681 |
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
# Reduced model (no gen_health)
mod_reduced <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
data = brfss_dv
)
# Full model including gen_health (from Task 3a)
mod_full <- lm(
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv
)
# Partial F-test: Does gen_health improve the model?
f_test <- anova(mod_reduced, mod_full)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does gen_health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
Null Hypotheses: All coefficients for gen_health are equal to zero.
Alternative Hypotheses: At least one coefficient for gen_health doesn’t equal 0.
F-Statistic: 25.884 Df= 4 p-value=<0.001
Adding gen_health significantly improves the models ability to predict mentally unhealthy days, even after accounting for other confounding factors.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(mod_full, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
The results are consistent. Both show that by adding gen_health to the model improves the fit significantly even after controlling for confounders.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
People who rate their general health as poor tend to report more mentally unhealthy days in the past month compared to those who say their health is excellent. Individuals who report poor mental health have about 5 more mentally unhealthy days compared to those who report their health as being fair with only about 3 mentally unhealthy days. Individuals that report having good and very good health report slightly more mentally unhealthy days than those in the excellent group. These findings suggest that there is a strong link between how people perceive their own overall health and their mental well being. Its important to note that this data is from a cross-sectional study, so we cannot say that one causes the other to happen.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
The general health model appears to be more strongly associated with mental health days than education.The average number of mentally unhealthy days across general health categories are larger compared to education, for example individuals reporting poor mental health had more mentally unhealthy days compared to those who reported excellent health, while for education the differences were smaller. When building a final model, I would compare R^2 and adjusted R^2 and look at f-test to see if the model is improved by adding each predictor to the model. Since general health produces a stronger relationship with mentally unhealthy days, I would prioritize including general health over education.