EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv <- readRDS(
"C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 9/brfss_dv_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample, For Specific Variables (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital Status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
1b Ans
The boxplot shows that people in the “Poor” general health group report the highest number of mentally unhealthy days, with a much higher median and wider spread compared to other groups. As general health moves from “Excellent” to “Poor,” the number of mentally unhealthy days steadily increases. This pattern is consistent with what we would expect, since individuals with poorer physical health are more likely to experience stress, illness, and emotional difficulties.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
1c Ans The separated group has the highest mean number of mentally unhealthy days, while the widowed group has the lowest mean.
mean_table <- brfss_dv %>%
group_by(marital_status) %>%
summarise(
mean_menthlth_days = mean(menthlth_days, na.rm = TRUE)
)
mean_table## # A tibble: 6 × 2
## marital_status mean_menthlth_days
## <fct> <dbl>
## 1 Married 3.10
## 2 Divorced 4.49
## 3 Widowed 2.67
## 4 Separated 6.22
## 5 Never married 5.28
## 6 Unmarried couple 6.07
ggplot(mean_table, aes(x = marital_status, y = mean_menthlth_days)) +
geom_bar(stat = "identity", fill = "gold") +
labs(
title = "Mean Mentally Unhealthy Days by Marital Status",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
21 Ans
The coefficient for gen_health_numeric is 1.8578. This means that for every one-category worsening in general health (for example, from “Good” to “Fair”), the number of mentally unhealthy days increases by approximately 1.86 days on average. The result is statistically significant (p < 0.001), indicating a strong positive association between poorer general health and more mentally unhealthy days.
# Simple regression model
brfss_dv1 <- brfss_dv %>%
mutate(
gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_
)
)
model_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv1)
summary(model_naive)##
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6718 0.2705 -2.484 0.013 *
## gen_health_numeric 1.8578 0.1036 17.926 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared: 0.06041, Adjusted R-squared: 0.06022
## F-statistic: 321.3 on 1 and 4998 DF, p-value: < 2.2e-16
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
2b ANs
The factor model uses 4 coefficients because there are 5 categories, and one (“Excellent”) is the reference. Each coefficient shows how much higher the mentally unhealthy days are compared to “Excellent.” For example, people with Poor health have about 9.66 more days.
The numeric model assumes each step increases by the same amount (1.86), but the factor model shows the increases are not equal. So, the numeric approach can be misleading because it oversimplifies the relationship.
# Fit model with gen_health as a factor
model_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv1)
summary(model_factor)##
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7814 -4.0708 -2.7077 -0.1174 27.8826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1174 0.2332 9.079 < 2e-16 ***
## gen_healthVery good 0.5903 0.2941 2.007 0.0448 *
## gen_healthGood 1.9535 0.3082 6.337 2.54e-10 ***
## gen_healthFair 5.0624 0.4064 12.457 < 2e-16 ***
## gen_healthPoor 9.6640 0.6090 15.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared: 0.07334, Adjusted R-squared: 0.07259
## F-statistic: 98.83 on 4 and 4995 DF, p-value: < 2.2e-16
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
`menthlth_days^ = B0 +B1(age) + B2(sex) + B3(physhlth_days) + B4(sleep_hrs) + B5(Very good) + B6(Good) + B7(fair) + B8(Poor)`
Here, "Excellent" is the reference group
mod_gen <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv1)
tidy(mod_gen, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with General Health Dummy Variables (Reference: Excellent)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
3b Ans In this model, “Excellent” health is the reference group. Compared to this group, individuals with Very good health report about 0.79 more mentally unhealthy days, those with Good health report about 1.84 more days, those with Fair health report about 3.40 more days, and those with Poor health report about 5.34 more days. This shows a clear pattern where mentally unhealthy days increase as general health worsens, holding all other variables constant.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
3c Ans The “Poor” general health group differs the most from the reference group (“Excellent”), as it has the largest coefficient (farthest from zero).
coef_df <- tidy(mod_gen, conf.int = TRUE) %>%
filter(grepl("gen_health", term))
coef_df$term <- gsub("gen_health", "", coef_df$term)
ggplot(coef_df, aes(x = estimate, y = term)) +
geom_point(color = "purple") +
geom_errorbar(
aes(xmin = conf.low, xmax = conf.high),
width = 0.2,
orientation = "y"
) +
labs(
title = "Effect of General Health on Mentally Unhealthy Days",
x = "Coefficient (Estimate)",
y = "General Health Category"
) +
theme_minimal()4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
#Change reference group to "Good"
brfss_dv1$gen_health <- relevel(brfss_dv1$gen_health, ref = "Good")
mod_gen_ref_good <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv1)
tidy(mod_gen_ref_good, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_healthExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_healthVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_healthFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_healthPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.436584 0.629825 18.158 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthExcellent -1.843601 0.297260 -6.202 6.03e-10 ***
## gen_healthVery good -1.053654 0.258126 -4.082 4.54e-05 ***
## gen_healthFair 1.551682 0.386128 4.019 5.94e-05 ***
## gen_healthPoor 3.491746 0.650560 5.367 8.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
4b Ans
The coefficients for age, sex, physically unhealthy days, and sleep hours are the same in both models. This is because changing the reference group only affects the gen_health categories, not the other variables. The data and model remain the same, so the relationship between these continuous variables and the outcome does not change.
tribble(
~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
"Intercept", round(coef(mod_gen)[1], 3), round(coef(mod_gen_ref_good)[1], 3),
"Age coefficient", round(coef(mod_gen)[2], 3), round(coef(mod_gen_ref_good)[2], 3),
"Sex coefficient", round(coef(mod_gen)[3], 3), round(coef(mod_gen_ref_good)[3], 3),
"Physical health days", round(coef(mod_gen)[4], 3), round(coef(mod_gen_ref_good)[4], 3),
"Sleep hours", round(coef(mod_gen)[5], 3), round(coef(mod_gen_ref_good)[5], 3),
"R-squared", round(summary(mod_gen)$r.squared, 4), round(summary(mod_gen_ref_good)$r.squared, 4),
"Residual SE", round(summary(mod_gen)$sigma, 3), round(summary(mod_gen_ref_good)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Excellent | Ref: Good |
|---|---|---|
| Intercept | 9.5930 | 11.4370 |
| Age coefficient | -0.0870 | -0.0870 |
| Sex coefficient | 1.7250 | 1.7250 |
| Physical health days | 0.2310 | 0.2310 |
| Sleep hours | -0.5870 | -0.5870 |
| R-squared | 0.1694 | 0.1694 |
| Residual SE | 7.2080 | 7.2080 |
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
4c Ans The predicted values from both models are identical, as shown by a correlation of 1 and a maximum difference of 0. This happens because changing the reference group only changes how the categorical variable (gen_health) is coded, not the actual information in the data. The model is just re-expressing the same relationships in a different way, so the final predicted values remain exactly the same.
# Verify that predicted values are identical
pred_orig <- predict(mod_gen)
pred_reref <- predict(mod_gen_ref_good)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
5a Ans
The reduced model has an R2 of approximately 0.153 and an adjusted R2 of approximately 0.152. The full model has an R2 of 0.169 and an adjusted R2 of 0.168.
# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv1)
summary(mod_reduced)##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs,
## data = brfss_dv1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.314 -3.520 -1.805 0.159 32.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.504082 0.612374 17.153 < 2e-16 ***
## age -0.077335 0.005962 -12.972 < 2e-16 ***
## sexFemale 1.686413 0.207440 8.130 5.38e-16 ***
## physhlth_days 0.317247 0.013210 24.015 < 2e-16 ***
## sleep_hrs -0.633026 0.077171 -8.203 2.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1515
## F-statistic: 224.2 on 4 and 4995 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = brfss_dv1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.592982 0.630441 15.216 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthVery good 0.789947 0.279661 2.825 0.00475 **
## gen_healthGood 1.843601 0.297260 6.202 6.03e-10 ***
## gen_healthFair 3.395283 0.417964 8.123 5.66e-16 ***
## gen_healthPoor 5.335347 0.682949 7.812 6.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
5b Ans The null hypothesis is that all coefficients for gen_health are equal to zero (gen_health does not improve the model), while the alternative hypothesis is that at least one of the coefficients is not zero. The partial F-test gives an F-statistic of 25.88 with 4 and 4991 degrees of freedom, and a p-value less than 0.001. Since the p-value is very small, we reject the null hypothesis. This means that general health as a whole significantly improves the model and contributes to explaining variation in mentally unhealthy days.
# Partial F-test
f_test <- anova(mod_reduced, mod_gen)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
5c Ans The results are consistent. The Type III ANOVA shows an F-statistic of 25.88 for gen_health with a p-value less than 0.001, which is exactly the same as the partial F-test. This confirms that gen_health as a group significantly improves the model, and both methods lead to the same conclusion.
Anova(mod_gen, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
6a Ans
People who reported poorer general health tended to have more mentally unhealthy days compared to those in excellent health. Those in poor health had about 5.3 more days, those in fair health had about 3.4 more days, and those in good health had about 1.8 more days, while those in very good health had about 0.8 more days. This shows a clear pattern where mental health worsens as general health declines. However, since the data was collected at one point in time, we cannot say that surely poor health causes worse mental health, only that they are related.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
6b Ans General health appears to be more strongly associated with mentally unhealthy days than education. The differences across general health groups are much larger (for example, up to about 5.3 more days for those in poor health), while the education groups show smaller differences, with only college graduates having about 1.1 fewer days compared to the reference group. This suggests that general health has a stronger relationship with mental health outcomes. If building a final model, I would prioritize including general health, but I would also consider keeping education if it is important for controlling background differences or potential confounding.
End of Lab Activity