Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
brfss_dv <- readRDS("data/brfss_dv_2020.rds")1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by General Health
Those with poor general health reported the most mentally unhealthy days. There is a clear positive association between worsening health and the number of mentally unhealthy days, which is expected.
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
brfss_dv |>
group_by(marital_status) |> mutate(percent = mean(menthlth_days)) |>
mutate(percent = round(percent, digits = 2)) |>
ggplot(aes(x = marital_status, y = percent, fill = menthlth_days)) +
geom_bar(position="dodge", stat="identity") +
geom_text(stat = "identity", aes(label = after_stat(y)), vjust = -0.3) +
labs(
title = "Distribution of Education Level",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Education Level",
y = "Mean Days with Poor Mental Health"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Distribution of Poor Mental Health Days by Marital Status in Analytic Sample
desc_table <- brfss_dv %>%
select(menthlth_days, marital_status) %>%
group_by(marital_status) %>%
summarise(
N = n(),
`Mean Mental Health Days` = round(mean(menthlth_days),2))
desc_table %>%
kable(caption = "Descriptive Statistics by Marital Status",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| marital_status | N | Mean Mental Health Days |
|---|---|---|
| Married | 2708 | 3.10 |
| Divorced | 622 | 4.49 |
| Widowed | 534 | 2.67 |
| Separated | 109 | 6.22 |
| Never married | 848 | 5.28 |
| Unmarried couple | 179 | 6.07 |
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
brfss_dv_2 <- brfss_dv |>
mutate(
gen_health_num = factor(case_when(
gen_health == "Poor" ~ 1,
gen_health == "Fair" ~2,
gen_health == "Good" ~3,
gen_health == "Very good" ~ 4,
gen_health == "Excellent" ~ 5), levels = c(1,2,3,4,5)))
class(brfss_dv_2$gen_health_num)## [1] "factor"
brfss_dv_2$gen_health_num <- as.numeric(brfss_dv_2$gen_health_num)
model_1 <- lm(menthlth_days ~ gen_health_num, data = brfss_dv_2)
tidy(model_1, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.4751 | 0.3894 | 26.9013 | 0 | 9.7118 | 11.2385 |
| gen_health_num | -1.8578 | 0.1036 | -17.9259 | 0 | -2.0610 | -1.6547 |
The coefficient for general health is -1.86. This means that a one unit decrease in general health, meaning worsening health, is associated with 1.86 increased number of mentally unhealthy days.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
model_2 <- lm(menthlth_days ~ gen_health, data = brfss_dv_2)
tidy(model_2, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
model_3 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv_2)
tidy(model_3, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model with Dichotomous Dummy Variable: General Health",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
\[\text{menthlth_days} = 9.59 + -0.09
\cdot \text{age} + 1.73 \cdot \text{sex} + 0.23 \cdot
\text{physhlth_days} + 0.79 \cdot \text{genhlth_verygood} + 1.84 \cdot
\text{genhlth_good} + 3.40 \cdot \text{genhlth_fair} + 5.34 \cdot
\text{genhlth_poor}+ \varepsilon\] 3b. (10 pts)
Interpret every dummy variable coefficient for
gen_health in plain language. Be specific about the
reference group, the direction and magnitude of each comparison, and
include the phrase “holding all other variables constant.”
Those with very good general health had 0.79 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with good general health had 1.84 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with fair general health had 3.40 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with poor general health had 5.34 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
pred_educ <- ggpredict(model_3, terms = c("age [20:80]", "gen_health"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Mental Health Days by Age and General Health",
subtitle = "Parallel lines: same slopes for age, different intercepts by general health",
x = "Age (years)",
y = "Predicted Mentally Unhealthy Days",
color = "gen_health",
fill = "gen_health"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")
Those with poor general health differ the most from the reference group,
those with excellent general health.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
brfss_dv_2$gen_health_reref <- relevel(brfss_dv_2$gen_health, ref = "Good")
model_4 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref, data = brfss_dv_2)
tidy(model_4, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good general health)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| gen_health_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| gen_health_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| gen_health_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| gen_health_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not? Yes, the other continuous variable coefficients are the same. This is because they are not using general health as a reference group.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
pred_orig <- predict(model_3)
pred_reref <- predict(model_4)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
model_5 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv_2)
r_sq <- summary(model_5)$r.squared
adj_r_sq <- summary(model_5)$adj.r.squared
r_sq_full <- summary(model_3)$r.squared
adj_r_sq_full <- summary(model_3)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Partial_Model = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")),
Full_Model = c(
round(r_sq_full, 4),
round(adj_r_sq_full, 4),
paste0(round(r_sq_full * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Partial_Model | Full_Model |
|---|---|---|
| R² | 0.1522 | 0.1694 |
| Adjusted R² | 0.1515 | 0.1681 |
| Variance Explained | 15.22% | 16.94% |
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
f_test <- anova(model_5, model_3)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
The null hypothesis is that general health will not imporve the model, considering the other covariates are already in the model, whereas the alternative hypothesis is that it will improve the model. The f-statistic of the full model is 25.88, with 4 degrees of freedom and a p-value <0.05. We can conclude that there is a significant improvement to the full model and can reject the null hypothesis.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(model_3, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
This analysis showed that general health is directly associated with mental health. The model uses cross-sectional data, meaning there is no timeline, so we cannot determine which variable is impacting or causing the other and we can only state that they have an association. Those with the best general health had the fewest days with poor mental health, whereas those with the worst general health had the most days with poor mental health.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
The analyses showed associations between education and general health with the outcome of interest, days with poor mental health. General health had a stronger association with mental health, as evidenced by the impact that changes to the variables had on the mean days with poor mental health. If I were building a final model, I would include education as well as general health and the other demographic factors to build the strongest model possible.