EPI 553 — Dummy Variables Lab Due: End of class, March 26, 2026 —
In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)
library (flextable)
brfss_dv <- readRDS(
"C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_dv_2020.rds")
# Data check
cat("Rows:", nrow(brfss_dv), "\n")## Rows: 5000
## Columns: 9
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, age, sex, gen_health, marital_status) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
gen_health ~ "General health status",
marital_status ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")|>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
Marital Status | 5,000 | |
Married | 2,708 (54%) | |
Divorced | 622 (12%) | |
Widowed | 534 (11%) | |
Separated | 109 (2.2%) | |
Never married | 848 (17%) | |
Unmarried couple | 179 (3.6%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General Health Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General Health Status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mental Health Days by General Health Status
Interpretation: Poor Health Status group reports the most mentally unhealthy days (~27 days).
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
#calculate mean of mentally unhealthy days by marital status
marital_summary <- brfss_dv %>%
group_by(marital_status) %>%
summarise(mean_days = mean(menthlth_days, na.rm = TRUE),
n= n ()
)%>%
arrange(desc(mean_days)) # Sort by highest mean
# display as table
marital_summary %>%
mutate (mean_days = round (mean_days, 2)) %>%
kable (caption = "table 2: Mean Mentally Unhealthy days by Marital Status",
col.names = c("marital Status", "Mean Days", "n")
)%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE
)| marital Status | Mean Days | n |
|---|---|---|
| Separated | 6.22 | 109 |
| Unmarried couple | 6.07 | 179 |
| Never married | 5.28 | 848 |
| Divorced | 4.49 | 622 |
| Married | 3.10 | 2708 |
| Widowed | 2.67 | 534 |
#create bar chart for visual
ggplot(marital_summary, aes(x = marital_status, y = mean_days, fill = marital_status)) +
geom_col (alpha = 0.85) +
geom_text(aes(label = round (mean_days, 1)), vjust = -0.3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Distribution of Marital Status",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Marital Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Mean of Marital Status in Analytic Sample
Interpretation: Separated status group has the highest mean (6.2 days) and widowed have the lowest (2.7 days). —
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
# Create numeric version of gen_health
brfss_dv <- brfss_dv %>%
mutate(
gen_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5,
TRUE ~ NA_real_
)
)
# Fit model treating health as numeric
model_naive <- lm(menthlth_days ~ gen_numeric, data = brfss_dv)
# Display results
tidy(model_naive, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 3. Naive Model (Gen Health as Numeric)",
col.names = c("Term", "Estimate", "Std Error", "t-stat", "p-value", "95% CI Low", "95% CI High")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std Error | t-stat | p-value | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|---|
| (Intercept) | -0.6718 | 0.2705 | -2.4840 | 0.013 | -1.2021 | -0.1416 |
| gen_numeric | 1.8578 | 0.1036 | 17.9259 | 0.000 | 1.6547 | 2.0610 |
Interpretation: The coefficient for the gen_numeric is about 1.8 ( P <0.001), which means that for each one-unit increase in the general health status, mentally unhealthy days increase by about 1.8 days on average.
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
# Fit model treating health as numeric
model_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv)
# Display results
tidy(model_factor, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 4. Factor Model (Gen Health as Factor)",
col.names = c("Term", "Estimate", "Std Error", "t-stat", "p-value", "95% CI Low", "95% CI High")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std Error | t-stat | p-value | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|---|
| (Intercept) | 2.1174 | 0.2332 | 9.0790 | 0.0000 | 1.6602 | 2.5746 |
| gen_healthVery good | 0.5903 | 0.2941 | 2.0070 | 0.0448 | 0.0137 | 1.1670 |
| gen_healthGood | 1.9535 | 0.3082 | 6.3375 | 0.0000 | 1.3492 | 2.5577 |
| gen_healthFair | 5.0624 | 0.4064 | 12.4572 | 0.0000 | 4.2657 | 5.8590 |
| gen_healthPoor | 9.6640 | 0.6090 | 15.8678 | 0.0000 | 8.4701 | 10.8580 |
The factor model of gen_health has 4 coefficients with 1 reference instead of just 1 numeric category because it creates dummy variables for each category with one refeerence category (excellent). This shows separate effect of each health category and the coefficient for gen_health good is 1.9 which is about ~1.8 of gen_numeric, so the naive approach is kinda misleading.
# Compute observed group means
group_means <- brfss_dv |>
summarise(mean_days = mean(menthlth_days), .by = c(gen_health, gen_numeric))
# Generate predictions from the naive model
pred_naive <- tibble(
gen_numeric = 1:5,
predicted = predict(model_naive, newdata = tibble(gen_numeric = 1:5))
)
ggplot() +
geom_point(data = group_means,
aes(x = gen_numeric, y = mean_days),
size = 4, color = "steelblue") +
geom_line(data = pred_naive,
aes(x = gen_numeric, y = predicted),
color = "tomato", linewidth = 1.2, linetype = "dashed") +
geom_point(data = pred_naive,
aes(x = gen_numeric, y = predicted),
size = 3, color = "tomato", shape = 17) +
scale_x_continuous(
breaks = 1:5,
labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
) +
labs(
title = "Observed Group Means (blue) vs. Naive Linear Fit (red)",
subtitle = "The naive model forces equal spacing between education levels",
x = "General health Status",
y = "Mean Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)Naive Linear Fit vs. Actual Group Means by gen health
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
# Fit full model with gen_health as factor
model_full <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = brfss_dv)
# Display coefficients
tidy(model_full, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 6. Full Model with General Health Dummy Variables",
col.names = c("Term", "Estimate", "Std Error", "t-stat", "p-value", "95% CI Low", "95% CI High")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std Error | t-stat | p-value | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5930 | 0.6304 | 15.2163 | 0.0000 | 8.3570 | 10.8289 |
| age | -0.0867 | 0.0060 | -14.4888 | 0.0000 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0.0000 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0.0000 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0.0000 | -0.7367 | -0.4365 |
| gen_healthVery good | 0.7899 | 0.2797 | 2.8247 | 0.0048 | 0.2417 | 1.3382 |
| gen_healthGood | 1.8436 | 0.2973 | 6.2020 | 0.0000 | 1.2608 | 2.4264 |
| gen_healthFair | 3.3953 | 0.4180 | 8.1234 | 0.0000 | 2.5759 | 4.2147 |
| gen_healthPoor | 5.3353 | 0.6829 | 7.8122 | 0.0000 | 3.9965 | 6.6742 |
Write out the fitted regression equation.
\[\widehat{\text{Mental Health Days}} = 9.593 - 0.08(\text{age}) + 1.73(\text{Female}) + 0.23(\text{physhlth\_days}) - 0.589(\text{sleep\_hrs}) + 3.4(\text{Fair}) + 1.8(\text{Good}) + 5.3(\text{Poor}) + 0.8(\text{Very good}))\]
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
| Observation | gen_health | Very Good | Good | Fair | Poor |
|---|---|---|---|---|---|
| 1 | Excellent | 0 | 0 | 0 | 0 |
| 2 | Very Good | 1 | 0 | 0 | 0 |
| 3 | Good | 0 | 1 | 0 | 0 |
| 4 | Fair | 0 | 0 | 1 | 0 |
| 5 | Poor | 0 | 0 | 0 | 1 |
# Extract gen_health coefficients
gen_health_coefs <- tidy(model_full) %>%
filter(str_detect(term, "gen_health")) %>%
select(term, estimate, p.value)
gen_health_coefs %>%
mutate(
across(where(is.numeric), ~ round(., 4))
)%>%
kable(caption = "Table 7. General Health Dummy Variable Coefficients") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | p.value |
|---|---|---|
| gen_healthVery good | 0.7899 | 0.0048 |
| gen_healthGood | 1.8436 | 0.0000 |
| gen_healthFair | 3.3953 | 0.0000 |
| gen_healthPoor | 5.3353 | 0.0000 |
Interpretations:
gen_healthVery good: Holding all other variables constant, individuals reporting “Very good” health have, on average, 0.8 more mentally unhealthy days compared to those with “Excellent” health. This coefficient is likely the smallest in magnitude.
gen_healthGood: Holding all other variables constant, individuals reporting “Good” health have, on average, 1.8 more mentally unhealthy days than those reporting “Excellent” health.
gen_healthFair: Holding all other variables constant (age, sex, physical health days, sleep hours), individuals who report “Fair” general health have, on average, 3.4 more mentally unhealthy days compared to those who report “Excellent” health (the reference group).
gen_health_Poor: Holding all other variables constant, individuals reporting “Poor” health have, on average, 5.3 more mentally unhealthy days compared to the “Excellent” health reference group. This is the largest difference among all health categories.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
# Extract gen_health coefficients with CIs
gen_health_plot <- tidy(model_full, conf.int = TRUE) %>%
filter(str_detect(term, "gen_health")) %>%
mutate(
term = str_remove(term, "gen_health"), # Clean labels
term = factor(term, levels = c("Very good", "Good", "Fair", "Poor")) # Order
)
# Create forest plot
ggplot(gen_health_plot, aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
geom_point(size = 3, color = "steelblue") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, color = "steelblue") +
labs(
title = "General Health Dummy Variable Coefficients",
subtitle = "Reference group: Excellent health",
x = "Coefficient Estimate (Additional Mentally Unhealthy Days)",
y = "General Health Category"
) +
theme_minimal(base_size = 13)Interpretation: - Excellent and Very good general health status both have negative coefficients, meaning they report fewer mentally unhealthy days than the Good group. The group Fair and Poor have positive coefficients, meaning they report more mentally unhealthy days than the Good group. Poor health differs the most from the reference group (Good), with the largest positive coefficient (~ 3.49 additional mentally unhealthy days). This is consistent with worse general health means more mentally unhealthy days.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
# Change reference group to Good
brfss_dv$genhealth_reref <- relevel(brfss_dv$gen_health, ref = "Good")
mod_genhealth_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + genhealth_reref,
data = brfss_dv)
tidy(mod_genhealth_reref, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Same Model, Different Reference Group (Reference: Good)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.4366 | 0.6298 | 18.1584 | 0e+00 | 10.2019 | 12.6713 |
| age | -0.0867 | 0.0060 | -14.4888 | 0e+00 | -0.0984 | -0.0749 |
| sexFemale | 1.7254 | 0.2055 | 8.3971 | 0e+00 | 1.3226 | 2.1282 |
| physhlth_days | 0.2314 | 0.0162 | 14.3057 | 0e+00 | 0.1997 | 0.2631 |
| sleep_hrs | -0.5866 | 0.0766 | -7.6607 | 0e+00 | -0.7367 | -0.4365 |
| genhealth_rerefExcellent | -1.8436 | 0.2973 | -6.2020 | 0e+00 | -2.4264 | -1.2608 |
| genhealth_rerefVery good | -1.0537 | 0.2581 | -4.0819 | 0e+00 | -1.5597 | -0.5476 |
| genhealth_rerefFair | 1.5517 | 0.3861 | 4.0186 | 1e-04 | 0.7947 | 2.3087 |
| genhealth_rerefPoor | 3.4917 | 0.6506 | 5.3673 | 0e+00 | 2.2164 | 4.7671 |
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
tribble(
~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
"Intercept", round(coef(model_full)[1], 3), round(coef(mod_genhealth_reref)[1], 3),
"Age coefficient", round(coef(model_full)[2], 3), round(coef(mod_genhealth_reref)[2], 3),
"Sex coefficient", round(coef(model_full)[3], 3), round(coef(mod_genhealth_reref)[3], 3),
"Physical health days", round(coef(model_full)[4], 3), round(coef(mod_genhealth_reref)[4], 3),
"Sleep hours", round(coef(model_full)[5], 3), round(coef(mod_genhealth_reref)[5], 3),
"R-squared", round(summary(model_full)$r.squared, 4), round(summary(mod_genhealth_reref)$r.squared, 4),
"Residual SE", round(summary(model_full)$sigma, 3), round(summary(mod_genhealth_reref)$sigma, 3)
) |>
kable(caption = "Comparing Models with Different Reference Groups") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Ref: Excellent | Ref: Good |
|---|---|---|
| Intercept | 9.5930 | 11.4370 |
| Age coefficient | -0.0870 | -0.0870 |
| Sex coefficient | 1.7250 | 1.7250 |
| Physical health days | 0.2310 | 0.2310 |
| Sleep hours | -0.5870 | -0.5870 |
| R-squared | 0.1694 | 0.1694 |
| Residual SE | 7.2080 | 7.2080 |
When we change the reference category for a categorical variable,only the dummy‑variable coefficients for that categorical variable change and all other coefficients remain exactly the same because changing the reference group only shifts the baseline (intercept).The slopes for age, sleep hours, physical health days represent marginal changes, not comparisons between categories.These slopes do not depend on which category is coded as 0 for the general health.
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
# Verify that predicted values are identical
pred_orig <- predict(model_full)
pred_reref <- predict(mod_genhealth_reref)
tibble(
Check = c("Maximum absolute difference in predictions",
"Correlation between predictions"),
Value = c(max(abs(pred_orig - pred_reref)),
cor(pred_orig, pred_reref))
) |>
kable(caption = "Verification: Predicted Values Are Identical") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Check | Value |
|---|---|
| Maximum absolute difference in predictions | 0 |
| Correlation between predictions | 1 |
Interpretation: Predicted values from both models are correlated because the underlying model fit does not change; only the coding of the categorical variable changes. A correlation of 1.00 confirms that the two sets of predictions lie on the exact same straight line.
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
# Reduced model (no gen health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)
summary(mod_reduced)$r.squared## [1] 0.1521948
## [1] 0.1515159
## [1] 0.1694246
## [1] 0.1680933
The reduced model (age, sex, physical unhealthy days, and sleep hours) had an R^2 of approximately 0.1521 and an adjusted R^2 of 0.1515, indicating that these predictors explain about 15% of the variation in mentally unhealthy days. After adding the general‑health dummy variables, the full model’s R^2 increased to 0.1694 and the adjusted R^2 to 0.1681 This improvement shows that general health explains additional variance in mentally unhealthy days.
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
# Partial F-test
f_test <- anova(mod_reduced, model_full)
f_test |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs | 4995 | 264715.2 | NA | NA | NA | NA |
| menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health | 4991 | 259335.4 | 4 | 5379.751 | 25.8838 | 0 |
Null Hypthesis
\[H_0: \beta_{\text{verygood}} = \beta_{\text{good}} = \beta_{\text{fair}} = \beta_{\text{poor}} = 0\]
Alternative Hypothesis
\[H_A: \text{At least one } \beta_gen_health \neq 0\]
The partial F‑test is highly significant: - F(4,4991)=25.88 - p<0.0001 Because the p‑value is less than 0.0001, we reject the null hypothesis.
Interpretation General health significantly improves the model predicting mentally unhealthy days. The four dummy variables collectively explain additional variance beyond age, sex, physical unhealthy days, and sleep hours. This aligns with the strong gradient in the coefficient plots: poorer general health is associated with more mentally unhealthy days.
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Anova(model_full, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12030.737 | 1 | 231.5357 | 0 |
| age | 10907.874 | 1 | 209.9258 | 0 |
| sex | 3663.847 | 1 | 70.5120 | 0 |
| physhlth_days | 10633.920 | 1 | 204.6535 | 0 |
| sleep_hrs | 3049.400 | 1 | 58.6868 | 0 |
| gen_health | 5379.751 | 4 | 25.8838 | 0 |
| Residuals | 259335.435 | 4991 | NA | NA |
The Type III ANOVA result for gen_health is consistent with the partial F‑test. Both analyses report the same sum of squares (5379.75), the same numerator degrees of freedom (4), the same F‑statistic (25.88), and the same p‑value (p < 0.0001). This confirms that general health, as a set of four dummy variables, significantly improves the prediction of mentally unhealthy days keeping hold of age, sex, physical unhealthy days, and sleep hours.
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
People who describe their overall health as “Poor” or “Fair” report substantially more days of poor mental health each month compared to those who say their health is “Excellent” with about 3 to 5 additional days, even when accounting for differences in age, sex, sleep, and physical health problems. People who rate their health as “Good” or “Very Good” have moderately more mentally unhealthy days than the Excellent reference group. Because it is cross sectional and we measured both physical and mental health at the same point in time, we cannot tell whether declining physical health leads to mental health problems, or mental distress affects how people perceive their physical health, or if both are influenced by other factors. However, there is a strong connection between physical and mental health.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
Comparison of Education vs. General Health as Predictors:
General health status appears to be more strongly associated with mentally unhealthy days than education level. When general health was added to the model, R² increased from 0.152 to 0.169, whereas education added less. Additionally, the magnitude of the general health coefficients is larger: individuals with “Poor” health report approximately 5.3 more mentally unhealthy days compared to those with “Excellent” health, while the largest education difference is much smaller between the lowest and highest education groups. The partial F-test for general health was also highly significant (F = 25.88, p < 0.0001), indicating strong evidence that these categories improve model fit. If only one predictor could be included in the final model building, general health would be the stronger choice because it captures more variation in the outcome. In practice, both variables reflect different influences on mental health. —
End of Lab Activity