In-Class Lab Activity

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS("data/brfss_dv_2020.rds")

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)") |>
  as_flex_table()

Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mental Health Days by General Health

Those with poor general health reported the most mentally unhealthy days. There is a clear positive association between worsening health and the number of mentally unhealthy days, which is expected.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

brfss_dv |>
group_by(marital_status) |> mutate(percent = mean(menthlth_days)) |>
mutate(percent = round(percent, digits = 2)) |>
ggplot(aes(x = marital_status, y = percent, fill = menthlth_days)) +
  geom_bar(position="dodge", stat="identity") +
  geom_text(stat = "identity", aes(label = after_stat(y)), vjust = -0.3) +
  labs(
    title = "Distribution of Education Level",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Education Level",
    y = "Mean Days with Poor Mental Health"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Distribution of Poor Mental Health Days by Marital Status in Analytic Sample

desc_table <- brfss_dv %>%
  select(menthlth_days, marital_status) %>%
  group_by(marital_status) %>%
  summarise(
    N = n(),
    `Mean Mental Health Days` = round(mean(menthlth_days),2))

desc_table %>%
  kable(caption = "Descriptive Statistics by Marital Status",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Descriptive Statistics by Marital Status
marital_status	N	Mean Mental Health Days
Married	2708	3.10
Divorced	622	4.49
Widowed	534	2.67
Separated	109	6.22
Never married	848	5.28
Unmarried couple	179	6.07

Those who identify as separated have the highest mean number of days with poor mental health, whereas those who are widowed have the lowest.

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

 brfss_dv_2 <- brfss_dv |>
  mutate(
    gen_health_num = factor(case_when(
      gen_health == "Poor" ~ 1,
      gen_health == "Fair" ~2,
      gen_health == "Good" ~3,
      gen_health == "Very good" ~ 4,
      gen_health == "Excellent" ~ 5), levels = c(1,2,3,4,5)))

class(brfss_dv_2$gen_health_num)

## [1] "factor"

brfss_dv_2$gen_health_num <- as.numeric(brfss_dv_2$gen_health_num)

model_1 <- lm(menthlth_days ~ gen_health_num, data = brfss_dv_2)

tidy(model_1, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	10.4751	0.3894	26.9013	0	9.7118	11.2385
gen_health_num	-1.8578	0.1036	-17.9259	0	-2.0610	-1.6547

The coefficient for general health is -1.86. This means that a one unit decrease in general health, meaning worsening health, is associated with 1.86 increased number of mentally unhealthy days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

model_2 <- lm(menthlth_days ~ gen_health, data = brfss_dv_2)

tidy(model_2, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

This new model that treats general health as a factor shows multiple coefficients because each is treated as a separate ordinal group. We can now see significance when comparing each group to eachother. The native model may be misleading because it assumes equal distance between each ordinal group.

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

model_3 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv_2)

tidy(model_3, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with Dichotomous Dummy Variable: General Health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with Dichotomous Dummy Variable: General Health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

\[\text{menthlth_days} = 9.59 + -0.09 \cdot \text{age} + 1.73 \cdot \text{sex} + 0.23 \cdot \text{physhlth_days} + 0.79 \cdot \text{genhlth_verygood} + 1.84 \cdot \text{genhlth_good} + 3.40 \cdot \text{genhlth_fair} + 5.34 \cdot \text{genhlth_poor}+ \varepsilon\] 3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

Those with very good general health had 0.79 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with good general health had 1.84 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with fair general health had 3.40 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.
Those with poor general health had 5.34 more days of poor mental health compared to thoes with excellent general health, adjusting for other variables.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

pred_educ <- ggpredict(model_3, terms = c("age [20:80]", "gen_health"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Age and General Health",
    subtitle = "Parallel lines: same slopes for age, different intercepts by general health",
    x        = "Age (years)",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "gen_health",
    fill     = "gen_health"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Those with poor general health differ the most from the reference group, those with excellent general health.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_dv_2$gen_health_reref <- relevel(brfss_dv_2$gen_health, ref = "Good")

model_4 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_reref, data = brfss_dv_2)

tidy(model_4, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good general health)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: Good general health)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_health_rerefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_health_rerefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_health_rerefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_health_rerefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not? Yes, the other continuous variable coefficients are the same. This is because they are not using general health as a reference group.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_orig <- predict(model_3)
pred_reref <- predict(model_4)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

This shows that there is no difference between the predictors from both models. Again, this makes sense because general health is in both models, but this variabel is not the reference group for any other predictors.

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

model_5 <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv_2)

r_sq <- summary(model_5)$r.squared
adj_r_sq <- summary(model_5)$adj.r.squared

r_sq_full <- summary(model_3)$r.squared
adj_r_sq_full <- summary(model_3)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Partial_Model  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")),
  Full_Model  = c(
    round(r_sq_full, 4),
    round(adj_r_sq_full, 4),
    paste0(round(r_sq_full * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Partial_Model	Full_Model
R²	0.1522	0.1694
Adjusted R²	0.1515	0.1681
Variance Explained	15.22%	16.94%

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

f_test <- anova(model_5, model_3)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does General Health Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0

The null hypothesis is that general health will not imporve the model, considering the other covariates are already in the model, whereas the alternative hypothesis is that it will improve the model. The f-statistic of the full model is 25.88, with 4 degrees of freedom and a p-value <0.05. We can conclude that there is a significant improvement to the full model and can reject the null hypothesis.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(model_3, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

Yes, the results are identical.

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

This analysis showed that general health is directly associated with mental health. The model uses cross-sectional data, meaning there is no timeline, so we cannot determine which variable is impacting or causing the other and we can only state that they have an association. Those with the best general health had the fewest days with poor mental health, whereas those with the worst general health had the most days with poor mental health.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

The analyses showed associations between education and general health with the outcome of interest, days with poor mental health. General health had a stronger association with mental health, as evidenced by the impact that changes to the variables had on the mean days with poor mental health. If I were building a final model, I would include education as well as general health and the other demographic factors to build the strongest model possible.

Dummy Variables in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Matthew Goldman

March 23, 2026