Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026


Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
gen_health General health status (5 categories) Factor
marital_status Marital status (6 categories) Factor
educ_numeric Education as numeric code (1–4) Numeric
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.4.3
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.4.3
brfss_dv <- readRDS(
  "/Users/sarah/OneDrive/Documents/EPI 553/data/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()
**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**

Characteristic

N

N = 5,0001

Mentally unhealthy days (past 30)

5,000

3.8 (7.9)

Age (years)

5,000

54.9 (17.5)

Sex

5,000

Male

2,303 (46%)

Female

2,697 (54%)

General health status

5,000

Excellent

1,065 (21%)

Very good

1,803 (36%)

Good

1,426 (29%)

Fair

523 (10%)

Poor

183 (3.7%)

Marital status

5,000

Married

2,708 (54%)

Divorced

622 (12%)

Widowed

534 (11%)

Separated

109 (2.2%)

Never married

848 (17%)

Unmarried couple

179 (3.6%)

1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

ggplot(brfss_dv, aes(x= gen_health, y= menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health Status",
    x = "General Health Status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

brfss_dv %>%
  group_by(marital_status) %>%
  summarise(mean_menthlth = mean(menthlth_days, na.rm = TRUE),
            n =n()
  )%>%
  arrange(desc(mean_menthlth))
## # A tibble: 6 × 3
##   marital_status   mean_menthlth     n
##   <fct>                    <dbl> <int>
## 1 Separated                 6.22   109
## 2 Unmarried couple          6.07   179
## 3 Never married             5.28   848
## 4 Divorced                  4.49   622
## 5 Married                   3.10  2708
## 6 Widowed                   2.67   534

The separated group reports the highest mean of mentally unhealthy days (6.22), while the widowed group reported the lowest mean (2.67).

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

brfss_dv <- brfss_dv %>%
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very Good" ~ 2,
      gen_health == "Good" ~ 3,
      gen_health == "Fair" ~ 4,
      gen_health == "Poor" ~ 5
    )
  )

model_2a <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)
summary(model_2a)
## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6309 -5.0626 -1.4943 -0.0626 28.5057 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.2899     0.3404  -0.852    0.395    
## gen_health_numeric   1.7842     0.1175  15.186   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.335 on 3195 degrees of freedom
##   (1803 observations deleted due to missingness)
## Multiple R-squared:  0.06732,    Adjusted R-squared:  0.06703 
## F-statistic: 230.6 on 1 and 3195 DF,  p-value: < 2.2e-16

The coefficient equals 1.7842. For each one-unit worsening in general health status, the number of mentally unhealthy days increases by about 1.78 days.

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

model_2b <- lm(menthlth_days ~ gen_health, data = brfss_dv)

summary(model_2b)
## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

The factor version uses four coefficients because general health has 5 categories, each non-reference category needs its own comparison with the reference group. The numeric model uses 1 coefficient since it treats general health as a continuous variable. This is misleading because differences between categories are not equal.


Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation. \[Menthlth_days = 9.5930 + -0.0867(age) + 1.7254 (female) + 0.2314 (physhlth_days) + -0.5866(sleep) + 0.7899(very good) + 1.8436(good) + 3.3953(fair) + 5.3353(poor) + \varepsilon\]

mod_gen_health <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)

tidy(mod_gen_health, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with General Health Status Dummy Variables",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model with General Health Status Dummy Variables
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 9.5930 0.6304 15.2163 0.0000 8.3570 10.8289
age -0.0867 0.0060 -14.4888 0.0000 -0.0984 -0.0749
sexFemale 1.7254 0.2055 8.3971 0.0000 1.3226 2.1282
physhlth_days 0.2314 0.0162 14.3057 0.0000 0.1997 0.2631
sleep_hrs -0.5866 0.0766 -7.6607 0.0000 -0.7367 -0.4365
gen_healthVery good 0.7899 0.2797 2.8247 0.0048 0.2417 1.3382
gen_healthGood 1.8436 0.2973 6.2020 0.0000 1.2608 2.4264
gen_healthFair 3.3953 0.4180 8.1234 0.0000 2.5759 4.2147
gen_healthPoor 5.3353 0.6829 7.8122 0.0000 3.9965 6.6742

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

The reference group for gen_health is Excellent health. Holding all other variables constant, adults reporting very good health have 0.79 more mentally unhealthy days per month compared to those reporting excellent health. Adults reporting good health have 1.84 more mentally unhealthy days per month than those reporting excellent health. Adults reporting fair health have 3.40 more mentally unhealthy days per month compared to those reporting excellent health. Adults reporting poor health have 5.34 more mentally unhealthy days per month than those reporting excellent health.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

The poor health group differs the most from the reference group.

tidy(mod_gen_health, conf.int = TRUE) %>%
  filter(str_detect(term, "gen_health")) %>%
   mutate(term = str_replace(term, "gen_health", "")) %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = 0.15, color = "steelblue") +
  coord_flip() +
  labs(
    title = "Coefficient Plot for General Health Categories",
    subtitle = "Reference group: Excellent health",
    x = "General Health Category",
    y = "Estimated Difference in Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

brfss_dv$gen_health_ref <- relevel(brfss_dv$gen_health, ref = "Good")

model_4a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_ref, data = brfss_dv)

tidy(model_4a, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Same Model, Different Reference Group (Reference: Good)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 11.4366 0.6298 18.1584 0e+00 10.2019 12.6713
age -0.0867 0.0060 -14.4888 0e+00 -0.0984 -0.0749
sexFemale 1.7254 0.2055 8.3971 0e+00 1.3226 2.1282
physhlth_days 0.2314 0.0162 14.3057 0e+00 0.1997 0.2631
sleep_hrs -0.5866 0.0766 -7.6607 0e+00 -0.7367 -0.4365
gen_health_refExcellent -1.8436 0.2973 -6.2020 0e+00 -2.4264 -1.2608
gen_health_refVery good -1.0537 0.2581 -4.0819 0e+00 -1.5597 -0.5476
gen_health_refFair 1.5517 0.3861 4.0186 1e-04 0.7947 2.3087
gen_health_refPoor 3.4917 0.6506 5.3673 0e+00 2.2164 4.7671

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

The coefficient for age, sex, physhlth_days, and sleep_hrs are the same in both models. Changing the reference group only affects the intercept and dummy-variable coefficients.

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_orig <- predict(mod_gen_health)
pred_reref <- predict(model_4a)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Verification: Predicted Values Are Identical
Check Value
Maximum absolute difference in predictions 0
Correlation between predictions 1

Changing the reference group shifts the intercept and dummy coefficients. These changes cancel each other out when computing predicted values. All continuous predictors keep the same slopes.


Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

reduced_model <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data= brfss_dv)

r2_table <- tibble(
  Model = c("Reduced model", "Full model"),
  R2 = c(summary(reduced_model)$r.squared,
         summary(mod_gen_health)$r.squared),
  Adj_R2 = c(summary(reduced_model)$adj.r.squared,
             summary(mod_gen_health)$adj.r.squared)
) |> 
  mutate(across(where(is.numeric), round, 4))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(where(is.numeric), round, 4)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
r2_table |> 
  kable(caption = "R² and Adjusted R² for Reduced vs. Full Model") |> 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R² for Reduced vs. Full Model
Model R2 Adj_R2
Reduced model 0.1522 0.1515
Full model 0.1694 0.1681

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

f_test <- anova(reduced_model, mod_gen_health)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Status Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Does General Health Status Improve the Model?
term df.residual rss df sumsq statistic p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs 4995 264715.2 NA NA NA NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health 4991 259335.4 4 5379.751 25.8838 0

H0:_Very good = _Excellent = _Fair = _Poor = 0

Ha: At least one _gen_health =! 0

F-statistic = 25.8838 Degrees of freedom = 4 P-value = <.001

We reject the null hypothesis since the p value is far less than 0.05. General health status as a whole significantly improves the prediction of mentally unhealthy days, even after adjusting for age, sex, physical health, and sleep.

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Anova(mod_gen_health, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Type III ANOVA
term sumsq df statistic p.value
(Intercept) 12030.737 1 231.5357 0
age 10907.874 1 209.9258 0
sex 3663.847 1 70.5120 0
physhlth_days 10633.920 1 204.6535 0
sleep_hrs 3049.400 1 58.6868 0
gen_health 5379.751 4 25.8838 0
Residuals 259335.435 4991 NA NA

Theh Type III ANOVA result for gen_health is consistent with the partial F-test.


Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

  • Identify which general health groups differ most from the reference
  • State the direction and approximate magnitude of the association
  • Appropriately acknowledge the cross-sectional nature of the data
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

Adults reporting poorer general health also tend to report more days of poor mental health. Compared with people in excellent health, those in fair or poor health experience several additional mentally unhealthy days each month, with the largest different seem among poor health. People in good and very good health status show smaller differences, but still report slightly more mentally unhealthy days. Because the data was collected at one point in time, we cannot say that general health status causes changes in mental health.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

General health status shows a stronger relationship with mentally unhealthy days than education. In the education model, differences across education levels are small, and only college graduates report noticeably fewer mentally unhealthy days than the lowest education group. In contrast, people with fair or poor general health report several more mentally unhealthy days than those in excellent health. If I were to build a final model I would include general health as it adds more explanatory value.


End of Lab Activity