Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)

## Warning: package 'broom' was built under R version 4.5.2

library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(flextable)

## Warning: package 'flextable' was built under R version 4.5.3

## 
## Attaching package: 'flextable'
## 
## The following object is masked from 'package:gtsummary':
## 
##     continuous_summary
## 
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## 
## The following object is masked from 'package:purrr':
## 
##     compose

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'
## 
## The following objects are masked from 'package:flextable':
## 
##     highlight, style
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(GGally)

## Warning: package 'GGally' was built under R version 4.5.2

brfss_dv <- readRDS(
  "C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

brfss_dv |>
  select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      sleep_hrs      ~ "Sleep (hours/night)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Sleep (hours/night)	5,000	7.0 (1.4)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
1Mean (SD); n (%)

ggplot(brfss_dv, aes(x = gen_health, y = menthlth_days)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General Health",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General Health",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

ggplot(brfss_dv, aes(x = marital_status, y = menthlth_days, fill = menthlth_days)) +
  geom_bar(stat = "summary", fun= "mean", alpha = 0.85) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mean Mentally Unhealthy Days by Marital Status",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect? The poor general health group has the most mentally unhealthy days. This is consistent with what I would expect. 1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest? Separated has the highest mean and widowed has the lowest mean. —

Task 2: The Naive Approach (10 points)

brfss_dv <- brfss_dv |>
mutate(
 gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good"  ~ 3,
      gen_health == "Fair"  ~ 4,
      gen_health == "Poor" ~ 5,
      TRUE                   ~ NA_real_
    )
 )

# Fit model
naive_mod <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv)

tidy(naive_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Naive Model: General Health Treated as Continuous",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naive Model: General Health Treated as Continuous
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-0.6718	0.2705	-2.4840	0.013	-1.2021	-0.1416
gen_health_numeric	1.8578	0.1036	17.9259	0.000	1.6547	2.0610

# Fit model
reg_mod <- lm(menthlth_days ~ gen_health, data = brfss_dv)

tidy(reg_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Regular Model: General Health Treated as Factor",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Regular Model: General Health Treated as Factor
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.1174	0.2332	9.0790	0.0000	1.6602	2.5746
gen_healthVery good	0.5903	0.2941	2.0070	0.0448	0.0137	1.1670
gen_healthGood	1.9535	0.3082	6.3375	0.0000	1.3492	2.5577
gen_healthFair	5.0624	0.4064	12.4572	0.0000	4.2657	5.8590
gen_healthPoor	9.6640	0.6090	15.8678	0.0000	8.4701	10.8580

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it. The coefficient is 1.8578, meaning that for every 1 unit increase in general health, the number of poor mental health days increases by 1.8578 days. 2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading. The factor version uses 4 coefficients, because it is using excellent general health as a reference category, and each level needs to be compared to something. The naive approach may be misleading, because general health level is not a numeric variable. The numeric codes to each level in general health have no significant meaning, meaning running a regression with them doesn’t really have that much of a significant meaning. You can’t really tell the distance between “excellent” general health and “very good” general health, for example. —

Task 3: Dummy Variable Regression with General Health (25 points)

# Fit model
new_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health, data = brfss_dv)

tidy(new_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "New Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

New Model
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

ggcoef_model(new_mod,
  include = c("gen_health"),
  variable_labels = c(
    gen_health = "General Health"),
  facet_labeller = ggplot2::label_wrap_gen(10) 
)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation. menthlth_days = 9.5930 - 0.0867(age) + 1.7254(female) + 0.2314(physhlth_days) - 0.5866(sleep_hrs) + 0.7899 (very good gen_health) + 1.8436(good gen_health) + 3.3953 (fair gen_health) + 5.3353(poor gen_health) 3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.” Compared to those with excellent general health, those with very good general health have 0.7899 more poor mental health days, holding all other variables constant. Compared to those with excellent general health, those with good general health have 1.8436 more poor mental health days, holding all other variables constant. Compared to those with excellent general health, those with poor general health have 3.3953 more poor mental health days, holding all other variables constant. Compared to those with excellent general health, those with poor general health have 5.3353 more poor mental health days, holding all other variables constant. 3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group? Poor general health group differs the most from the reference group. —

Task 4: Changing the Reference Group (15 points)

# Change reference group to Good gen_health
brfss_dv$gen_reref <- relevel(brfss_dv$gen_health, ref = "Good")

mod_gen_reref <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_reref, data = brfss_dv)

tidy(mod_gen_reref, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: Good)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_rerefExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_rerefVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_rerefFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_rerefPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

pred_orig <- predict(new_mod)
pred_reref <- predict(mod_gen_reref)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not? The education coefficients are not the same, because the output given is the conclusion drawn for mental health days when compared to good general health, not excellent general health. The other predictors stay the same, because the rest of the model stayed the same, the only thing that changed was the reference category for general health. 4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions. Changing the reference group does not change the predictions, because it only changes the interpretation of the dummy variables. Everything else stays constant, besides the intercept, including the data and variables used. —

Task 5: Partial F-Test for General Health (20 points)

# Fit model
reduced_mod <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv)

tidy(reduced_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "New Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

New Model
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	10.5041	0.6124	17.1531	9.3036	11.7046
age	-0.0773	0.0060	-12.9721	-0.0890	-0.0656
sexFemale	1.6864	0.2074	8.1296	1.2797	2.0931
physhlth_days	0.3172	0.0132	24.0152	0.2913	0.3431
sleep_hrs	-0.6330	0.0772	-8.2029	-0.7843	-0.4817

summary(reduced_mod)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
##     data = brfss_dv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.314  -3.520  -1.805   0.159  32.013 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.504082   0.612374  17.153  < 2e-16 ***
## age           -0.077335   0.005962 -12.972  < 2e-16 ***
## sexFemale      1.686413   0.207440   8.130 5.38e-16 ***
## physhlth_days  0.317247   0.013210  24.015  < 2e-16 ***
## sleep_hrs     -0.633026   0.077171  -8.203 2.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1515 
## F-statistic: 224.2 on 4 and 4995 DF,  p-value: < 2.2e-16

summary(new_mod)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

f_test <- anova(reduced_mod, new_mod)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does General Health Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0

# Type III using car::Anova()
anova_type3 <- Anova(new_mod, type = "III")
print(anova_type3)

## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df F value    Pr(>F)    
## (Intercept)    12031    1 231.536 < 2.2e-16 ***
## age            10908    1 209.926 < 2.2e-16 ***
## sex             3664    1  70.512 < 2.2e-16 ***
## physhlth_days  10634    1 204.654 < 2.2e-16 ***
## sleep_hrs       3049    1  58.687 2.207e-14 ***
## gen_health      5380    4  25.884 < 2.2e-16 ***
## Residuals     259335 4991                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a). The \(R^2\) for the full model is 0.1694 and the \(R^2\) for the reduced model is 0.1522. The adjusted \(R^2\) for the full model is 0.1681 and the adjusted \(R^2\) for the reduced model is 0.1515. 5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion. The null hypothesis is that general health does not significantly improve the model, and the alternative hypothesis is that general health does significantly improve the model. The F-statistic is 25.8838, the degrees of freedom are 4, and the p-value is 0. This means you reject the null hypothesis. General health does significantly improve the model. 5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent? Yes, they are consistent. —

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”) Poor and fair general health groups differ most from the excellent general health groups based on the linear regression run. As the general health level worsens, the more bad mental health days that are expected compared to the excellent general health. Compared to those with excellent general health, those with poor general health have 3.3953 more poor mental health days, holding all other variables constant. Compared to those with excellent general health, those with poor general health have 5.3353 more poor mental health days, holding all other variables constant. Since BRFSS is cross-sectional, it is hard to establish causality between general health and the number of bad mental health days, as one cannot determine if the bad mental health days came first or the general health status of the person. 6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison. General health appears to be more strongly associated with mental health days. In order to decide which to include in the final model, you could test each different model and determine their AIC/BIC in order to see which is a better fit. In addition, you could do partial F-tests to see if adding each categorical predictor improves the model or not. —

End of Lab Activity

Lab09_DummyVariables_AbigailMiller

Abigail Miller

2026-03-26