Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

Part 2: In-Class Lab Activity

EPI 553 — Dummy Variables Lab Due: End of class, March 23, 2026

Instructions

In this lab, you will practice constructing, fitting, and interpreting regression models with dummy variables using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(ggeffects)

brfss_dv <- readRDS(
  "C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 9/brfss_dv_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, age, sex, gen_health, marital_status) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      gen_health     ~ "General health status",
      marital_status ~ "Marital Status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample, For Specific Variables (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample, For Specific Variables (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
Marital Status	5,000
Married		2,708 (54%)
Divorced		622 (12%)
Widowed		534 (11%)
Separated		109 (2.2%)
Never married		848 (17%)
Unmarried couple		179 (3.6%)
1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

1b Ans

The boxplot shows that people in the “Poor” general health group report the highest number of mentally unhealthy days, with a much higher median and wider spread compared to other groups. As general health moves from “Excellent” to “Poor,” the number of mentally unhealthy days steadily increases. This pattern is consistent with what we would expect, since individuals with poorer physical health are more likely to experience stress, illness, and emotional difficulties.

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

1c Ans The separated group has the highest mean number of mentally unhealthy days, while the widowed group has the lowest mean.

mean_table <- brfss_dv %>%
  group_by(marital_status) %>%
  summarise(
    mean_menthlth_days = mean(menthlth_days, na.rm = TRUE)
  )

mean_table

## # A tibble: 6 × 2
##   marital_status   mean_menthlth_days
##   <fct>                         <dbl>
## 1 Married                        3.10
## 2 Divorced                       4.49
## 3 Widowed                        2.67
## 4 Separated                      6.22
## 5 Never married                  5.28
## 6 Unmarried couple               6.07

ggplot(mean_table, aes(x = marital_status, y = mean_menthlth_days)) +
  geom_bar(stat = "identity", fill = "gold") +
  labs(
    title = "Mean Mentally Unhealthy Days by Marital Status",
    x = "Marital Status",
    y = "Mean Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

21 Ans

The coefficient for gen_health_numeric is 1.8578. This means that for every one-category worsening in general health (for example, from “Good” to “Fair”), the number of mentally unhealthy days increases by approximately 1.86 days on average. The result is statistically significant (p < 0.001), indicating a strong positive association between poorer general health and more mentally unhealthy days.

# Simple regression model

brfss_dv1 <- brfss_dv %>%
  mutate(
    gen_health_numeric = case_when(
      gen_health == "Excellent" ~ 1,
      gen_health == "Very good" ~ 2,
      gen_health == "Good" ~ 3,
      gen_health == "Fair" ~ 4,
      gen_health == "Poor" ~ 5,
      TRUE ~ NA_real_
    )
  )

model_naive <- lm(menthlth_days ~ gen_health_numeric, data = brfss_dv1)

summary(model_naive)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = brfss_dv1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6718     0.2705  -2.484    0.013 *  
## gen_health_numeric   1.8578     0.1036  17.926   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared:  0.06041,    Adjusted R-squared:  0.06022 
## F-statistic: 321.3 on 1 and 4998 DF,  p-value: < 2.2e-16

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

2b ANs

The factor model uses 4 coefficients because there are 5 categories, and one (“Excellent”) is the reference. Each coefficient shows how much higher the mentally unhealthy days are compared to “Excellent.” For example, people with Poor health have about 9.66 more days.

The numeric model assumes each step increases by the same amount (1.86), but the factor model shows the increases are not equal. So, the numeric approach can be misleading because it oversimplifies the relationship.

# Fit model with gen_health as a factor

model_factor <- lm(menthlth_days ~ gen_health, data = brfss_dv1)

summary(model_factor)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = brfss_dv1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

  `menthlth_days^ = B0 +B1(age) + B2(sex) + B3(physhlth_days) + B4(sleep_hrs) + B5(Very good)  + B6(Good) + B7(fair) + B8(Poor)`
  
  Here, "Excellent" is the reference group

mod_gen <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
               data = brfss_dv1)

tidy(mod_gen, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model with General Health Dummy Variables (Reference: Excellent)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model with General Health Dummy Variables (Reference: Excellent)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5930	0.6304	15.2163	0.0000	8.3570	10.8289
age	-0.0867	0.0060	-14.4888	0.0000	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0.0000	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0.0000	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0.0000	-0.7367	-0.4365
gen_healthVery good	0.7899	0.2797	2.8247	0.0048	0.2417	1.3382
gen_healthGood	1.8436	0.2973	6.2020	0.0000	1.2608	2.4264
gen_healthFair	3.3953	0.4180	8.1234	0.0000	2.5759	4.2147
gen_healthPoor	5.3353	0.6829	7.8122	0.0000	3.9965	6.6742

b_gen <- round(coef(mod_gen), 4)
ci_gen <- round(confint(mod_gen), 4)

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

3b Ans In this model, “Excellent” health is the reference group. Compared to this group, individuals with Very good health report about 0.79 more mentally unhealthy days, those with Good health report about 1.84 more days, those with Fair health report about 3.40 more days, and those with Poor health report about 5.34 more days. This shows a clear pattern where mentally unhealthy days increase as general health worsens, holding all other variables constant.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

3c Ans The “Poor” general health group differs the most from the reference group (“Excellent”), as it has the largest coefficient (farthest from zero).

coef_df <- tidy(mod_gen, conf.int = TRUE) %>%
  filter(grepl("gen_health", term))

coef_df$term <- gsub("gen_health", "", coef_df$term)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point(color = "purple") +
  geom_errorbar(
    aes(xmin = conf.low, xmax = conf.high),
    width = 0.2,
    orientation = "y"
  ) +
  labs(
    title = "Effect of General Health on Mentally Unhealthy Days",
    x = "Coefficient (Estimate)",
    y = "General Health Category"
  ) +
  theme_minimal()

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

#Change reference group to "Good"


brfss_dv1$gen_health <- relevel(brfss_dv1$gen_health, ref = "Good")


mod_gen_ref_good <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
  data = brfss_dv1)

tidy(mod_gen_ref_good, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Same Model, Different Reference Group (Reference: Good)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Same Model, Different Reference Group (Reference: Good)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	11.4366	0.6298	18.1584	0e+00	10.2019	12.6713
age	-0.0867	0.0060	-14.4888	0e+00	-0.0984	-0.0749
sexFemale	1.7254	0.2055	8.3971	0e+00	1.3226	2.1282
physhlth_days	0.2314	0.0162	14.3057	0e+00	0.1997	0.2631
sleep_hrs	-0.5866	0.0766	-7.6607	0e+00	-0.7367	-0.4365
gen_healthExcellent	-1.8436	0.2973	-6.2020	0e+00	-2.4264	-1.2608
gen_healthVery good	-1.0537	0.2581	-4.0819	0e+00	-1.5597	-0.5476
gen_healthFair	1.5517	0.3861	4.0186	1e-04	0.7947	2.3087
gen_healthPoor	3.4917	0.6506	5.3673	0e+00	2.2164	4.7671

summary(mod_gen_ref_good)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.436584   0.629825  18.158  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthExcellent -1.843601   0.297260  -6.202 6.03e-10 ***
## gen_healthVery good -1.053654   0.258126  -4.082 4.54e-05 ***
## gen_healthFair       1.551682   0.386128   4.019 5.94e-05 ***
## gen_healthPoor       3.491746   0.650560   5.367 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

4b Ans

The coefficients for age, sex, physically unhealthy days, and sleep hours are the same in both models. This is because changing the reference group only affects the gen_health categories, not the other variables. The data and model remain the same, so the relationship between these continuous variables and the outcome does not change.

tribble(
  ~Quantity, ~`Ref: Excellent`, ~`Ref: Good`,
  "Intercept", round(coef(mod_gen)[1], 3), round(coef(mod_gen_ref_good)[1], 3),
  "Age coefficient", round(coef(mod_gen)[2], 3), round(coef(mod_gen_ref_good)[2], 3),
  "Sex coefficient", round(coef(mod_gen)[3], 3), round(coef(mod_gen_ref_good)[3], 3),
  "Physical health days", round(coef(mod_gen)[4], 3), round(coef(mod_gen_ref_good)[4], 3),
  "Sleep hours", round(coef(mod_gen)[5], 3), round(coef(mod_gen_ref_good)[5], 3),
  "R-squared", round(summary(mod_gen)$r.squared, 4), round(summary(mod_gen_ref_good)$r.squared, 4),
  "Residual SE", round(summary(mod_gen)$sigma, 3), round(summary(mod_gen_ref_good)$sigma, 3)
) |>
  kable(caption = "Comparing Models with Different Reference Groups") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparing Models with Different Reference Groups
Quantity	Ref: Excellent	Ref: Good
Intercept	9.5930	11.4370
Age coefficient	-0.0870	-0.0870
Sex coefficient	1.7250	1.7250
Physical health days	0.2310	0.2310
Sleep hours	-0.5870	-0.5870
R-squared	0.1694	0.1694
Residual SE	7.2080	7.2080

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

4c Ans The predicted values from both models are identical, as shown by a correlation of 1 and a maximum difference of 0. This happens because changing the reference group only changes how the categorical variable (gen_health) is coded, not the actual information in the data. The model is just re-expressing the same relationships in a different way, so the final predicted values remain exactly the same.

# Verify that predicted values are identical
pred_orig <- predict(mod_gen)
pred_reref <- predict(mod_gen_ref_good)

tibble(
  Check = c("Maximum absolute difference in predictions",
            "Correlation between predictions"),
  Value = c(max(abs(pred_orig - pred_reref)),
            cor(pred_orig, pred_reref))
) |>
  kable(caption = "Verification: Predicted Values Are Identical") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Verification: Predicted Values Are Identical
Check	Value
Maximum absolute difference in predictions	0
Correlation between predictions	1

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

5a Ans

The reduced model has an R2 of approximately 0.153 and an adjusted R2 of approximately 0.152. The full model has an R2 of 0.169 and an adjusted R2 of 0.168.

# Reduced model (no gen_health)
mod_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = brfss_dv1)

summary(mod_reduced)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs, 
##     data = brfss_dv1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.314  -3.520  -1.805   0.159  32.013 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.504082   0.612374  17.153  < 2e-16 ***
## age           -0.077335   0.005962 -12.972  < 2e-16 ***
## sexFemale      1.686413   0.207440   8.130 5.38e-16 ***
## physhlth_days  0.317247   0.013210  24.015  < 2e-16 ***
## sleep_hrs     -0.633026   0.077171  -8.203 2.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.28 on 4995 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1515 
## F-statistic: 224.2 on 4 and 4995 DF,  p-value: < 2.2e-16

summary(mod_gen)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = brfss_dv1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

5b Ans The null hypothesis is that all coefficients for gen_health are equal to zero (gen_health does not improve the model), while the alternative hypothesis is that at least one of the coefficients is not zero. The partial F-test gives an F-statistic of 25.88 with 4 and 4991 degrees of freedom, and a p-value less than 0.001. Since the p-value is very small, we reject the null hypothesis. This means that general health as a whole significantly improves the model and contributes to explaining variation in mentally unhealthy days.

# Partial F-test
f_test <- anova(mod_reduced, mod_gen)

f_test |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Does General Health Improve the Model?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Does General Health Improve the Model?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ age + sex + physhlth_days + sleep_hrs	4995	264715.2	NA	NA	NA	NA
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health	4991	259335.4	4	5379.751	25.8838	0

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

5c Ans The results are consistent. The Type III ANOVA shows an F-statistic of 25.88 for gen_health with a p-value less than 0.001, which is exactly the same as the partial F-test. This confirms that gen_health as a group significantly improves the model, and both methods lead to the same conclusion.

Anova(mod_gen, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type III ANOVA: Testing Each Predictor's Contribution") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type III ANOVA: Testing Each Predictor’s Contribution
term	sumsq	df	statistic	p.value
(Intercept)	12030.737	1	231.5357	0
age	10907.874	1	209.9258	0
sex	3663.847	1	70.5120	0
physhlth_days	10633.920	1	204.6535	0
sleep_hrs	3049.400	1	58.6868	0
gen_health	5379.751	4	25.8838	0
Residuals	259335.435	4991	NA	NA

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

6a Ans

People who reported poorer general health tended to have more mentally unhealthy days compared to those in excellent health. Those in poor health had about 5.3 more days, those in fair health had about 3.4 more days, and those in good health had about 1.8 more days, while those in very good health had about 0.8 more days. This shows a clear pattern where mental health worsens as general health declines. However, since the data was collected at one point in time, we cannot say that surely poor health causes worse mental health, only that they are related.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

6b Ans General health appears to be more strongly associated with mentally unhealthy days than education. The differences across general health groups are much larger (for example, up to about 5.3 more days for those in poor health), while the education groups show smaller differences, with only college graduates having about 1.1 fewer days compared to the reference group. This suggests that general health has a stronger relationship with mental health outcomes. If building a final model, I would prioritize including general health, but I would also consider keeping education if it is important for controlling background differences or potential confounding.

End of Lab Activity

Dummy Variables in Regression

EPI 553 —Lab Activity

Tahia Sufyani

March 23, 2026