Dummy Variables

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

plot(pressure)

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`gen_health`	General health status (5 categories)	Factor
`marital_status`	Marital status (6 categories)	Factor
`educ_numeric`	Education as numeric code (1–4)	Numeric

# Load the dataset
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(ggplot2)
library(flextable)

## Warning: package 'flextable' was built under R version 4.5.3

## 
## Attaching package: 'flextable'
## 
## The following object is masked from 'package:gtsummary':
## 
##     continuous_summary
## 
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## 
## The following object is masked from 'package:purrr':
## 
##     compose

brfss_dv <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_dv_2020.rds")

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes menthlth_days, age, sex, gen_health, and marital_status. Show means (SD) for continuous variables and n (%) for categorical variables.

brfss_dv |>
  select(menthlth_days, physhlth_days,  age, sex,
         education, gen_health) |>
  tbl_summary(
    label = list(
      menthlth_days  ~ "Mentally unhealthy days (past 30)",
      physhlth_days  ~ "Physically unhealthy days (past 30)",
      age            ~ "Age (years)",
      sex            ~ "Sex",
      education      ~ "Education level",
      gen_health     ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.9)
Physically unhealthy days (past 30)	5,000	3.3 (7.9)
Age (years)	5,000	54.9 (17.5)
Sex	5,000
Male		2,303 (46%)
Female		2,697 (54%)
Education level	5,000
Less than HS		290 (5.8%)
HS graduate		1,348 (27%)
Some college		1,340 (27%)
College graduate		2,022 (40%)
General health status	5,000
Excellent		1,065 (21%)
Very good		1,803 (36%)
Good		1,426 (29%)
Fair		523 (10%)
Poor		183 (3.7%)
1Mean (SD); n (%)

1b. (5 pts) Create a boxplot of menthlth_days by gen_health. Which group reports the most mentally unhealthy days? Does the pattern appear consistent with what you would expect?

Ans: The poor group has the highest unhealthy mental days and the pattern appeared consistent with my expectation.

ggplot(brfss_dv, aes(x =gen_health , y = menthlth_days, fill = gen_health)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Mentally Unhealthy Days by General health status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x = "General health status",
    y = "Mentally Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

1c. (5 pts) Create a grouped bar chart or table showing the mean number of mentally unhealthy days by marital_status. Which marital status group has the highest mean? The lowest?

Ans: The highest mean is separated group and the lowest men is the widowed group.

# Compute means by marital status
mental_by_marital <- brfss_dv  |>
  filter(!is.na(menthlth_days), !is.na(marital_status)) |>
  group_by(marital_status) |>
  summarise(
    mean_days = mean(menthlth_days, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(mean_days))

print(mental_by_marital)

## # A tibble: 6 × 3
##   marital_status   mean_days     n
##   <fct>                <dbl> <int>
## 1 Separated             6.22   109
## 2 Unmarried couple      6.07   179
## 3 Never married         5.28   848
## 4 Divorced              4.49   622
## 5 Married               3.10  2708
## 6 Widowed               2.67   534

ggplot(mental_by_marital, aes(x = reorder(marital_status, mean_days), y = mean_days, fill = marital_status)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Mean mentally unhealthy days by marital status",
    x = NULL,
    y = "Mean days in past 30"
  ) +
  theme_minimal()

Task 2: The Naive Approach (10 points)

2a. (5 pts) Using the gen_health variable, create a numeric version coded as: Excellent = 1, Very good = 2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model: menthlth_days ~ gen_health_numeric. Report the coefficient and interpret it.

Ans: The coefficient on gen_health_numeric will be positive (roughly +2–3). Each one-unit increase in the numeric scale (i.e., moving one category “worse,” e.g., from Excellent to Very good) is associated with approximately that many additional mentally unhealthy days in the past 30, on average.

df <- brfss_dv |>
  mutate(gen_health_numeric = case_when(
    gen_health == "Excellent" ~ 1,
    gen_health == "Very good" ~ 2,
    gen_health == "Good"      ~ 3,
    gen_health == "Fair"      ~ 4,
    gen_health == "Poor"      ~ 5
  ))

model_2a <- lm(menthlth_days ~ gen_health_numeric, data = df)
summary(model_2a)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6718     0.2705  -2.484    0.013 *  
## gen_health_numeric   1.8578     0.1036  17.926   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared:  0.06041,    Adjusted R-squared:  0.06022 
## F-statistic: 321.3 on 1 and 4998 DF,  p-value: < 2.2e-16

2b. (5 pts) Now fit the same model but treating gen_health as a factor: menthlth_days ~ gen_health. Compare the two models. Why does the factor version use 4 coefficients instead of 1? Explain why the naive numeric approach may be misleading.

Ans: Why 4 coefficients instead of 1? The factor version creates k−1 = 4 dummy variables (one per level, minus the reference “Excellent”). Each dummy compares that group to the reference. The numeric approach forces the differences between consecutive levels to be equal (a “dose–response” assumption). This is misleading because the mental health gap between Excellent and Very good may differ substantially from the gap between Good and Fair — the factor model reveals this; the numeric model hides it.

df <- brfss_dv |>
  mutate(gen_health = factor(gen_health,
           levels = c("Excellent","Very good","Good","Fair","Poor")))

model_2b <- lm(menthlth_days ~ gen_health, data = df)
summary(model_2b)

## 
## Call:
## lm(formula = menthlth_days ~ gen_health, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7814  -4.0708  -2.7077  -0.1174  27.8826 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1174     0.2332   9.079  < 2e-16 ***
## gen_healthVery good   0.5903     0.2941   2.007   0.0448 *  
## gen_healthGood        1.9535     0.3082   6.337 2.54e-10 ***
## gen_healthFair        5.0624     0.4064  12.457  < 2e-16 ***
## gen_healthPoor        9.6640     0.6090  15.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.07259 
## F-statistic: 98.83 on 4 and 4995 DF,  p-value: < 2.2e-16

# Compare
AIC(model_2a, model_2b)

##          df      AIC
## model_2a  3 34555.43
## model_2b  6 34492.16

Task 3: Dummy Variable Regression with General Health (25 points)

3a. (5 pts) Fit the following model with gen_health as a factor:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health

Write out the fitted regression equation.

model_3a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
               data = df)
summary(model_3a)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.592982   0.630441  15.216  < 2e-16 ***
## age                 -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale            1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days        0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs           -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_healthVery good  0.789947   0.279661   2.825  0.00475 ** 
## gen_healthGood       1.843601   0.297260   6.202 6.03e-10 ***
## gen_healthFair       3.395283   0.417964   8.123 5.66e-16 ***
## gen_healthPoor       5.335347   0.682949   7.812 6.80e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

tidy(model_3a, conf.int = TRUE)

## # A tibble: 9 × 7
##   term                estimate std.error statistic  p.value conf.low conf.high
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)           9.59     0.630       15.2  3.82e-51   8.36     10.8   
## 2 age                  -0.0867   0.00598    -14.5  1.25e-46  -0.0984   -0.0749
## 3 sexFemale             1.73     0.205        8.40 5.90e-17   1.32      2.13  
## 4 physhlth_days         0.231    0.0162      14.3  1.59e-45   0.200     0.263 
## 5 sleep_hrs            -0.587    0.0766      -7.66 2.21e-14  -0.737    -0.436 
## 6 gen_healthVery good   0.790    0.280        2.82 4.75e- 3   0.242     1.34  
## 7 gen_healthGood        1.84     0.297        6.20 6.03e-10   1.26      2.43  
## 8 gen_healthFair        3.40     0.418        8.12 5.66e-16   2.58      4.21  
## 9 gen_healthPoor        5.34     0.683        7.81 6.80e-15   4.00      6.67

3b. (10 pts) Interpret every dummy variable coefficient for gen_health in plain language. Be specific about the reference group, the direction and magnitude of each comparison, and include the phrase “holding all other variables constant.”

Ans: gen_healthVery good- People who rate their health as Very good report, on average, β₅ more mentally unhealthy days than those with Excellent health, holding all other variables constant. gen_healthGood - People reporting Good health have, on average, β₆ more mentally unhealthy days than the Excellent group, holding all other variables constant. gen_healthFair- People reporting Fair health have, on average, β₇ more mentally unhealthy days than the Excellent group, holding all other variables constant. gen_healthPoor- People reporting Poor health have, on average, β₈ more mentally unhealthy days than those with Excellent health, holding all other variables constant — this is typically the largest positive coefficient.

3c. (10 pts) Create a coefficient plot (forest plot) showing the estimated coefficients and 95% confidence intervals for the gen_health dummy variables only. Which group differs most from the reference group?

# ── 3c: Coefficient / forest plot ────────────────────────────────────────────
coef_df <- tidy(model_3a, conf.int = TRUE) |>
  filter(grepl("gen_health", term)) |>
  mutate(term = gsub("gen_health", "", term),
         term = factor(term, levels = c("Very good","Good","Fair","Poor")))

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 0.8) +
  geom_point(size = 3, color = "#534AB7") +
  labs(
    title = "Effect of general health on mentally unhealthy days",
    subtitle = "Reference group: Excellent health",
    x = "Estimated additional days (vs. Excellent)",
    y = "General health category"
  ) +
  theme_minimal(base_size = 13)

## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `height` was translated to `width`.

Task 4: Changing the Reference Group (15 points)

4a. (5 pts) Use relevel() to change the reference group for gen_health to “Good.” Refit the model from Task 3a.

df <- brfss_dv |>
  mutate(gen_health_good_ref = relevel(gen_health, ref = "Good"))

model_4a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_good_ref,
               data = df)
summary(model_4a)

## 
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs + 
##     gen_health_good_ref, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5175  -3.5549  -1.6999   0.4316  31.3631 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  11.436584   0.629825  18.158  < 2e-16 ***
## age                          -0.086672   0.005982 -14.489  < 2e-16 ***
## sexFemale                     1.725379   0.205472   8.397  < 2e-16 ***
## physhlth_days                 0.231420   0.016177  14.306  < 2e-16 ***
## sleep_hrs                    -0.586595   0.076572  -7.661 2.21e-14 ***
## gen_health_good_refExcellent -1.843601   0.297260  -6.202 6.03e-10 ***
## gen_health_good_refVery good -1.053654   0.258126  -4.082 4.54e-05 ***
## gen_health_good_refFair       1.551682   0.386128   4.019 5.94e-05 ***
## gen_health_good_refPoor       3.491746   0.650560   5.367 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1681 
## F-statistic: 127.3 on 8 and 4991 DF,  p-value: < 2.2e-16

4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?

Ans: The coefficients for age, sex, physhlth_days, and sleep_hrs are identical across both models. The reference group only changes the intercept and the dummy variable coefficients (now expressing differences from Good rather than Excellent). The continuous predictors are unaffected because they measure partial slopes that don’t depend on the reference category.

tidy(model_3a) |> filter(!grepl("gen_health|Intercept", term))

## # A tibble: 4 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 age            -0.0867   0.00598    -14.5  1.25e-46
## 2 sexFemale       1.73     0.205        8.40 5.90e-17
## 3 physhlth_days   0.231    0.0162      14.3  1.59e-45
## 4 sleep_hrs      -0.587    0.0766      -7.66 2.21e-14

tidy(model_4a) |> filter(!grepl("gen_health|Intercept", term))

## # A tibble: 4 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 age            -0.0867   0.00598    -14.5  1.25e-46
## 2 sexFemale       1.73     0.205        8.40 5.90e-17
## 3 physhlth_days   0.231    0.0162      14.3  1.59e-45
## 4 sleep_hrs      -0.587    0.0766      -7.66 2.21e-14

4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.

pred_3a <- predict(model_3a)
pred_4a <- predict(model_4a)

cor(pred_3a, pred_4a)   # Should be exactly 1.0

## [1] 1

all.equal(pred_3a, pred_4a)  # Should be TRUE

## [1] TRUE

Task 5: Partial F-Test for General Health (20 points)

5a. (5 pts) Fit a reduced model without gen_health:

menthlth_days ~ age + sex + physhlth_days + sleep_hrs

Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).

# ── 5a: Reduced model ────────────────────────────────────────────────────────
model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = df)

summary(model_reduced)$r.squared

## [1] 0.1521948

summary(model_reduced)$adj.r.squared

## [1] 0.1515159

summary(model_3a)$r.squared

## [1] 0.1694246

summary(model_3a)$adj.r.squared

## [1] 0.1680933

5b. (10 pts) Conduct a partial F-test using anova() to test whether gen_health as a whole significantly improves the model. State the null and alternative hypotheses. Report the F-statistic, degrees of freedom, and p-value. State your conclusion.

Hypotheses:

H₀: The four gen_health dummy coefficients are all equal to zero (general health adds no predictive power beyond the other variables). H₁: At least one gen_health dummy coefficient is nonzero.

Expected result: The F-statistic will be very large (likely F > 100) with 4 numerator df and a p-value < 0.0001, leading us to reject H₀ — general health status significantly improves the model.

anova(model_reduced, model_3a)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4995 264715                                  
## 2   4991 259335  4    5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5c. (5 pts) Use car::Anova() with type = "III" on the full model. Compare the result for gen_health to your partial F-test. Are they consistent?

Ans:The gen_health row in the Type III ANOVA will report the same F-statistic and p-value as the partial F-test from anova(), because both test the joint significance of all gen_health dummies given all other predictors are in the model. They are consistent by design.

library(car)
Anova(model_3a, type = "III")

## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df F value    Pr(>F)    
## (Intercept)    12031    1 231.536 < 2.2e-16 ***
## age            10908    1 209.926 < 2.2e-16 ***
## sex             3664    1  70.512 < 2.2e-16 ***
## physhlth_days  10634    1 204.654 < 2.2e-16 ***
## sleep_hrs       3049    1  58.687 2.207e-14 ***
## gen_health      5380    4  25.884 < 2.2e-16 ***
## Residuals     259335 4991                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task 6: Public Health Interpretation (15 points)

6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:

Identify which general health groups differ most from the reference
State the direction and approximate magnitude of the association
Appropriately acknowledge the cross-sectional nature of the data
Not use any statistical jargon (no “significant,” “coefficient,” “p-value,” “confidence interval”)

Ans: People who describe their general health as Poor or Fair tend to experience substantially more mentally unhealthy days than those who consider themselves in Excellent health — in some cases nearly two weeks more per month. Even after accounting for age, sex, physical health, and sleep, this pattern holds: mental and general health are closely intertwined. People in Good or Very good health fall in between, with the gap growing larger as self-rated health worsens. Because this data captures a single point in time, we cannot say whether poor general health causes poorer mental health or vice versa — the relationship likely runs in both directions.

6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.

Ans: General health status is likely the stronger predictor of mental health days. General health is a holistic self-assessment that implicitly incorporates chronic illness, pain, and daily functioning — all of which are tightly coupled with mental wellbeing. Education, by contrast, is a distal socioeconomic factor whose effect on mental health is more indirect and attenuated. To formally compare them, you would look at the incremental R² each variable adds when entered last (i.e., partial R²), or compare standardized coefficients. If building a final model, you would include both if theory and sample size permit, but if parsimony is required, the variable with greater incremental explained variance — likely gen_health — would take priority. You would also check whether the two variables interact or suppress one another before making a final decision.

End of Lab Activity