Model Selection

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(leaps)

## Warning: package 'leaps' was built under R version 4.5.3

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

brfss_ms <- readRDS(
  "C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_ms_2020.rds"
)

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

# Maximum model
model_max <- lm(physhlth_days ~ ., data = brfss_ms)

summary(model_max)

## 
## Call:
## lm(formula = physhlth_days ~ ., data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7449  -2.3217  -0.9099   0.0283  30.2398 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.690206   0.855629   3.144 0.001676 ** 
## menthlth_days              0.147208   0.012117  12.149  < 2e-16 ***
## sleep_hrs                 -0.192971   0.067288  -2.868 0.004150 ** 
## age                        0.018041   0.005472   3.297 0.000985 ***
## sexFemale                 -0.188895   0.182044  -1.038 0.299491    
## educationHS graduate       0.250794   0.429738   0.584 0.559518    
## educationSome college      0.346303   0.432417   0.801 0.423254    
## educationCollege graduate  0.333607   0.435715   0.766 0.443918    
## exerciseYes               -1.286634   0.237391  -5.420 6.24e-08 ***
## gen_healthVery good        0.437297   0.245340   1.782 0.074743 .  
## gen_healthGood             1.591341   0.265127   6.002 2.08e-09 ***
## gen_healthFair             7.017579   0.368212  19.059  < 2e-16 ***
## gen_healthPoor            20.437395   0.546860  37.372  < 2e-16 ***
## income_cat                -0.181654   0.050331  -3.609 0.000310 ***
## bmi                        0.013017   0.014468   0.900 0.368323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.3843 
## F-statistic: 223.9 on 14 and 4985 DF,  p-value: < 2.2e-16

# Extract criteria
R2_max <- summary(model_max)$r.squared
AdjR2_max <- summary(model_max)$adj.r.squared
AIC_max <- AIC(model_max)
BIC_max <- BIC(model_max)

R2_max; AdjR2_max; AIC_max; BIC_max

## [1] 0.3860047

## [1] 0.3842803

## [1] 32645.79

## [1] 32750.06

Ans: \(R^2\) = 0.3860047, Adjusted \(R^2\) = 0.3842803, AIC = 32645.79, BIC = 32750.06

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

model_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

summary(model_min)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + age, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7405  -3.3856  -2.0996  -0.3782  29.9082 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.698341   0.364082  -4.665 3.17e-06 ***
## menthlth_days  0.323680   0.013478  24.015  < 2e-16 ***
## age            0.071605   0.006239  11.476  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.58 on 4997 degrees of freedom
## Multiple R-squared:  0.115,  Adjusted R-squared:  0.1146 
## F-statistic: 324.7 on 2 and 4997 DF,  p-value: < 2.2e-16

R2_min <- summary(model_min)$r.squared
AdjR2_min <- summary(model_min)$adj.r.squared
AIC_min <- AIC(model_min)
BIC_min <- BIC(model_min)

R2_min; AdjR2_min; AIC_min; BIC_min

## [1] 0.1150016

## [1] 0.1146474

## [1] 34449.78

## [1] 34475.85

Ans: \(R^2\) = 0.1150016, Adjusted \(R^2\) = 0.1146474, AIC = 34449.78, BIC = 34475.85

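Ans: The maximum model has the higher R² (0.386 vs. 0.115). This is expected, because it contains both predictors of the minimal model plus seven more, and R² cannot decrease when predictors are added. More informatively, its Adjusted R² is also higher (0.384 vs. 0.115), and its AIC and BIC are both lower (32645.79 vs. 34449.78 and 32750.06 vs. 34475.85). The maximum model is therefore better supported even after penalizing it for its extra parameters.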

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

Ans: R² is a poor criterion because it never decreases when predictors are added, regardless of whether those predictors are meaningful, so it always favors the larger of two nested models. In contrast, Adjusted R² penalizes each additional predictor, and AIC and BIC explicitly balance model fit against model complexity, which helps prevent overfitting. BIC is the strictest of the three: its penalty per parameter is ln(n), about 8.5 for n = 5000, compared with 2 for AIC, so it favors more parsimonious models.
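To make the penalties concrete, the four criteria can be written as below. This is a sketch using standard definitions rather than notation taken from the lab handout: \(n\) is the sample size, \(p\) the number of estimated slopes, \(k\) the number of parameters R counts for an lm fit (all coefficients plus the residual variance), and \(\hat{L}\) the maximized likelihood. The numbers plugged in are the ones reported in 1a.

```latex
\begin{aligned}
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \\
R^2_{\mathrm{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \\
\mathrm{AIC} &= -2\ln\hat{L} + 2k \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n \\[6pt]
% Check against the maximum model: n = 5000, p = 14, k = 16
R^2_{\mathrm{adj}} &= 1 - (1 - 0.3860047)\,\frac{4999}{4985} \approx 0.3843 \\
\mathrm{BIC} - \mathrm{AIC} &= k(\ln n - 2) = 16\,(8.517 - 2) \approx 104.3
\end{aligned}
```

Both checks agree with the R output in 1a: Adjusted R² is 0.3842803, and 32750.06 - 32645.79 = 104.27.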

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

library(leaps)

regfit <- regsubsets(physhlth_days ~ ., data = brfss_ms, nvmax = 15)
reg_summary <- summary(regfit)

plot(reg_summary$adjr2, type = "b",
     xlab = "Number of Variables",
     ylab = "Adjusted R^2")

Ans: Adjusted R² plateaus at a model size of about 8 variables.

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

plot(reg_summary$bic, type = "b",
     xlab = "Number of Variables",
     ylab = "BIC")

which.min(reg_summary$bic)

## [1] 8

Ans: BIC is minimized at a model size of 8 variables.

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

best_bic_index <- which.min(reg_summary$bic)
coef(regfit, best_bic_index)

##    (Intercept)  menthlth_days      sleep_hrs            age    exerciseYes 
##     3.45633850     0.14723660    -0.19857360     0.01824813    -1.30319973 
## gen_healthGood gen_healthFair gen_healthPoor     income_cat 
##     1.34888847     6.77927209    20.19749774    -0.16542425

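Ans: The BIC-best model has 8 terms: menthlth_days, sleep_hrs, age, exerciseYes, income_cat, and three gen_health indicators (Good, Fair, Poor). Because regsubsets() treats each dummy variable as a separate candidate, the Very good indicator was dropped, so Excellent and Very good together act as the reference group. The coefficients are: intercept 3.456, menthlth_days 0.147, sleep_hrs -0.199, age 0.018, exerciseYes -1.303, gen_healthGood 1.349, gen_healthFair 6.779, gen_healthPoor 20.197, and income_cat -0.165.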
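The question also asks for the model to be fitted explicitly with lm(). A minimal sketch of one way to do that is shown below. It runs on a small simulated stand-in (`demo_df`, a hypothetical data frame with made-up values) so that it is self-contained; with the lab data one would pass `brfss_ms` instead. The coefficients it prints are not the lab's results.

```r
# Simulated stand-in for brfss_ms (hypothetical values, for illustration only)
set.seed(553)
n <- 300
demo_df <- data.frame(
  menthlth_days = rpois(n, 4),
  sleep_hrs     = round(rnorm(n, 7, 1.3)),
  age           = sample(18:80, n, replace = TRUE),
  exercise      = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  gen_health    = factor(
    sample(c("Excellent", "Very good", "Good", "Fair", "Poor"), n, replace = TRUE),
    levels = c("Excellent", "Very good", "Good", "Fair", "Poor")
  ),
  income_cat    = sample(1:8, n, replace = TRUE)
)
demo_df$physhlth_days <- pmin(30, pmax(0, round(
  2 + 0.15 * demo_df$menthlth_days - 0.2 * demo_df$sleep_hrs +
    5 * (demo_df$gen_health == "Fair") + 15 * (demo_df$gen_health == "Poor") +
    rnorm(n, sd = 4)
)))

# Fit exactly the columns regsubsets() selected: the Good, Fair, and Poor
# indicators enter individually, so Excellent and Very good together form
# the reference group.
fit_bic_best <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
    I(gen_health == "Good") + I(gen_health == "Fair") +
    I(gen_health == "Poor") + income_cat,
  data = demo_df
)

coef(fit_bic_best)
```

Many analysts would instead keep gen_health as a whole factor with all four indicators, which is the form step() retains in Task 3.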

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

Ans: BIC tends to select simpler models, while Adjusted R² may include more predictors. I would prefer the BIC model because it reduces overfitting and improves interpretability, which is especially important when generalizing results.
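Whether the two criteria pick the same model can be checked directly by comparing which.max() of Adjusted R² with which.min() of BIC. The sketch below does this on simulated stand-in data (`demo_df`, `x1` to `x6`, and `y` are hypothetical names) and needs the leaps package. With the lab data the same comparison would use `reg_summary` and `regfit` from 2a.

```r
library(leaps)

# Simulated stand-in data (hypothetical): x1-x3 matter, x4-x6 are pure noise
set.seed(1)
n <- 500
demo_df <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(demo_df) <- paste0("x", 1:6)
demo_df$y <- 1 + 2 * demo_df$x1 - 1 * demo_df$x2 + 0.5 * demo_df$x3 + rnorm(n)

demo_fit <- regsubsets(y ~ ., data = demo_df, nvmax = 6)
demo_sum <- summary(demo_fit)

size_adjr2 <- which.max(demo_sum$adjr2)  # size with the highest Adjusted R^2
size_bic   <- which.min(demo_sum$bic)    # size with the lowest BIC

# Variables in each winning model (intercept dropped)
vars_adjr2 <- names(coef(demo_fit, size_adjr2))[-1]
vars_bic   <- names(coef(demo_fit, size_bic))[-1]

c(adjr2_size = size_adjr2, bic_size = size_bic)
setdiff(vars_adjr2, vars_bic)  # terms kept only by the Adjusted R^2 choice
```

If the two sizes differ, the setdiff() line lists the extra terms that Adjusted R² keeps and BIC drops.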

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

model_full <- lm(physhlth_days ~ ., data = brfss_ms)

backward <- step(model_full, direction = "backward")

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
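The AIC values in this trace (18454.4 for the full model) do not match AIC(model_max) = 32645.79 from Task 1, even though both describe the same model. The reason is that step() uses extractAIC(), which for lm fits leaves out an additive constant that AIC() includes. The sketch below checks the size of that constant on the built-in mtcars data, which is used here only as a stand-in.

```r
# Built-in data, used only to illustrate the constant offset
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

aic_full <- AIC(fit)             # full log-likelihood version
aic_step <- extractAIC(fit)[2]   # version printed in the step() trace

# The gap depends only on n, so model rankings are unaffected
offset <- n * (log(2 * pi) + 1) + 2
all.equal(aic_full - aic_step, offset)

# Same offset for the lab's sample size, n = 5000
5000 * (log(2 * pi) + 1) + 2
```

For n = 5000 the offset is about 14191.39, which accounts for the gap between 18454.40 and 32645.79.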

summary(backward)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

Ans: Backward elimination removed education, bmi, and sex, in that order. The variables that remain are menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

model_null <- lm(physhlth_days ~ 1, data = brfss_ms)

forward <- step(model_null,
                scope = formula(model_full),
                direction = "forward")

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

summary(forward)

## 
## Call:
## lm(formula = physhlth_days ~ gen_health + menthlth_days + exercise + 
##     income_cat + age + sleep_hrs, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## age                  0.01740    0.00544   3.198  0.00139 ** 
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

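Ans: Yes. Forward selection arrives at the same model as backward elimination: menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat (AIC = 18446.98 in both traces).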

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

library(broom)

get_metrics <- function(model) {
  c(
    vars = length(coef(model)) - 1,
    adjR2 = summary(model)$adj.r.squared,
    AIC = AIC(model),
    BIC = BIC(model)
  )
}

rbind(
  backward = get_metrics(backward),
  forward = get_metrics(forward)
)

##          vars   adjR2      AIC      BIC
## backward    9 0.38458 32638.37 32710.06
## forward     9 0.38458 32638.37 32710.06
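The table above covers backward and forward selection only. A stepwise (both-direction) fit could be added along the following lines. This is a sketch on the built-in mtcars data so that it runs on its own; with the lab data the same calls would start from `model_null` and `model_full`. `metrics_row` is a hypothetical helper, and it counts model terms rather than coefficients, so a factor such as gen_health would count once.

```r
# Built-in data as a stand-in; swap in brfss_ms and physhlth_days for the lab
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

backward_fit <- step(full_fit, direction = "backward", trace = 0)
forward_fit  <- step(null_fit, scope = formula(full_fit),
                     direction = "forward", trace = 0)
# Stepwise: start from the null model, allow both additions and removals
stepwise_fit <- step(null_fit,
                     scope = list(lower = ~ 1, upper = formula(full_fit)),
                     direction = "both", trace = 0)

# Hypothetical helper: one row of criteria per model
metrics_row <- function(model) {
  data.frame(
    n_terms = length(attr(terms(model), "term.labels")),
    adjR2   = summary(model)$adj.r.squared,
    AIC     = AIC(model),
    BIC     = BIC(model)
  )
}

comparison <- rbind(
  backward = metrics_row(backward_fit),
  forward  = metrics_row(forward_fit),
  stepwise = metrics_row(stepwise_fit)
)
comparison
```

Setting trace = 0 suppresses the step-by-step printout, which keeps the knitted document shorter once the selection path has already been shown.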

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

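Ans: (1) It can select variables based on random noise in the sample (overfitting). (2) The results are unstable: small changes in the data can lead to a different selected model. (3) It ignores causal structure and confounding, because it focuses only on prediction. The third concern is the most relevant for epidemiological research, where the goal is usually a valid estimate of an exposure-outcome association rather than prediction alone.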

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

model_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
summary(model_crude)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

Ans: The crude sleep coefficient is -0.632 (SE = 0.083, p < 0.001), so each additional hour of sleep is associated with about 0.63 fewer physically unhealthy days.

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

model_full_assoc <- lm(physhlth_days ~ sleep_hrs + ., data = brfss_ms)

beta_full <- coef(model_full_assoc)["sleep_hrs"]

# Use term labels rather than coefficient names so each factor is dropped as a whole
vars <- attr(terms(model_full_assoc), "term.labels")
vars <- vars[vars != "sleep_hrs"]

results <- data.frame()

for (v in vars) {
  
  remaining_vars <- setdiff(vars, v)
  
  formula_temp <- as.formula(
    paste("physhlth_days ~ sleep_hrs +",
          paste(remaining_vars, collapse = " + "))
  )
  
  model_temp <- lm(formula_temp, data = brfss_ms)
  beta_temp <- coef(model_temp)["sleep_hrs"]
  
  change <- abs((beta_temp - beta_full) / beta_full) * 100
  
  results <- rbind(results,
                   data.frame(variable = v,
                              percent_change = change))
}

results

##                 variable percent_change
## sleep_hrs  menthlth_days     49.9894579
## sleep_hrs1           age     14.7190080
## sleep_hrs2           sex      0.3961030
## sleep_hrs3     education      0.3407284
## sleep_hrs4      exercise      1.4096399
## sleep_hrs5    gen_health     86.1809956
## sleep_hrs6    income_cat      0.3497520
## sleep_hrs7           bmi      1.0431711

Ans: The sleep coefficient in the maximum model is -0.193, so the 10% interval runs from about -0.212 to -0.174. Removing menthlth_days (50.0%), age (14.7%), or gen_health (86.2%) moves the coefficient by more than 10%, so these three meet the 10% rule for confounding. Removing sex, education, exercise, income_cat, or bmi changes it by less than 2% each.

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

model_final <- lm(physhlth_days ~ sleep_hrs + age + bmi, data = brfss_ms)
summary(model_final)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs + age + bmi, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5547  -3.9118  -2.7218  -0.7358  30.8569 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.379449   0.835764   1.651   0.0989 .  
## sleep_hrs   -0.694039   0.082779  -8.384  < 2e-16 ***
## age          0.060176   0.006507   9.248  < 2e-16 ***
## bmi          0.131155   0.017354   7.558 4.85e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.9 on 4996 degrees of freedom
## Multiple R-squared:  0.03897,    Adjusted R-squared:  0.03839 
## F-statistic: 67.53 on 3 and 4996 DF,  p-value: < 2.2e-16

confint(model_final)

##                   2.5 %      97.5 %
## (Intercept) -0.25901497  3.01791344
## sleep_hrs   -0.85632217 -0.53175540
## age          0.04741919  0.07293281
## bmi          0.09713384  0.16517665

Ans: In the model as fitted, which adjusts for age and bmi, the sleep coefficient is -0.694 (95% CI: -0.856 to -0.532).

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Ans: Automated selection methods such as stepwise regression are designed for prediction, not for estimating the association between a specific exposure and an outcome. They can drop important confounders simply because those variables do not improve model fit enough, which biases the estimated exposure coefficient. In this analysis, covariates were instead evaluated by how much removing each one changed the sleep coefficient (the 10% rule), which targets confounding directly. Manual confounder assessment therefore gives a more valid estimate of the association between sleep and physical health.
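The change-in-estimate screening described here can be sketched end to end as below. The code uses simulated stand-in data (`demo_df`, with hypothetical variables `x`, `y`, `z1`, `z2`), built so that `z1` confounds the exposure-outcome association and `z2` does not. `pct_change` is a hypothetical helper, not a function from any package.

```r
# Simulated stand-in (hypothetical): z1 confounds the x-y association, z2 does not
set.seed(42)
n  <- 2000
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- 0.8 * z1 + rnorm(n)            # exposure depends on z1
y  <- 1 - 0.5 * x + 1.5 * z1 + 0.3 * z2 + rnorm(n)
demo_df <- data.frame(y, x, z1, z2)

covars    <- c("z1", "z2")
full_fit  <- lm(reformulate(c("x", covars), response = "y"), data = demo_df)
beta_full <- unname(coef(full_fit)["x"])

# 10% interval around the fully adjusted exposure coefficient
interval_10 <- sort(beta_full * c(0.9, 1.1))
interval_10

# Hypothetical helper: % change in the exposure coefficient when `v` is dropped
pct_change <- function(v) {
  reduced <- lm(reformulate(c("x", setdiff(covars, v)), response = "y"),
                data = demo_df)
  abs((unname(coef(reduced)["x"]) - beta_full) / beta_full) * 100
}

screen <- data.frame(
  variable         = covars,
  percent_change   = unname(vapply(covars, pct_change, numeric(1))),
  stringsAsFactors = FALSE
)
screen$confounder <- screen$percent_change > 10

# Final associative model: exposure plus the flagged covariates only
keep      <- screen$variable[screen$confounder]
final_fit <- lm(reformulate(c("x", keep), response = "y"), data = demo_df)
confint(final_fit)["x", ]
```

Applied to the table in 4b, the same threshold flags menthlth_days (50.0%), age (14.7%), and gen_health (86.2%), while bmi (1.0%) falls below it.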

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

Ans: No, the two models include different variables. The predictive model (the AIC-best model from Task 3) contains menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat, while the associative model contains only sleep_hrs, age, and bmi. The sleep coefficient is not similar: about -0.20 in the predictive model versus -0.69 in the associative model. The predictive model was chosen to maximize fit, so it keeps any variable that improves prediction. In particular it adjusts for menthlth_days and gen_health, and the table in 4b shows that removing them changes the sleep coefficient by about 50% and 86%, respectively. The associative model was built around the sleep coefficient itself, so the two estimates answer different questions.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Do not use statistical jargon.

Ans: After accounting for age and body mass index (BMI), people who slept longer reported fewer days of poor physical health. Each additional hour of sleep was linked to about 0.7 fewer physically unhealthy days in the past 30 days (plausible range: 0.5 to 0.9 fewer days). This pattern held when comparing people of similar age and BMI. The difference is modest for any one person but meaningful across a whole population. However, because sleep and health were measured at the same point in time, these results cannot show whether more sleep improves physical health or whether poor health disrupts sleep.

End of Lab Activity