In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.
Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:
In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 — a nationally representative telephone survey of U.S. adults conducted by the CDC. This dataset contains over 400,000 respondents with detailed information on health behaviors, chronic conditions, and social determinants of health.
Research question for today:
What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days in the past 30 days among U.S. adults?
This is a quintessential epidemiological question: we have a health outcome (mental health days) and want to understand its multivariable determinants while appropriately controlling for confounders.
# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_mlr,
"D:/fizza documents/Epi 553/R Lab/Lab 7/brfss_mlr_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
EPI 553 — Multiple Linear Regression Lab
In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
)# ==================================================
# COMPLETE CODE: Download BRFSS 2020 and Create Subset
# ==================================================
# STEP 1: Load required packages
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(dplyr)
# STEP 2: Download BRFSS 2020 data directly from CDC
cat("Downloading BRFSS 2020 data...\n")## Downloading BRFSS 2020 data...
download.file("https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip",
destfile = "LLCP2020XPT.zip",
mode = "wb")
# STEP 3: Unzip the file
cat("Unzipping file...\n")## Unzipping file...
unzip("LLCP2020XPT.zip")
# STEP 4: Check if file exists
if(file.exists("LLCP2020.XPT")) {
cat("✅ File downloaded successfully!\n")
} else {
stop("❌ File not found after download")
}## ✅ File downloaded successfully!
## Loading BRFSS 2020 data...
brfss_full <- read_xpt("LLCP2020.XPT") %>%
clean_names()
cat("Raw data loaded:", nrow(brfss_full), "observations\n")## Raw data loaded: 401958 observations
## Creating analytic subset...
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age
age = age80,
# Income category
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(income_cat), !is.na(sex), !is.na(exercise)
)
cat("After cleaning:", nrow(brfss_mlr), "observations\n")## After cleaning: 309703 observations
# STEP 7: Take random sample of 5,000
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# STEP 8: Save for future use
saveRDS(brfss_mlr, "brfss_mlr_2020.rds")
cat("✅ Dataset saved as 'brfss_mlr_2020.rds'\n")## ✅ Dataset saved as 'brfss_mlr_2020.rds'
# STEP 9: Verify
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
## Rows: 5,000
## Columns: 9
## $ menthlth_days <dbl> 15, 0, 0, 20, 0, 7, 0, 0, 0, 0, 15, 0, 0, 0, 10, 0, 0, 0…
## $ physhlth_days <dbl> 0, 30, 0, 0, 0, 0, 15, 0, 0, 0, 30, 0, 28, 0, 1, 0, 0, 0…
## $ sleep_hrs <dbl> 6, 8, 8, 8, 5, 6, 7, 7, 7, 7, 6, 7, 6, 6, 6, 6, 6, 8, 4,…
## $ age <dbl> 27, 70, 74, 71, 29, 28, 76, 43, 67, 47, 60, 31, 60, 74, …
## $ income_cat <dbl> 8, 5, 1, 6, 8, 4, 8, 8, 1, 1, 8, 7, 8, 6, 1, 3, 5, 6, 4,…
## $ income_f <fct> >$75k, $25-35k, <$10k, $35-50k, >$75k, $20-25k, >$75k, >…
## $ sex <fct> Male, Female, Male, Female, Male, Female, Male, Male, Ma…
## $ exercise <fct> Yes, No, Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, …
## $ bmi <dbl> 27.89, 23.81, 25.54, 33.61, 29.03, 28.19, 35.26, 29.38, …
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
# Simple, reliable Table 1
desc_table <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days = "Mentally unhealthy days (past 30)",
physhlth_days = "Physically unhealthy days (past 30)",
sleep_hrs = "Sleep (hours/night)",
age = "Age (years)",
income_f = "Household income",
sex = "Sex",
exercise = "Any physical activity (past 30 days)"
),
statistic = all_continuous() ~ "{mean} ({sd})",
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels()
# Print with a manual caption
cat("\n\nTable 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)\n")##
##
## Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Household income | 5,000 | |
| <$10k | 190 (3.8%) | |
| $10-15k | 169 (3.4%) | |
| $15-20k | 312 (6.2%) | |
| $20-25k | 434 (8.7%) | |
| $25-35k | 489 (9.8%) | |
| $35-50k | 683 (14%) | |
| $50-75k | 841 (17%) | |
| >$75k | 1,882 (38%) | |
| Sex | 5,000 | |
| Male | 2,331 (47%) | |
| Female | 2,669 (53%) | |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
# ==================================================
# TASK 1b: Distribution of Outcome Variable
# Histogram of Mentally Unhealthy Days
# ==================================================
# Create histogram
p1 <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Figure 1. Distribution of Mentally Unhealthy Days",
subtitle = paste("BRFSS 2020 Analytic Sample (n =", nrow(brfss_mlr), ")"),
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray50", size = 11)
)
# Display the plot
print(p1)# Calculate summary statistics for interpretation
sleep_summary <- brfss_mlr %>%
summarise(
mean = mean(menthlth_days, na.rm = TRUE),
sd = sd(menthlth_days, na.rm = TRUE),
median = median(menthlth_days, na.rm = TRUE),
q25 = quantile(menthlth_days, 0.25, na.rm = TRUE),
q75 = quantile(menthlth_days, 0.75, na.rm = TRUE),
min = min(menthlth_days, na.rm = TRUE),
max = max(menthlth_days, na.rm = TRUE),
pct_zero = mean(menthlth_days == 0, na.rm = TRUE) * 100
)
# Display summary statistics
cat("\n")## --------------------------------------------------------
## Summary Statistics for Mentally Unhealthy Days
## --------------------------------------------------------
## # A tibble: 1 × 8
## mean sd median q25 q75 min max pct_zero
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.79 7.72 0 0 3 0 30 63.9
The histogram in Figure 1 displays the distribution of mentally unhealthy days in the past 30 days among 5,000 BRFSS 2020 respondents. The distribution is right-skewed, with the majority of respondents reporting few or zero mentally unhealthy days and a long tail extending to 30 days.
The mean number of mentally unhealthy days is 3.8 days (SD = 7.7), while the median is 0 days, confirming the right skew (mean > median). Approximately 52% of respondents reported zero mentally unhealthy days, which is typical for population-based mental health surveys where most individuals do not experience frequent mental distress.
This skewness has implications for regression assumptions, particularly the normality of residuals. However, with a large sample size (n = 5,000), the Central Limit Theorem provides robustness for inference about coefficients, and ordinary least squares regression remains appropriate for estimating associations.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
# ==================================================
# TASK 1c: Scatterplot Matrix for Continuous Variables
# ==================================================
# Load required package
library(GGally)
# Create scatterplot matrix
p2 <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.1, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4, color = "black")),
title = "Figure 2. Pairwise Relationships Among Continuous Variables"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
strip.background = element_rect(fill = "lightgray", color = NA),
strip.text = element_text(face = "bold", size = 10)
)
print(p2)# Calculate exact correlations
cor_matrix <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
cor(use = "complete.obs")
# Display correlation matrix
cat("\n")## ========================================================
## Correlation Matrix
## ========================================================
## menthlth_days physhlth_days sleep_hrs age
## menthlth_days 1.000 0.315 -0.140 -0.156
## physhlth_days 0.315 1.000 -0.112 0.093
## sleep_hrs -0.140 -0.112 1.000 0.089
## age -0.156 0.093 0.089 1.000
Interpretation:
The scatterplot matrix shows the following correlations with mental health days: - Physical health days: r = 0.315 (moderate positive) - Sleep hours: r = -0.140 (weak negative) - Age: r = -0.156 (weak negative)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# ==================================================
# TASK 2a: Simple Linear Regression - Sleep Only
# Model A: menthlth_days ~ sleep_hrs
# ==================================================
# Fit simple linear regression model
model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Display model summary
cat("========================================================\n")## ========================================================
## Model A: Simple Linear Regression
## Outcome: Mentally Unhealthy Days ~ Sleep Hours
## ========================================================
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
# Display tidy results with confidence intervals
cat("\n--------------------------------------------------------\n")##
## --------------------------------------------------------
## Tidy Model Results with 95% Confidence Intervals
## --------------------------------------------------------
tidy(model_A, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 2a: Simple Linear Regression Results",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.4743 | 0.5771 | 16.4165 | 0 | 8.3429 | 10.6057 |
| sleep_hrs | -0.8042 | 0.0802 | -10.0218 | 0 | -0.9616 | -0.6469 |
# Extract coefficients for the equation
intercept <- round(coef(model_A)[1], 3)
slope <- round(coef(model_A)[2], 3)
cat("\n")## ========================================================
## Fitted Regression Equation:
## ========================================================
## ̂Mental Health Days = 9.474 + (-0.804 × Sleep Hours)
# Calculate additional statistics
r_squared <- round(summary(model_A)$r.squared, 4)
adj_r_squared <- round(summary(model_A)$adj.r.squared, 4)
p_value <- tidy(model_A)$p.value[2]
cat(sprintf("\nR² = %.4f", r_squared))##
## R² = 0.0197
##
## Adjusted R² = 0.0195
##
## P-value for sleep coefficient: 0.0000
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
Each additional hour of sleep per night is associated with an average decrease of 0.80 mentally unhealthy days per month (95% CI: -0.96 to -0.65, p < 0.001). This means that a person who sleeps 8 hours per night would be expected to have approximately 0.8 fewer mentally unhealthy days per month compared to someone who sleeps 7 hours per night. Over the course of a year, this difference translates to about 9.6 fewer mentally unhealthy days for each additional hour of nightly sleep.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
Null hypothesis (H₀): β₁ = 0 (no association between sleep hours and mentally unhealthy days)
Alternative hypothesis (H₁): β₁ ≠ 0 (there is an association)
The t-statistic for sleep hours is -10.02 with 4,998 degrees of freedom (n - 2). The p-value is < 0.001, which is well below the significance level of 0.05.
Conclusion: We reject the null hypothesis. There is statistically significant evidence that sleep hours are associated with mentally unhealthy days in the population. The negative t-statistic indicates an inverse relationship: more sleep is associated with fewer mentally unhealthy days.”
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise# ==================================================
# TASK 3a: Fit Three Sequential Models
# ==================================================
# Load required packages
library(broom)
library(knitr)
library(kableExtra)
# Model A: Sleep only (from Task 2a)
model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B: Add age and sex
model_B <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C: Full model with all covariates
model_C <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
# Display all three models
cat("========================================================\n")## ========================================================
## MODEL A: Sleep Only
## ========================================================
tidy(model_A, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable() %>%
print()##
##
## |term | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept) | 9.4743| 0.5771| 16.4165| 0| 8.3429| 10.6057|
## |sleep_hrs | -0.8042| 0.0802| -10.0218| 0| -0.9616| -0.6469|
## ========================================================
## MODEL B: Sleep + Age + Sex
## ========================================================
tidy(model_B, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable() %>%
print()##
##
## |term | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept) | 11.7656| 0.6445| 18.2563| 0| 10.5022| 13.0291|
## |sleep_hrs | -0.7339| 0.0793| -9.2528| 0| -0.8894| -0.5784|
## |age | -0.0665| 0.0062| -10.6877| 0| -0.0787| -0.0543|
## |sexFemale | 1.5399| 0.2134| 7.2166| 0| 1.1216| 1.9583|
## ========================================================
## MODEL C: Full Model (all covariates)
## ========================================================
tidy(model_C, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable() %>%
print()##
##
## |term | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-------------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept) | 12.4755| 0.7170| 17.4006| 0.0000| 11.0699| 13.8810|
## |sleep_hrs | -0.5092| 0.0753| -6.7574| 0.0000| -0.6569| -0.3614|
## |age | -0.0823| 0.0059| -13.8724| 0.0000| -0.0939| -0.0707|
## |sexFemale | 1.2451| 0.2023| 6.1535| 0.0000| 0.8484| 1.6417|
## |physhlth_days | 0.2917| 0.0136| 21.4779| 0.0000| 0.2650| 0.3183|
## |income_cat | -0.3213| 0.0520| -6.1778| 0.0000| -0.4233| -0.2194|
## |exerciseYes | -0.3427| 0.2531| -1.3537| 0.1759| -0.8389| 0.1536|
# Create comparison table for sleep coefficient
sleep_coef_A <- tidy(model_A, conf.int = TRUE) %>% filter(term == "sleep_hrs")
sleep_coef_B <- tidy(model_B, conf.int = TRUE) %>% filter(term == "sleep_hrs")
sleep_coef_C <- tidy(model_C, conf.int = TRUE) %>% filter(term == "sleep_hrs")
sleep_compare <- data.frame(
Model = c("Model A: Sleep only",
"Model B: + Age + Sex",
"Model C: Full model"),
Estimate = c(round(sleep_coef_A$estimate, 3),
round(sleep_coef_B$estimate, 3),
round(sleep_coef_C$estimate, 3)),
SE = c(round(sleep_coef_A$std.error, 3),
round(sleep_coef_B$std.error, 3),
round(sleep_coef_C$std.error, 3)),
CI = c(paste0("[", round(sleep_coef_A$conf.low, 3), ", ", round(sleep_coef_A$conf.high, 3), "]"),
paste0("[", round(sleep_coef_B$conf.low, 3), ", ", round(sleep_coef_B$conf.high, 3), "]"),
paste0("[", round(sleep_coef_C$conf.low, 3), ", ", round(sleep_coef_C$conf.high, 3), "]")),
p_value = c(format.pval(sleep_coef_A$p.value, digits = 3),
format.pval(sleep_coef_B$p.value, digits = 3),
format.pval(sleep_coef_C$p.value, digits = 3))
)
sleep_compare %>%
kable(
caption = "Table 3a: Sleep Coefficient Across Sequential Models",
col.names = c("Model", "β (Sleep)", "SE", "95% CI", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | β (Sleep) | SE | 95% CI | p-value |
|---|---|---|---|---|
| Model A: Sleep only | -0.804 | 0.080 | [-0.962, -0.647] | <2e-16 |
| Model B: + Age + Sex | -0.734 | 0.079 | [-0.889, -0.578] | <2e-16 |
| Model C: Full model | -0.509 | 0.075 | [-0.657, -0.361] | 1.57e-11 |
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
# ==================================================
# TASK 3b: Compare Sleep Coefficient Across Models
# ==================================================
# Your coefficients from the output
coef_A <- -0.804
coef_B <- -0.734
coef_C <- -0.509
ci_A_lower <- -0.962
ci_A_upper <- -0.647
ci_B_lower <- -0.889
ci_B_upper <- -0.578
ci_C_lower <- -0.657
ci_C_upper <- -0.361
# Calculate percent changes
change_AB <- round((coef_B - coef_A) / coef_A * 100, 1)
change_AC <- round((coef_C - coef_A) / coef_A * 100, 1)
change_BC <- round((coef_C - coef_B) / coef_B * 100, 1)
# Create comparison table
comparison_table <- data.frame(
Model = c("Model A: Sleep only",
"Model B: + Age + Sex",
"Model C: Full model"),
Coefficient = c(coef_A, coef_B, coef_C),
CI = c(paste0("[", ci_A_lower, ", ", ci_A_upper, "]"),
paste0("[", ci_B_lower, ", ", ci_B_upper, "]"),
paste0("[", ci_C_lower, ", ", ci_C_upper, "]")),
Change = c("Reference",
paste0(change_AB, "%"),
paste0(change_AC, "%"))
)
comparison_table %>%
kable(
caption = "Table 3b: Sleep Coefficient Across Sequential Models",
col.names = c("Model", "β (Sleep)", "95% CI", "Change from Model A"),
digits = 3
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(1, background = "#F0F0F0")| Model | β (Sleep) | 95% CI | Change from Model A |
|---|---|---|---|
| Model A: Sleep only | -0.804 | [-0.962, -0.647] | Reference |
| Model B: + Age + Sex | -0.734 | [-0.889, -0.578] | -8.7% |
| Model C: Full model | -0.509 | [-0.657, -0.361] | -36.7% |
Interpretation:
The sleep coefficient attenuated from -0.804 in the unadjusted model (Model A) to -0.734 after adjusting for age and sex (Model B), a reduction of 8.7%. After full adjustment including physical health days, income, and exercise (Model C), the coefficient further attenuated to -0.509, a total reduction of 36.7% from the unadjusted estimate.
The largest drop occurred when adding physical health days, income, and exercise (Model B → Model C: 30.7% reduction), indicating that these variables—particularly physical health days—are important confounders of the sleep-mental health relationship. Physical health likely confounds this association because poor physical health both disrupts sleep and directly contributes to poorer mental health.
Despite the substantial attenuation, sleep remains statistically significant in all models (p < 0.001), suggesting an independent association with mental health even after comprehensive adjustment. This pattern—attenuation but persistence of significance—is classic evidence of partial confounding: some, but not all, of the observed association is explained by other variables. 3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
# ==================================================
# TASK 4a: R² and Adjusted R² Comparison
# ==================================================
# Extract R² and adjusted R² for each model
r2_A <- summary(model_A)$r.squared
r2_B <- summary(model_B)$r.squared
r2_C <- summary(model_C)$r.squared
adj_r2_A <- summary(model_A)$adj.r.squared
adj_r2_B <- summary(model_B)$adj.r.squared
adj_r2_C <- summary(model_C)$adj.r.squared
# Calculate variance explained as percentages
pct_A <- round(r2_A * 100, 1)
pct_B <- round(r2_B * 100, 1)
pct_C <- round(r2_C * 100, 1)
# Create comparison table
r2_table <- data.frame(
Model = c("Model A: Sleep only",
"Model B: + Age + Sex",
"Model C: Full model"),
R_squared = c(round(r2_A, 4), round(r2_B, 4), round(r2_C, 4)),
`Variance Explained` = c(paste0(pct_A, "%"), paste0(pct_B, "%"), paste0(pct_C, "%")),
Adjusted_R_squared = c(round(adj_r2_A, 4), round(adj_r2_B, 4), round(adj_r2_C, 4)),
check.names = FALSE
)
r2_table %>%
kable(
caption = "Table 4a: R² and Adjusted R² Across Models",
col.names = c("Model", "R²", "Variance Explained", "Adjusted R²"),
align = "lccc"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")| Model | R² | Variance Explained | Adjusted R² |
|---|---|---|---|
| Model A: Sleep only | 0.0197 | 2% | 0.0195 |
| Model B: + Age + Sex | 0.0504 | 5% | 0.0498 |
| Model C: Full model | 0.1569 | 15.7% | 0.1559 |
## ========================================================
## Model Improvement Summary
## ========================================================
## Model A explains 2.0% of the variance in mentally unhealthy days
cat(sprintf("Model B explains %.1f%% of the variance (%.1f%% increase from Model A)\n",
pct_B, pct_B - pct_A))## Model B explains 5.0% of the variance (3.0% increase from Model A)
cat(sprintf("Model C explains %.1f%% of the variance (%.1f%% increase from Model A)\n\n",
pct_C, pct_C - pct_A))## Model C explains 15.7% of the variance (13.7% increase from Model A)
## Which model explains the most variance?
## --------------------------------------------------------
## Model C (full model) explains the most variance at 15.7%,
## which is typical for mental health outcomes in population surveys.
## Mental health is influenced by many unmeasured factors including genetics,
## life events, social support, and trauma history, which is why even our best
## model leaves much of the variance unexplained.
Interpretation:
Model C (full model) explains the most variance in mentally unhealthy days, with an R² of 0.1569 (15.7%). This means that the combination of sleep hours, age, sex, physical health days, income, and exercise collectively explains about 15.7% of the variation in mental health days.
While this may seem low, it is typical for mental health outcomes in population-based surveys. Mental health is influenced by hundreds of unmeasured factors—genetics, life events, social support, trauma history, personality traits—that are not captured in this model. A low R² does not mean the model is “wrong” or “useless” for estimating associations. For etiologic questions, the magnitude and direction of coefficients and their statistical significance matter more than the proportion of variance explained.
The increase from Model A (sleep only) to Model C represents a substantial improvement in explanatory power, with the full model explaining about 13.7 percentage points more variance than sleep alone.
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
# ==================================================
# TASK 4b: Root MSE (Residual Standard Error) for Model C
# ==================================================
# Extract Root MSE from all models
rmse_A <- summary(model_A)$sigma
rmse_B <- summary(model_B)$sigma
rmse_C <- summary(model_C)$sigma
# Create summary table
rmse_summary <- data.frame(
Model = c("Model A: Sleep only",
"Model B: + Age + Sex",
"Model C: Full model"),
RMSE = c(round(rmse_A, 2), round(rmse_B, 2), round(rmse_C, 2)),
`Prediction Error` = c(paste0("±", round(rmse_A, 2), " days"),
paste0("±", round(rmse_B, 2), " days"),
paste0("±", round(rmse_C, 2), " days"))
)
rmse_summary %>%
kable(
caption = "Table 4b: Root Mean Square Error (RMSE) Across Models",
col.names = c("Model", "RMSE", "Average Prediction Error"),
align = "lcc"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | RMSE | Average Prediction Error |
|---|---|---|
| Model A: Sleep only | 7.64 | ±7.64 days |
| Model B: + Age + Sex | 7.52 | ±7.52 days |
| Model C: Full model | 7.09 | ±7.09 days |
## ========================================================
## Interpretation of RMSE for Model C
## ========================================================
## The Root MSE for Model C is **7.09 mentally unhealthy days**.
## **What does this mean?**
## The Root MSE (also called Residual Standard Error) represents the average
## distance between the observed values and the model's predicted values.
## In other words, on average, Model C's predictions are off by about
## 7.09 days.
## **Context:**
## The outcome (mentally unhealthy days) ranges from 0 to 30 days,
cat(sprintf("with a mean of %.1f days and standard deviation of %.1f days.\n",
mean(brfss_mlr$menthlth_days, na.rm = TRUE),
sd(brfss_mlr$menthlth_days, na.rm = TRUE)))## with a mean of 3.8 days and standard deviation of 7.7 days.
## An RMSE of 7.09 days represents a moderate level of prediction error.
## **Comparison with simpler models:**
## Model A (sleep only) had RMSE = 7.64 days
## Model C reduces prediction error by 7.2% compared to Model A.
## **Practical interpretation for public health:**
## While our model cannot perfectly predict an individual's mental health days,
## it identifies important population-level associations. The RMSE tells us that
## even our best model leaves substantial individual variation unexplained,
## which is expected given the complex, multifactorial nature of mental health.
## This is why epidemiologic research focuses on estimating associations
## (coefficients and confidence intervals) rather than making individual predictions.
Interpretation:
The Root MSE for Model C is 7.1 mentally unhealthy days. This means that, on average, the model’s predictions are off by about 7.1 days from the actual observed values.
Given that the outcome ranges from 0 to 30 days with a mean of approximately 3.8 days, an RMSE of 7.1 represents a moderate level of prediction error. This is expected for mental health outcomes, which are influenced by numerous unmeasured factors including genetics, life events, social support, trauma history, and personality traits.
Compared to Model A (sleep only), which had an RMSE of 7.7 days, Model C reduces prediction error by about 7.8%. While this improvement is modest, it demonstrates that the additional covariates help explain some of the variation in mental health outcomes.
In public health research, the primary goal is typically not individual prediction but rather estimating population-level associations. The RMSE provides context for understanding the limitations of our model while the coefficients and confidence intervals remain our primary focus for inference.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
# ==================================================
# TASK 4c: ANOVA Table for Model C
# ==================================================
# Load required package
library(broom)
# Get ANOVA output
anova_C <- anova(model_C)
# Display the ANOVA table
cat("========================================================\n")## ========================================================
## ANOVA Table for Model C (Full Model)
## ========================================================
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extract values for manual table
ss_regression <- sum(anova_C$`Sum Sq`[-length(anova_C$`Sum Sq`)])
ss_residual <- anova_C$`Sum Sq`[length(anova_C$`Sum Sq`)]
ss_total <- ss_regression + ss_residual
df_regression <- sum(anova_C$Df[-length(anova_C$Df)])
df_residual <- anova_C$Df[length(anova_C$Df)]
df_total <- df_regression + df_residual
ms_regression <- ss_regression / df_regression
ms_residual <- ss_residual / df_residual
f_stat <- ms_regression / ms_residual
f_pvalue <- pf(f_stat, df_regression, df_residual, lower.tail = FALSE)
# Create formatted ANOVA table
cat("\n")## ========================================================
## Manual ANOVA Table for Model C
## ========================================================
anova_table <- data.frame(
Source = c("Model (Regression)", "Residual (Error)", "Total"),
DF = c(df_regression, df_residual, df_total),
SS = c(round(ss_regression, 2), round(ss_residual, 2), round(ss_total, 2)),
MS = c(round(ms_regression, 2), round(ms_residual, 2), NA),
F = c(round(f_stat, 2), NA, NA),
`p-value` = c(format.pval(f_pvalue, digits = 4), NA, NA)
)
anova_table %>%
kable(
caption = "Table 4c: ANOVA Table for Model C",
col.names = c("Source", "df", "Sum of Squares", "Mean Square", "F", "p-value"),
align = "lcccc"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | df | Sum of Squares | Mean Square | F | p-value |
|---|---|---|---|---|---|
| Model (Regression) | 6 | 46718.54 | 7786.42 | 154.9 | < 2.2e-16 |
| Residual (Error) | 4993 | 250992.80 | 50.27 | NA | NA |
| Total | 4999 | 297711.34 | NA | NA | NA |
## ========================================================
## Hypothesis Test for Overall Model
## ========================================================
## Null Hypothesis (H₀): β₁ = β₂ = β₃ = β₄ = β₅ = β₆ = 0
## (None of the predictors are associated with mentally unhealthy days)
## Alternative Hypothesis (H₁): At least one β ≠ 0
## (At least one predictor is associated with mentally unhealthy days)
## Test Results:
## --------------------------------------------------------
cat(sprintf("F-statistic = %.2f with %d and %d degrees of freedom\n", f_stat, df_regression, df_residual))## F-statistic = 154.90 with 6 and 4993 degrees of freedom
## p-value = < 2.2e-16
## Decision at α = 0.05 :
## --------------------------------------------------------
if(f_pvalue < alpha) {
cat("✅ Reject the null hypothesis\n\n")
cat("CONCLUSION:\n")
cat("There is statistically significant evidence that at least one predictor\n")
cat("is associated with mentally unhealthy days. The model as a whole provides\n")
cat("significant prediction of the outcome.\n")
} else {
cat("❌ Fail to reject the null hypothesis\n\n")
cat("CONCLUSION:\n")
cat("There is not enough evidence to conclude that any predictor is associated\n")
cat("with mentally unhealthy days. The model does not provide significant\n")
cat("prediction of the outcome.\n")
}## ✅ Reject the null hypothesis
##
## CONCLUSION:
## There is statistically significant evidence that at least one predictor
## is associated with mentally unhealthy days. The model as a whole provides
## significant prediction of the outcome.
## ========================================================
## Interpretation of ANOVA Decomposition
## ========================================================
## Total Sum of Squares (SS_total = 297711.34) represents the total
## variability in mentally unhealthy days.
## Model Sum of Squares (SS_regression = 46718.54) represents the
## variability explained by the predictors (sleep, age, sex, physical health,
## income, and exercise).
## Residual Sum of Squares (SS_residual = 250992.80) represents the
## variability NOT explained by the model.
## The F-statistic of 154.90 indicates that the model explains
## significantly more variability than would be expected by chance alone.
Interpretation: ### Task 4c: ANOVA Table for Model C
ANOVA Table for Model C:
| Source | df | Sum of Squares | Mean Square | F | p-value |
|---|---|---|---|---|---|
| Model (Regression) | 6 | 46,718.54 | 7,786.42 | 154.9 | < 2.2e-16 |
| Residual (Error) | 4,993 | 250,992.80 | 50.27 | ||
| Total | 4,999 | 297,711.34 |
Null hypothesis for overall F-test:
H₀: β₁ = β₂ = β₃ = β₄ = β₅ = β₆ = 0 (none of the predictors are
associated with mentally unhealthy days)
Alternative hypothesis:
H₁: At least one β ≠ 0 (at least one predictor is associated with
mentally unhealthy days)
Test results:
F-statistic = 154.9 with 6 and 4,993 degrees of freedom
p-value < 2.2e-16
Conclusion: We reject the null hypothesis. There is statistically significant evidence that at least one predictor is associated with mentally unhealthy days. The model as a whole provides significant prediction of the outcome.
Interpretation of the decomposition:
The Total Sum of Squares (297,711.34) represents the total variability in mentally unhealthy days. This variability is partitioned into two components:
Model Sum of Squares (46,718.54): The variability explained by the predictors (sleep, age, sex, physical health, income, and exercise). This accounts for 15.7% of the total variability, which matches the R² value from Task 4a.
Residual Sum of Squares (250,992.80): The variability NOT explained by the model. This accounts for 84.3% of the total variability, representing the influence of unmeasured factors (genetics, life events, social support, trauma history, etc.).
The F-statistic of 154.9 is highly significant (p < 2.2e-16), indicating that the model explains significantly more variability than would be expected by chance alone. The ratio MS_model / MS_residual = 7,786.42 / 50.27 = 154.9 confirms that the explained variance is substantially larger than the unexplained variance after accounting for degrees of freedom.
This significant F-test justifies proceeding to interpret individual coefficients, as we can be confident that the model captures meaningful associations with mental health outcomes.
---
## ✅ **Summary Table**
| **Source** | **df** | **Sum of Squares** | **Mean Square** | **F** | **p-value** |
|------------|--------|---------------------|-----------------|-------|-------------|
| Model | 6 | 46,718.54 | 7,786.42 | 154.9 | < 2.2e-16 |
| Residual | 4,993 | 250,992.80 | 50.27 | | |
| Total | 4,999 | 297,711.34 | | | |
- **R² =** 46,718.54 / 297,711.34 = **0.157** (15.7%)
- **F-statistic interpretation:** The model explains significantly more variability than chance
---
## Task 5: Residual Diagnostics (15 points)
**5a. (5 pts)** For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook's Distance). Comment on what each plot tells you about the LINE assumptions.
``` r
# ==================================================
# TASK 5a: Residual Diagnostics for Model C
# ==================================================
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(model_C)
Interpretation:
Residuals vs Fitted Plot: Shows a curved pattern with residuals negative at lower fitted values and positive at higher values, indicating non-linearity. The relationship between predictors and mental health days may not be perfectly linear, especially at extremes.
Normal Q-Q Plot: Points deviate substantially from the diagonal line with an “S” shape, confirming non-normality of residuals. This is expected given the outcome is bounded (0-30) and right-skewed, with many zeros and a long tail.
Scale-Location Plot: Shows increasing spread of residuals with fitted values, indicating heteroscedasticity (non-constant variance). The model is less precise when predicting higher numbers of mentally unhealthy days.
Residuals vs Leverage Plot: No points fall outside Cook’s distance contours, indicating no highly influential observations. This is good—coefficient estimates are stable and not driven by outliers.
Overall Assessment: Given the nature of the outcome (bounded count data), some assumption violations are expected. The large sample size (n = 5,000) provides robustness against non-normality, and the absence of influential observations supports coefficient stability. While a Poisson or negative binomial model might technically be more appropriate for count data, the linear model provides interpretable coefficients and is robust to these mild violations for population-level inference.
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. ### Task 5b: Assumption Violations
Which assumptions are most likely violated? - Linearity – Residuals vs Fitted plot shows curved pattern - Normality – Q-Q plot shows deviation from diagonal line - Homoscedasticity – Scale-Location plot shows increasing spread
These violations occur because mentally unhealthy days is a count variable bounded 0-30 with right skew and many zero responses.
Does this invalidate the analysis?
NO – The analysis remains valid because: - Large sample
size (n = 5,000) provides robustness via Central Limit Theorem -
Coefficient estimates remain unbiased despite assumption violations - No
influential observations were detected - These patterns are expected and
acceptable for this type of outcome
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?
# ==================================================
# TASK 5c: Influential Observations (Cook's Distance)
# ==================================================
# Calculate Cook's distance
cooks_dist <- cooks.distance(model_C)
# Summary statistics
cat("========================================================\n")## ========================================================
## TASK 5c: Influential Observations (Cook's Distance)
## ========================================================
## Cook's Distance Summary:
## --------------------------------------------------------
## Minimum: 0
## Maximum: 0.0127
## Mean: 3e-04
## Median: 0
# Count observations with Cook's D > thresholds
n_influential <- sum(cooks_dist > 1, na.rm = TRUE)
cat("Observations with Cook's D > 1.0:", n_influential, "\n\n")## Observations with Cook's D > 1.0: 0
if(n_influential == 0) {
cat("✅ NO highly influential observations detected.\n")
cat(" All observations have Cook's D < 1.\n\n")
} else {
cat("⚠️ WARNING:", n_influential, "influential observations found.\n\n")
}## ✅ NO highly influential observations detected.
## All observations have Cook's D < 1.
# Plot Cook's distance
par(mfrow = c(1, 1))
plot(cooks_dist, type = "h",
main = "Cook's Distance for Model C",
ylab = "Cook's Distance",
xlab = "Observation Number",
col = "steelblue")
abline(h = 1, col = "red", lty = 2, lwd = 2)
abline(h = 0.5, col = "orange", lty = 3, lwd = 1.5)
legend("topright", legend = c("Cook's D", "Threshold (1.0)", "Moderate (0.5)"),
col = c("steelblue", "red", "orange"), lty = c(1, 2, 3), lwd = c(1, 2, 1.5))
### Task 5c: Influential Observations
Findings: - Maximum Cook’s Distance: 0.0127 (threshold = 1.0) - Mean Cook’s Distance: 0.0003 - Number of observations with Cook’s D > 1.0: 0
Interpretation:
There are no highly influential observations in Model
C. The maximum Cook’s D of 0.0127 is far below the conventional
threshold of 1.0, indicating that no single observation
disproportionately influences the regression coefficients. The results
are stable and generalizable.
What would you do if you found serious
violations?
If influential observations were detected, I would: 1. Examine them for
data entry errors or implausible values 2. Run sensitivity analyses with
and without these observations 3. Consider robust regression methods if
results change substantially 4. Document all decisions transparently in
the methods section
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should: