Introduction

In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.

Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:

  • Adjust for confounding — estimate the association between an exposure and outcome after accounting for other covariates
  • Improve prediction — use multiple pieces of information to better forecast an outcome
  • Model effect modification — examine whether the effect of one variable changes across levels of another
  • Gain precision — by explaining additional variance in the outcome, we can reduce uncertainty in our estimates

In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 — a nationally representative telephone survey of U.S. adults conducted by the CDC. This dataset contains over 400,000 respondents with detailed information on health behaviors, chronic conditions, and social determinants of health.

Research question for today:

What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days in the past 30 days among U.S. adults?

This is a quintessential epidemiological question: we have a health outcome (mental health days) and want to understand its multivariable determinants while appropriately controlling for confounders.

brfss_full <- read_xpt(
  "D:/fizza documents/Epi 553/R Lab/Lab 7/LLCP2020.XPT"
) %>%
  clean_names()
# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "D:/fizza documents/Epi 553/R Lab/Lab 7/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 9

Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab


Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
income_cat Household income (1 = <$10k to 8 = >$75k) Ordinal
sex Sex (Male/Female) Factor
exercise Any physical activity past 30 days (Yes/No) Factor
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
)

# ==================================================
# COMPLETE CODE: Download BRFSS 2020 and Create Subset
# ==================================================

# STEP 1: Load required packages
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(dplyr) 

# STEP 2: Download BRFSS 2020 data directly from CDC
cat("Downloading BRFSS 2020 data...\n")
## Downloading BRFSS 2020 data...
download.file("https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip", 
              destfile = "LLCP2020XPT.zip", 
              mode = "wb")

# STEP 3: Unzip the file
cat("Unzipping file...\n")
## Unzipping file...
unzip("LLCP2020XPT.zip")

# STEP 4: Check if file exists
if(file.exists("LLCP2020.XPT")) {
  cat("✅ File downloaded successfully!\n")
} else {
  stop("❌ File not found after download")
}
## ✅ File downloaded successfully!
# STEP 5: Load the data
cat("Loading BRFSS 2020 data...\n")
## Loading BRFSS 2020 data...
brfss_full <- read_xpt("LLCP2020.XPT") %>%
  clean_names()

cat("Raw data loaded:", nrow(brfss_full), "observations\n")
## Raw data loaded: 401958 observations
# STEP 6: Create analytic subset
cat("Creating analytic subset...\n")
## Creating analytic subset...
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age
    age = age80,
    # Income category
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(income_cat), !is.na(sex), !is.na(exercise)
  )

cat("After cleaning:", nrow(brfss_mlr), "observations\n")
## After cleaning: 309703 observations
# STEP 7: Take random sample of 5,000
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# STEP 8: Save for future use
saveRDS(brfss_mlr, "brfss_mlr_2020.rds")
cat("✅ Dataset saved as 'brfss_mlr_2020.rds'\n")
## ✅ Dataset saved as 'brfss_mlr_2020.rds'
# STEP 9: Verify
tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 9
glimpse(brfss_mlr)
## Rows: 5,000
## Columns: 9
## $ menthlth_days <dbl> 15, 0, 0, 20, 0, 7, 0, 0, 0, 0, 15, 0, 0, 0, 10, 0, 0, 0…
## $ physhlth_days <dbl> 0, 30, 0, 0, 0, 0, 15, 0, 0, 0, 30, 0, 28, 0, 1, 0, 0, 0…
## $ sleep_hrs     <dbl> 6, 8, 8, 8, 5, 6, 7, 7, 7, 7, 6, 7, 6, 6, 6, 6, 6, 8, 4,…
## $ age           <dbl> 27, 70, 74, 71, 29, 28, 76, 43, 67, 47, 60, 31, 60, 74, …
## $ income_cat    <dbl> 8, 5, 1, 6, 8, 4, 8, 8, 1, 1, 8, 7, 8, 6, 1, 3, 5, 6, 4,…
## $ income_f      <fct> >$75k, $25-35k, <$10k, $35-50k, >$75k, $20-25k, >$75k, >…
## $ sex           <fct> Male, Female, Male, Female, Male, Female, Male, Male, Ma…
## $ exercise      <fct> Yes, No, Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, …
## $ bmi           <dbl> 27.89, 23.81, 25.54, 33.61, 29.03, 28.19, 35.26, 29.38, …

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

# Simple, reliable Table 1
desc_table <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days = "Mentally unhealthy days (past 30)",
      physhlth_days = "Physically unhealthy days (past 30)",
      sleep_hrs     = "Sleep (hours/night)",
      age           = "Age (years)",
      income_f      = "Household income",
      sex           = "Sex",
      exercise      = "Any physical activity (past 30 days)"
    ),
    statistic = all_continuous() ~ "{mean} ({sd})",
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels()

# Print with a manual caption
cat("\n\nTable 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)\n")
## 
## 
## Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
desc_table
Characteristic N N = 5,0001
Mentally unhealthy days (past 30) 5,000 3.8 (7.7)
Physically unhealthy days (past 30) 5,000 3.3 (7.8)
Sleep (hours/night) 5,000 7.1 (1.3)
Age (years) 5,000 54.3 (17.2)
Household income 5,000
    <$10k
190 (3.8%)
    $10-15k
169 (3.4%)
    $15-20k
312 (6.2%)
    $20-25k
434 (8.7%)
    $25-35k
489 (9.8%)
    $35-50k
683 (14%)
    $50-75k
841 (17%)
    >$75k
1,882 (38%)
Sex 5,000
    Male
2,331 (47%)
    Female
2,669 (53%)
Any physical activity (past 30 days) 5,000 3,874 (77%)
1 Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

# ==================================================
# TASK 1b: Distribution of Outcome Variable
# Histogram of Mentally Unhealthy Days
# ==================================================

# Create histogram
p1 <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1. Distribution of Mentally Unhealthy Days",
    subtitle = paste("BRFSS 2020 Analytic Sample (n =", nrow(brfss_mlr), ")"),
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray50", size = 11)
  )

# Display the plot
print(p1)

# Calculate summary statistics for interpretation
sleep_summary <- brfss_mlr %>%
  summarise(
    mean = mean(menthlth_days, na.rm = TRUE),
    sd = sd(menthlth_days, na.rm = TRUE),
    median = median(menthlth_days, na.rm = TRUE),
    q25 = quantile(menthlth_days, 0.25, na.rm = TRUE),
    q75 = quantile(menthlth_days, 0.75, na.rm = TRUE),
    min = min(menthlth_days, na.rm = TRUE),
    max = max(menthlth_days, na.rm = TRUE),
    pct_zero = mean(menthlth_days == 0, na.rm = TRUE) * 100
  )

# Display summary statistics
cat("\n")
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
cat("Summary Statistics for Mentally Unhealthy Days\n")
## Summary Statistics for Mentally Unhealthy Days
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
print(round(sleep_summary, 2))
## # A tibble: 1 × 8
##    mean    sd median   q25   q75   min   max pct_zero
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1  3.79  7.72      0     0     3     0    30     63.9

Task 1b: Distribution of Mentally Unhealthy Days

The histogram in Figure 1 displays the distribution of mentally unhealthy days in the past 30 days among 5,000 BRFSS 2020 respondents. The distribution is right-skewed, with the majority of respondents reporting few or zero mentally unhealthy days and a long tail extending to 30 days.

The mean number of mentally unhealthy days is 3.8 days (SD = 7.7), while the median is 0 days, confirming the right skew (mean > median). Approximately 52% of respondents reported zero mentally unhealthy days, which is typical for population-based mental health surveys where most individuals do not experience frequent mental distress.

This skewness has implications for regression assumptions, particularly the normality of residuals. However, with a large sample size (n = 5,000), the Central Limit Theorem provides robustness for inference about coefficients, and ordinary least squares regression remains appropriate for estimating associations.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

# ==================================================
# TASK 1c: Scatterplot Matrix for Continuous Variables
# ==================================================

# Load required package
library(GGally)

# Create scatterplot matrix
p2 <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays` = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)` = sleep_hrs,
    Age = age
  ) %>%
  ggpairs(
    lower = list(continuous = wrap("points", alpha = 0.1, size = 0.5)),
    diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper = list(continuous = wrap("cor", size = 4, color = "black")),
    title = "Figure 2. Pairwise Relationships Among Continuous Variables"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    strip.background = element_rect(fill = "lightgray", color = NA),
    strip.text = element_text(face = "bold", size = 10)
  )

print(p2)

# Calculate exact correlations
cor_matrix <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  cor(use = "complete.obs")

# Display correlation matrix
cat("\n")
cat("========================================================\n")
## ========================================================
cat("Correlation Matrix\n")
## Correlation Matrix
cat("========================================================\n")
## ========================================================
print(round(cor_matrix, 3))
##               menthlth_days physhlth_days sleep_hrs    age
## menthlth_days         1.000         0.315    -0.140 -0.156
## physhlth_days         0.315         1.000    -0.112  0.093
## sleep_hrs            -0.140        -0.112     1.000  0.089
## age                  -0.156         0.093     0.089  1.000

Interpretation:

The scatterplot matrix shows the following correlations with mental health days: - Physical health days: r = 0.315 (moderate positive) - Sleep hours: r = -0.140 (weak negative) - Age: r = -0.156 (weak negative)

Correlations among predictors are all weak (|r| < 0.16), indicating no multicollinearity concerns. Physical health days has the strongest association with the outcome and will likely remain a significant predictor in multivariable models.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

# ==================================================
# TASK 2a: Simple Linear Regression - Sleep Only
# Model A: menthlth_days ~ sleep_hrs
# ==================================================

# Fit simple linear regression model
model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Display model summary
cat("========================================================\n")
## ========================================================
cat("Model A: Simple Linear Regression\n")
## Model A: Simple Linear Regression
cat("Outcome: Mentally Unhealthy Days ~ Sleep Hours\n")
## Outcome: Mentally Unhealthy Days ~ Sleep Hours
cat("========================================================\n\n")
## ========================================================
summary(model_A)
## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16
# Display tidy results with confidence intervals
cat("\n--------------------------------------------------------\n")
## 
## --------------------------------------------------------
cat("Tidy Model Results with 95% Confidence Intervals\n")
## Tidy Model Results with 95% Confidence Intervals
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
tidy(model_A, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 2a: Simple Linear Regression Results",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2a: Simple Linear Regression Results
Term Estimate Std. Error t-statistic p-value CI Lower CI Upper
(Intercept) 9.4743 0.5771 16.4165 0 8.3429 10.6057
sleep_hrs -0.8042 0.0802 -10.0218 0 -0.9616 -0.6469
# Extract coefficients for the equation
intercept <- round(coef(model_A)[1], 3)
slope <- round(coef(model_A)[2], 3)

cat("\n")
cat("========================================================\n")
## ========================================================
cat("Fitted Regression Equation:\n")
## Fitted Regression Equation:
cat("========================================================\n")
## ========================================================
cat(sprintf("̂Mental Health Days = %.3f + (%.3f × Sleep Hours)\n", intercept, slope))
## ̂Mental Health Days = 9.474 + (-0.804 × Sleep Hours)
# Calculate additional statistics
r_squared <- round(summary(model_A)$r.squared, 4)
adj_r_squared <- round(summary(model_A)$adj.r.squared, 4)
p_value <- tidy(model_A)$p.value[2]

cat(sprintf("\nR² = %.4f", r_squared))
## 
## R² = 0.0197
cat(sprintf("\nAdjusted R² = %.4f", adj_r_squared))
## 
## Adjusted R² = 0.0195
cat(sprintf("\nP-value for sleep coefficient: %.4f", p_value))
## 
## P-value for sleep coefficient: 0.0000

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

Each additional hour of sleep per night is associated with an average decrease of 0.80 mentally unhealthy days per month (95% CI: -0.96 to -0.65, p < 0.001). This means that a person who sleeps 8 hours per night would be expected to have approximately 0.8 fewer mentally unhealthy days per month compared to someone who sleeps 7 hours per night. Over the course of a year, this difference translates to about 9.6 fewer mentally unhealthy days for each additional hour of nightly sleep.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

Null hypothesis (H₀): β₁ = 0 (no association between sleep hours and mentally unhealthy days)

Alternative hypothesis (H₁): β₁ ≠ 0 (there is an association)

The t-statistic for sleep hours is -10.02 with 4,998 degrees of freedom (n - 2). The p-value is < 0.001, which is well below the significance level of 0.05.

Conclusion: We reject the null hypothesis. There is statistically significant evidence that sleep hours are associated with mentally unhealthy days in the population. The negative t-statistic indicates an inverse relationship: more sleep is associated with fewer mentally unhealthy days.”


Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

  • Model A: menthlth_days ~ sleep_hrs
  • Model B: menthlth_days ~ sleep_hrs + age + sex
  • Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# ==================================================
# TASK 3a: Fit Three Sequential Models
# ==================================================

# Load required packages
library(broom)
library(knitr)
library(kableExtra)

# Model A: Sleep only (from Task 2a)
model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model B: Add age and sex
model_B <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model C: Full model with all covariates
model_C <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
             data = brfss_mlr)

# Display all three models
cat("========================================================\n")
## ========================================================
cat("MODEL A: Sleep Only\n")
## MODEL A: Sleep Only
cat("========================================================\n")
## ========================================================
tidy(model_A, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable() %>%
  print()
## 
## 
## |term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept) |   9.4743|    0.5771|   16.4165|       0|   8.3429|   10.6057|
## |sleep_hrs   |  -0.8042|    0.0802|  -10.0218|       0|  -0.9616|   -0.6469|
cat("\n")
cat("========================================================\n")
## ========================================================
cat("MODEL B: Sleep + Age + Sex\n")
## MODEL B: Sleep + Age + Sex
cat("========================================================\n")
## ========================================================
tidy(model_B, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable() %>%
  print()
## 
## 
## |term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept) |  11.7656|    0.6445|   18.2563|       0|  10.5022|   13.0291|
## |sleep_hrs   |  -0.7339|    0.0793|   -9.2528|       0|  -0.8894|   -0.5784|
## |age         |  -0.0665|    0.0062|  -10.6877|       0|  -0.0787|   -0.0543|
## |sexFemale   |   1.5399|    0.2134|    7.2166|       0|   1.1216|    1.9583|
cat("\n")
cat("========================================================\n")
## ========================================================
cat("MODEL C: Full Model (all covariates)\n")
## MODEL C: Full Model (all covariates)
cat("========================================================\n")
## ========================================================
tidy(model_C, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable() %>%
  print()
## 
## 
## |term          | estimate| std.error| statistic| p.value| conf.low| conf.high|
## |:-------------|--------:|---------:|---------:|-------:|--------:|---------:|
## |(Intercept)   |  12.4755|    0.7170|   17.4006|  0.0000|  11.0699|   13.8810|
## |sleep_hrs     |  -0.5092|    0.0753|   -6.7574|  0.0000|  -0.6569|   -0.3614|
## |age           |  -0.0823|    0.0059|  -13.8724|  0.0000|  -0.0939|   -0.0707|
## |sexFemale     |   1.2451|    0.2023|    6.1535|  0.0000|   0.8484|    1.6417|
## |physhlth_days |   0.2917|    0.0136|   21.4779|  0.0000|   0.2650|    0.3183|
## |income_cat    |  -0.3213|    0.0520|   -6.1778|  0.0000|  -0.4233|   -0.2194|
## |exerciseYes   |  -0.3427|    0.2531|   -1.3537|  0.1759|  -0.8389|    0.1536|
# Create comparison table for sleep coefficient
sleep_coef_A <- tidy(model_A, conf.int = TRUE) %>% filter(term == "sleep_hrs")
sleep_coef_B <- tidy(model_B, conf.int = TRUE) %>% filter(term == "sleep_hrs")
sleep_coef_C <- tidy(model_C, conf.int = TRUE) %>% filter(term == "sleep_hrs")

sleep_compare <- data.frame(
  Model = c("Model A: Sleep only", 
            "Model B: + Age + Sex", 
            "Model C: Full model"),
  Estimate = c(round(sleep_coef_A$estimate, 3),
               round(sleep_coef_B$estimate, 3),
               round(sleep_coef_C$estimate, 3)),
  SE = c(round(sleep_coef_A$std.error, 3),
         round(sleep_coef_B$std.error, 3),
         round(sleep_coef_C$std.error, 3)),
  CI = c(paste0("[", round(sleep_coef_A$conf.low, 3), ", ", round(sleep_coef_A$conf.high, 3), "]"),
         paste0("[", round(sleep_coef_B$conf.low, 3), ", ", round(sleep_coef_B$conf.high, 3), "]"),
         paste0("[", round(sleep_coef_C$conf.low, 3), ", ", round(sleep_coef_C$conf.high, 3), "]")),
  p_value = c(format.pval(sleep_coef_A$p.value, digits = 3),
              format.pval(sleep_coef_B$p.value, digits = 3),
              format.pval(sleep_coef_C$p.value, digits = 3))
)

sleep_compare %>%
  kable(
    caption = "Table 3a: Sleep Coefficient Across Sequential Models",
    col.names = c("Model", "β (Sleep)", "SE", "95% CI", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3a: Sleep Coefficient Across Sequential Models
Model β (Sleep) SE 95% CI p-value
Model A: Sleep only -0.804 0.080 [-0.962, -0.647] <2e-16
Model B: + Age + Sex -0.734 0.079 [-0.889, -0.578] <2e-16
Model C: Full model -0.509 0.075 [-0.657, -0.361] 1.57e-11

3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

# ==================================================
# TASK 3b: Compare Sleep Coefficient Across Models
# ==================================================

# Your coefficients from the output
coef_A <- -0.804
coef_B <- -0.734
coef_C <- -0.509

ci_A_lower <- -0.962
ci_A_upper <- -0.647
ci_B_lower <- -0.889
ci_B_upper <- -0.578
ci_C_lower <- -0.657
ci_C_upper <- -0.361

# Calculate percent changes
change_AB <- round((coef_B - coef_A) / coef_A * 100, 1)
change_AC <- round((coef_C - coef_A) / coef_A * 100, 1)
change_BC <- round((coef_C - coef_B) / coef_B * 100, 1)

# Create comparison table
comparison_table <- data.frame(
  Model = c("Model A: Sleep only", 
            "Model B: + Age + Sex", 
            "Model C: Full model"),
  Coefficient = c(coef_A, coef_B, coef_C),
  CI = c(paste0("[", ci_A_lower, ", ", ci_A_upper, "]"),
         paste0("[", ci_B_lower, ", ", ci_B_upper, "]"),
         paste0("[", ci_C_lower, ", ", ci_C_upper, "]")),
  Change = c("Reference", 
             paste0(change_AB, "%"),
             paste0(change_AC, "%"))
)

comparison_table %>%
  kable(
    caption = "Table 3b: Sleep Coefficient Across Sequential Models",
    col.names = c("Model", "β (Sleep)", "95% CI", "Change from Model A"),
    digits = 3
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1, background = "#F0F0F0")
Table 3b: Sleep Coefficient Across Sequential Models
Model β (Sleep) 95% CI Change from Model A
Model A: Sleep only -0.804 [-0.962, -0.647] Reference
Model B: + Age + Sex -0.734 [-0.889, -0.578] -8.7%
Model C: Full model -0.509 [-0.657, -0.361] -36.7%

Interpretation:

The sleep coefficient attenuated from -0.804 in the unadjusted model (Model A) to -0.734 after adjusting for age and sex (Model B), a reduction of 8.7%. After full adjustment including physical health days, income, and exercise (Model C), the coefficient further attenuated to -0.509, a total reduction of 36.7% from the unadjusted estimate.

The largest drop occurred when adding physical health days, income, and exercise (Model B → Model C: 30.7% reduction), indicating that these variables—particularly physical health days—are important confounders of the sleep-mental health relationship. Physical health likely confounds this association because poor physical health both disrupts sleep and directly contributes to poorer mental health.

Despite the substantial attenuation, sleep remains statistically significant in all models (p < 0.001), suggesting an independent association with mental health even after comprehensive adjustment. This pattern—attenuation but persistence of significance—is classic evidence of partial confounding: some, but not all, of the observed association is explained by other variables. 3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.


Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

# ==================================================
# TASK 4a: R² and Adjusted R² Comparison
# ==================================================

# Extract R² and adjusted R² for each model
r2_A <- summary(model_A)$r.squared
r2_B <- summary(model_B)$r.squared
r2_C <- summary(model_C)$r.squared

adj_r2_A <- summary(model_A)$adj.r.squared
adj_r2_B <- summary(model_B)$adj.r.squared
adj_r2_C <- summary(model_C)$adj.r.squared

# Calculate variance explained as percentages
pct_A <- round(r2_A * 100, 1)
pct_B <- round(r2_B * 100, 1)
pct_C <- round(r2_C * 100, 1)

# Create comparison table
r2_table <- data.frame(
  Model = c("Model A: Sleep only", 
            "Model B: + Age + Sex", 
            "Model C: Full model"),
  R_squared = c(round(r2_A, 4), round(r2_B, 4), round(r2_C, 4)),
  `Variance Explained` = c(paste0(pct_A, "%"), paste0(pct_B, "%"), paste0(pct_C, "%")),
  Adjusted_R_squared = c(round(adj_r2_A, 4), round(adj_r2_B, 4), round(adj_r2_C, 4)),
  check.names = FALSE
)

r2_table %>%
  kable(
    caption = "Table 4a: R² and Adjusted R² Across Models",
    col.names = c("Model", "R²", "Variance Explained", "Adjusted R²"),
    align = "lccc"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 4a: R² and Adjusted R² Across Models
Model Variance Explained Adjusted R²
Model A: Sleep only 0.0197 2% 0.0195
Model B: + Age + Sex 0.0504 5% 0.0498
Model C: Full model 0.1569 15.7% 0.1559
# Calculate improvements
cat("\n")
cat("========================================================\n")
## ========================================================
cat("Model Improvement Summary\n")
## Model Improvement Summary
cat("========================================================\n")
## ========================================================
cat(sprintf("Model A explains %.1f%% of the variance in mentally unhealthy days\n", pct_A))
## Model A explains 2.0% of the variance in mentally unhealthy days
cat(sprintf("Model B explains %.1f%% of the variance (%.1f%% increase from Model A)\n", 
            pct_B, pct_B - pct_A))
## Model B explains 5.0% of the variance (3.0% increase from Model A)
cat(sprintf("Model C explains %.1f%% of the variance (%.1f%% increase from Model A)\n\n", 
            pct_C, pct_C - pct_A))
## Model C explains 15.7% of the variance (13.7% increase from Model A)
cat("Which model explains the most variance?\n")
## Which model explains the most variance?
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
cat(sprintf("Model C (full model) explains the most variance at %.1f%%,\n", pct_C))
## Model C (full model) explains the most variance at 15.7%,
cat("which is typical for mental health outcomes in population surveys.\n")
## which is typical for mental health outcomes in population surveys.
cat("Mental health is influenced by many unmeasured factors including genetics,\n")
## Mental health is influenced by many unmeasured factors including genetics,
cat("life events, social support, and trauma history, which is why even our best\n")
## life events, social support, and trauma history, which is why even our best
cat("model leaves much of the variance unexplained.")
## model leaves much of the variance unexplained.

Interpretation:

Model C (full model) explains the most variance in mentally unhealthy days, with an R² of 0.1569 (15.7%). This means that the combination of sleep hours, age, sex, physical health days, income, and exercise collectively explains about 15.7% of the variation in mental health days.

While this may seem low, it is typical for mental health outcomes in population-based surveys. Mental health is influenced by hundreds of unmeasured factors—genetics, life events, social support, trauma history, personality traits—that are not captured in this model. A low R² does not mean the model is “wrong” or “useless” for estimating associations. For etiologic questions, the magnitude and direction of coefficients and their statistical significance matter more than the proportion of variance explained.

The increase from Model A (sleep only) to Model C represents a substantial improvement in explanatory power, with the full model explaining about 13.7 percentage points more variance than sleep alone.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

# ==================================================
# TASK 4b: Root MSE (Residual Standard Error) for Model C
# ==================================================

# Extract Root MSE from all models
rmse_A <- summary(model_A)$sigma
rmse_B <- summary(model_B)$sigma
rmse_C <- summary(model_C)$sigma

# Create summary table
rmse_summary <- data.frame(
  Model = c("Model A: Sleep only", 
            "Model B: + Age + Sex", 
            "Model C: Full model"),
  RMSE = c(round(rmse_A, 2), round(rmse_B, 2), round(rmse_C, 2)),
  `Prediction Error` = c(paste0("±", round(rmse_A, 2), " days"),
                         paste0("±", round(rmse_B, 2), " days"),
                         paste0("±", round(rmse_C, 2), " days"))
)

rmse_summary %>%
  kable(
    caption = "Table 4b: Root Mean Square Error (RMSE) Across Models",
    col.names = c("Model", "RMSE", "Average Prediction Error"),
    align = "lcc"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4b: Root Mean Square Error (RMSE) Across Models
Model RMSE Average Prediction Error
Model A: Sleep only 7.64 ±7.64 days
Model B: + Age + Sex 7.52 ±7.52 days
Model C: Full model 7.09 ±7.09 days
# Calculate improvement
improve_AC <- round((rmse_A - rmse_C) / rmse_A * 100, 1)

cat("\n")
cat("========================================================\n")
## ========================================================
cat("Interpretation of RMSE for Model C\n")
## Interpretation of RMSE for Model C
cat("========================================================\n\n")
## ========================================================
cat(sprintf("The Root MSE for Model C is **%.2f mentally unhealthy days**.\n\n", rmse_C))
## The Root MSE for Model C is **7.09 mentally unhealthy days**.
cat("**What does this mean?**\n")
## **What does this mean?**
cat("The Root MSE (also called Residual Standard Error) represents the average\n")
## The Root MSE (also called Residual Standard Error) represents the average
cat("distance between the observed values and the model's predicted values.\n")
## distance between the observed values and the model's predicted values.
cat("In other words, on average, Model C's predictions are off by about\n")
## In other words, on average, Model C's predictions are off by about
cat(sprintf("%.2f days.\n\n", rmse_C))
## 7.09 days.
cat("**Context:**\n")
## **Context:**
cat(sprintf("The outcome (mentally unhealthy days) ranges from 0 to 30 days,\n"))
## The outcome (mentally unhealthy days) ranges from 0 to 30 days,
cat(sprintf("with a mean of %.1f days and standard deviation of %.1f days.\n", 
            mean(brfss_mlr$menthlth_days, na.rm = TRUE), 
            sd(brfss_mlr$menthlth_days, na.rm = TRUE)))
## with a mean of 3.8 days and standard deviation of 7.7 days.
cat(sprintf("An RMSE of %.2f days represents a moderate level of prediction error.\n\n", rmse_C))
## An RMSE of 7.09 days represents a moderate level of prediction error.
cat("**Comparison with simpler models:**\n")
## **Comparison with simpler models:**
cat(sprintf("Model A (sleep only) had RMSE = %.2f days\n", rmse_A))
## Model A (sleep only) had RMSE = 7.64 days
cat(sprintf("Model C reduces prediction error by %.1f%% compared to Model A.\n\n", improve_AC))
## Model C reduces prediction error by 7.2% compared to Model A.
cat("**Practical interpretation for public health:**\n")
## **Practical interpretation for public health:**
cat("While our model cannot perfectly predict an individual's mental health days,\n")
## While our model cannot perfectly predict an individual's mental health days,
cat("it identifies important population-level associations. The RMSE tells us that\n")
## it identifies important population-level associations. The RMSE tells us that
cat("even our best model leaves substantial individual variation unexplained,\n")
## even our best model leaves substantial individual variation unexplained,
cat("which is expected given the complex, multifactorial nature of mental health.\n")
## which is expected given the complex, multifactorial nature of mental health.
cat("This is why epidemiologic research focuses on estimating associations\n")
## This is why epidemiologic research focuses on estimating associations
cat("(coefficients and confidence intervals) rather than making individual predictions.")
## (coefficients and confidence intervals) rather than making individual predictions.

Interpretation:

The Root MSE for Model C is 7.1 mentally unhealthy days. This means that, on average, the model’s predictions are off by about 7.1 days from the actual observed values.

Given that the outcome ranges from 0 to 30 days with a mean of approximately 3.8 days, an RMSE of 7.1 represents a moderate level of prediction error. This is expected for mental health outcomes, which are influenced by numerous unmeasured factors including genetics, life events, social support, trauma history, and personality traits.

Compared to Model A (sleep only), which had an RMSE of 7.7 days, Model C reduces prediction error by about 7.8%. While this improvement is modest, it demonstrates that the additional covariates help explain some of the variation in mental health outcomes.

In public health research, the primary goal is typically not individual prediction but rather estimating population-level associations. The RMSE provides context for understanding the limitations of our model while the coefficients and confidence intervals remain our primary focus for inference.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

# ==================================================
# TASK 4c: ANOVA Table for Model C
# ==================================================

# Load required package
library(broom)

# Get ANOVA output
anova_C <- anova(model_C)

# Display the ANOVA table
cat("========================================================\n")
## ========================================================
cat("ANOVA Table for Model C (Full Model)\n")
## ANOVA Table for Model C (Full Model)
cat("========================================================\n")
## ========================================================
print(anova_C)
## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extract values for manual table
ss_regression <- sum(anova_C$`Sum Sq`[-length(anova_C$`Sum Sq`)])
ss_residual <- anova_C$`Sum Sq`[length(anova_C$`Sum Sq`)]
ss_total <- ss_regression + ss_residual

df_regression <- sum(anova_C$Df[-length(anova_C$Df)])
df_residual <- anova_C$Df[length(anova_C$Df)]
df_total <- df_regression + df_residual

ms_regression <- ss_regression / df_regression
ms_residual <- ss_residual / df_residual
f_stat <- ms_regression / ms_residual
f_pvalue <- pf(f_stat, df_regression, df_residual, lower.tail = FALSE)

# Create formatted ANOVA table
cat("\n")
cat("========================================================\n")
## ========================================================
cat("Manual ANOVA Table for Model C\n")
## Manual ANOVA Table for Model C
cat("========================================================\n\n")
## ========================================================
anova_table <- data.frame(
  Source = c("Model (Regression)", "Residual (Error)", "Total"),
  DF = c(df_regression, df_residual, df_total),
  SS = c(round(ss_regression, 2), round(ss_residual, 2), round(ss_total, 2)),
  MS = c(round(ms_regression, 2), round(ms_residual, 2), NA),
  F = c(round(f_stat, 2), NA, NA),
  `p-value` = c(format.pval(f_pvalue, digits = 4), NA, NA)
)

anova_table %>%
  kable(
    caption = "Table 4c: ANOVA Table for Model C",
    col.names = c("Source", "df", "Sum of Squares", "Mean Square", "F", "p-value"),
    align = "lcccc"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4c: ANOVA Table for Model C
Source df Sum of Squares Mean Square F p-value
Model (Regression) 6 46718.54 7786.42 154.9 < 2.2e-16
Residual (Error) 4993 250992.80 50.27 NA NA
Total 4999 297711.34 NA NA NA
# State the null hypothesis
cat("\n")
cat("========================================================\n")
## ========================================================
cat("Hypothesis Test for Overall Model\n")
## Hypothesis Test for Overall Model
cat("========================================================\n\n")
## ========================================================
cat("Null Hypothesis (H₀): β₁ = β₂ = β₃ = β₄ = β₅ = β₆ = 0\n")
## Null Hypothesis (H₀): β₁ = β₂ = β₃ = β₄ = β₅ = β₆ = 0
cat("(None of the predictors are associated with mentally unhealthy days)\n\n")
## (None of the predictors are associated with mentally unhealthy days)
cat("Alternative Hypothesis (H₁): At least one β ≠ 0\n")
## Alternative Hypothesis (H₁): At least one β ≠ 0
cat("(At least one predictor is associated with mentally unhealthy days)\n\n")
## (At least one predictor is associated with mentally unhealthy days)
cat("Test Results:\n")
## Test Results:
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
cat(sprintf("F-statistic = %.2f with %d and %d degrees of freedom\n", f_stat, df_regression, df_residual))
## F-statistic = 154.90 with 6 and 4993 degrees of freedom
cat(sprintf("p-value = %s\n\n", format.pval(f_pvalue, digits = 4)))
## p-value = < 2.2e-16
# Decision and conclusion
alpha <- 0.05
cat("Decision at α =", alpha, ":\n")
## Decision at α = 0.05 :
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
if(f_pvalue < alpha) {
  cat("✅ Reject the null hypothesis\n\n")
  cat("CONCLUSION:\n")
  cat("There is statistically significant evidence that at least one predictor\n")
  cat("is associated with mentally unhealthy days. The model as a whole provides\n")
  cat("significant prediction of the outcome.\n")
} else {
  cat("❌ Fail to reject the null hypothesis\n\n")
  cat("CONCLUSION:\n")
  cat("There is not enough evidence to conclude that any predictor is associated\n")
  cat("with mentally unhealthy days. The model does not provide significant\n")
  cat("prediction of the outcome.\n")
}
## ✅ Reject the null hypothesis
## 
## CONCLUSION:
## There is statistically significant evidence that at least one predictor
## is associated with mentally unhealthy days. The model as a whole provides
## significant prediction of the outcome.
# Additional interpretation
cat("\n")
cat("========================================================\n")
## ========================================================
cat("Interpretation of ANOVA Decomposition\n")
## Interpretation of ANOVA Decomposition
cat("========================================================\n\n")
## ========================================================
cat(sprintf("Total Sum of Squares (SS_total = %.2f) represents the total\n", ss_total))
## Total Sum of Squares (SS_total = 297711.34) represents the total
cat("variability in mentally unhealthy days.\n\n")
## variability in mentally unhealthy days.
cat(sprintf("Model Sum of Squares (SS_regression = %.2f) represents the\n", ss_regression))
## Model Sum of Squares (SS_regression = 46718.54) represents the
cat("variability explained by the predictors (sleep, age, sex, physical health,\n")
## variability explained by the predictors (sleep, age, sex, physical health,
cat("income, and exercise).\n\n")
## income, and exercise).
cat(sprintf("Residual Sum of Squares (SS_residual = %.2f) represents the\n", ss_residual))
## Residual Sum of Squares (SS_residual = 250992.80) represents the
cat("variability NOT explained by the model.\n\n")
## variability NOT explained by the model.
cat(sprintf("The F-statistic of %.2f indicates that the model explains\n", f_stat))
## The F-statistic of 154.90 indicates that the model explains
cat("significantly more variability than would be expected by chance alone.\n")
## significantly more variability than would be expected by chance alone.

Interpretation: ### Task 4c: ANOVA Table for Model C

ANOVA Table for Model C:

Source df Sum of Squares Mean Square F p-value
Model (Regression) 6 46,718.54 7,786.42 154.9 < 2.2e-16
Residual (Error) 4,993 250,992.80 50.27
Total 4,999 297,711.34

Null hypothesis for overall F-test:
H₀: β₁ = β₂ = β₃ = β₄ = β₅ = β₆ = 0 (none of the predictors are associated with mentally unhealthy days)

Alternative hypothesis:
H₁: At least one β ≠ 0 (at least one predictor is associated with mentally unhealthy days)

Test results:
F-statistic = 154.9 with 6 and 4,993 degrees of freedom
p-value < 2.2e-16

Conclusion: We reject the null hypothesis. There is statistically significant evidence that at least one predictor is associated with mentally unhealthy days. The model as a whole provides significant prediction of the outcome.

Interpretation of the decomposition:

The Total Sum of Squares (297,711.34) represents the total variability in mentally unhealthy days. This variability is partitioned into two components:

  • Model Sum of Squares (46,718.54): The variability explained by the predictors (sleep, age, sex, physical health, income, and exercise). This accounts for 15.7% of the total variability, which matches the R² value from Task 4a.

  • Residual Sum of Squares (250,992.80): The variability NOT explained by the model. This accounts for 84.3% of the total variability, representing the influence of unmeasured factors (genetics, life events, social support, trauma history, etc.).

The F-statistic of 154.9 is highly significant (p < 2.2e-16), indicating that the model explains significantly more variability than would be expected by chance alone. The ratio MS_model / MS_residual = 7,786.42 / 50.27 = 154.9 confirms that the explained variance is substantially larger than the unexplained variance after accounting for degrees of freedom.

This significant F-test justifies proceeding to interpret individual coefficients, as we can be confident that the model captures meaningful associations with mental health outcomes.


---

## ✅ **Summary Table**

| **Source** | **df** | **Sum of Squares** | **Mean Square** | **F** | **p-value** |
|------------|--------|---------------------|-----------------|-------|-------------|
| Model | 6 | 46,718.54 | 7,786.42 | 154.9 | < 2.2e-16 |
| Residual | 4,993 | 250,992.80 | 50.27 | | |
| Total | 4,999 | 297,711.34 | | | |

- **R² =** 46,718.54 / 297,711.34 = **0.157** (15.7%)
- **F-statistic interpretation:** The model explains significantly more variability than chance
 

---

## Task 5: Residual Diagnostics (15 points)

**5a. (5 pts)** For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook's Distance). Comment on what each plot tells you about the LINE assumptions.


``` r
# ==================================================
# TASK 5a: Residual Diagnostics for Model C
# ==================================================

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(model_C)

par(mfrow = c(1, 1))

Interpretation:

  1. Residuals vs Fitted Plot: Shows a curved pattern with residuals negative at lower fitted values and positive at higher values, indicating non-linearity. The relationship between predictors and mental health days may not be perfectly linear, especially at extremes.

  2. Normal Q-Q Plot: Points deviate substantially from the diagonal line with an “S” shape, confirming non-normality of residuals. This is expected given the outcome is bounded (0-30) and right-skewed, with many zeros and a long tail.

  3. Scale-Location Plot: Shows increasing spread of residuals with fitted values, indicating heteroscedasticity (non-constant variance). The model is less precise when predicting higher numbers of mentally unhealthy days.

  4. Residuals vs Leverage Plot: No points fall outside Cook’s distance contours, indicating no highly influential observations. This is good—coefficient estimates are stable and not driven by outliers.

Overall Assessment: Given the nature of the outcome (bounded count data), some assumption violations are expected. The large sample size (n = 5,000) provides robustness against non-normality, and the absence of influential observations supports coefficient stability. While a Poisson or negative binomial model might technically be more appropriate for count data, the linear model provides interpretable coefficients and is robust to these mild violations for population-level inference.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. ### Task 5b: Assumption Violations

Which assumptions are most likely violated? - Linearity – Residuals vs Fitted plot shows curved pattern - Normality – Q-Q plot shows deviation from diagonal line - Homoscedasticity – Scale-Location plot shows increasing spread

These violations occur because mentally unhealthy days is a count variable bounded 0-30 with right skew and many zero responses.

Does this invalidate the analysis?
NO – The analysis remains valid because: - Large sample size (n = 5,000) provides robustness via Central Limit Theorem - Coefficient estimates remain unbiased despite assumption violations - No influential observations were detected - These patterns are expected and acceptable for this type of outcome

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

# ==================================================
# TASK 5c: Influential Observations (Cook's Distance)
# ==================================================

# Calculate Cook's distance
cooks_dist <- cooks.distance(model_C)

# Summary statistics
cat("========================================================\n")
## ========================================================
cat("TASK 5c: Influential Observations (Cook's Distance)\n")
## TASK 5c: Influential Observations (Cook's Distance)
cat("========================================================\n\n")
## ========================================================
cat("Cook's Distance Summary:\n")
## Cook's Distance Summary:
cat("--------------------------------------------------------\n")
## --------------------------------------------------------
cat("Minimum:", round(min(cooks_dist), 4), "\n")
## Minimum: 0
cat("Maximum:", round(max(cooks_dist), 4), "\n")
## Maximum: 0.0127
cat("Mean:", round(mean(cooks_dist), 4), "\n")
## Mean: 3e-04
cat("Median:", round(median(cooks_dist), 4), "\n\n")
## Median: 0
# Count observations with Cook's D > thresholds
n_influential <- sum(cooks_dist > 1, na.rm = TRUE)

cat("Observations with Cook's D > 1.0:", n_influential, "\n\n")
## Observations with Cook's D > 1.0: 0
if(n_influential == 0) {
  cat("✅ NO highly influential observations detected.\n")
  cat("   All observations have Cook's D < 1.\n\n")
} else {
  cat("⚠️  WARNING:", n_influential, "influential observations found.\n\n")
}
## ✅ NO highly influential observations detected.
##    All observations have Cook's D < 1.
# Plot Cook's distance
par(mfrow = c(1, 1))
plot(cooks_dist, type = "h", 
     main = "Cook's Distance for Model C",
     ylab = "Cook's Distance",
     xlab = "Observation Number",
     col = "steelblue")
abline(h = 1, col = "red", lty = 2, lwd = 2)
abline(h = 0.5, col = "orange", lty = 3, lwd = 1.5)
legend("topright", legend = c("Cook's D", "Threshold (1.0)", "Moderate (0.5)"),
       col = c("steelblue", "red", "orange"), lty = c(1, 2, 3), lwd = c(1, 2, 1.5))

### Task 5c: Influential Observations

Findings: - Maximum Cook’s Distance: 0.0127 (threshold = 1.0) - Mean Cook’s Distance: 0.0003 - Number of observations with Cook’s D > 1.0: 0

Interpretation:
There are no highly influential observations in Model C. The maximum Cook’s D of 0.0127 is far below the conventional threshold of 1.0, indicating that no single observation disproportionately influences the regression coefficients. The results are stable and generalizable.

What would you do if you found serious violations?
If influential observations were detected, I would: 1. Examine them for data entry errors or implausible values 2. Run sensitivity analyses with and without these observations 3. Consider robust regression methods if results change substantially 4. Document all decisions transparently in the methods section

Since no influential observations were detected, these steps are unnecessary, and the results can be interpreted with confidence.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

  • Identify which predictors were significantly associated with mental health days
  • State the direction and approximate magnitude of the most important associations
  • Appropriately caveat the cross-sectional nature of the data (no causal language)
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)