Introduction

In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.

Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:

Adjust for confounding — estimate the association between an exposure and outcome after accounting for other covariates
Improve prediction — use multiple pieces of information to better forecast an outcome
Model effect modification — examine whether the effect of one variable changes across levels of another
Gain precision — by explaining additional variance in the outcome, we can reduce uncertainty in our estimates

In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 — a nationally representative telephone survey of U.S. adults conducted by the CDC. This dataset contains over 400,000 respondents with detailed information on health behaviors, chronic conditions, and social determinants of health.

Research question for today:

What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days in the past 30 days among U.S. adults?

This is a quintessential epidemiological question: we have a health outcome (mental health days) and want to understand its multivariable determinants while appropriately controlling for confounders.

brfss_full <- read_xpt(
  "C:/Users/suruc/OneDrive/Desktop/R/LLCP2020.XPT"
) %>%
  clean_names()

# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026

Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

Descriptive Statistics

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  # add_overall() %>%
  bold_labels() %>%
  #italicize_levels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

Note: The outcome is right-skewed — most respondents report zero mentally unhealthy days, with a tail toward 30. The majority of respondents (the mode at 0) report zero mentally unhealthy days in the past month, with progressively fewer people reporting higher values. There is a small secondary peak at 30 days, representing individuals who experienced poor mental health throughout the entire month.

Implications for regression modeling:

Normality assumption: The residuals from a linear model are unlikely to be perfectly normal because the outcome itself is highly non-normal. However, with our large sample size (n = 5,000), the Central Limit Theorem provides robustness— our coefficient estimates and standard errors will still be approximately valid.
Homoscedasticity concern: The bounded nature of the outcome (0-30 days) and the clustering at 0 may lead to heteroscedasticity (unequal variance of residuals). We should check residual plots carefully.
Alternative models: For count outcomes with excess zeros, zero-inflated models, negative binomial regression, or Poisson regression might be more appropriate. However, linear regression remains interpretable and is often used as a starting point.

Study Note: Right-skewed outcomes are common in health data (hospital days, sick days, healthcare costs). Linear regression can still provide valid estimates of associations, especially with large samples, though predictions at extreme values should be interpreted cautiously.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

# Create scatterplot matrix for continuous variables
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    "Mentally unhealthy days (past 30)" = menthlth_days,
     "Physically unhealthy days (past 30)" = physhlth_days,
      "Sleep (hours/night)" = sleep_hrs,
     "Age (years)"= age ,
  ) %>%
  ggpairs(
    lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper = list(continuous = wrap("cor", size = 4)),
    title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Mental Health Days vs. Physical Health Days (r ≈ 0.315):
This is the strongest correlation in the matrix. People with more mental distress also report more physical distress.

Mental Health Days vs. Sleep Hours (r ≈ -0.14):
Very weak negative relation. Less sleep is associated with slightly more mentally unhealthy days, but the relationship is modest. Sleep may be both a cause (poor sleep worsens mood) and consequence of poor mental health.

Mental Health Days vs. Age (r ≈ -0.156):
Also very weak negative relation. Older adults report slightly fewer mentally unhealthy days on average.

These are unadjusted (crude) correlations. In multivariable regression, these relationships may change after accounting for confounding.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

2a. Fit Simple Linear Regression (5 pts)

# Fit simple linear regression: menthlth_days ~ sleep_hrs
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Display model summary
summary(model_slr)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

# Extract coefficients
coef_slr <- coef(model_slr)

Fitted Regression Equation:

Recall the SLR model:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

\[\widehat{\text{Mental Health Days}} = \beta_0 + \beta_1 \times (\text{Sleep Hours}) + \varepsilon_i\]

\[\widehat{Y} = 9.47 - 0.804 \times \text{Sleep} + \varepsilon_i\]

Symbol	Meaning
$Y_i$	Outcome for subject $i$ (mentally unhealthy days)
$\beta_0$	Intercept — expected value of $Y$ when all $X$s = 0
$\beta_j$	Partial regression coefficient for predictor $j$
$X_{ji}$	Value of predictor $j$ for subject $i$
$p$	Number of predictors in the model
$\varepsilon_i$	Random error for subject $i$, $\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)$

The word partial is critical. Each $\beta_j$ represents the estimated change in $E(Y)$ for a one-unit increase in $X_j$, holding all other predictors constant. This “holding constant” is the mathematical mechanism by which we adjust for confounding.

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

Interpretation: The negative slope (-0.332) means that for every additional hour of sleep per night, people report about 0.8 fewer mentally unhealthy days per month on average. The intercept (9.47) would be the predicted mental health days for someone with 0 hours of sleep, which is biologically impossible and thus not meaningful.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

Null and Alternative Hypotheses

\[H_0: \beta_1 = 0 \quad \text{(Sleep is not associated with mentally unhealthy days)}\] \[H_A: \beta_1 \neq 0 \quad \text{(Sleep is associated with mentally unhealthy days)}\] ### Test Results:

# Get confidence interval
ci_slr <- confint(model_slr)

# Extract test statistics
tidy_slr <- tidy(model_slr)

# Display results
tidy_slr %>%
  filter(term == "sleep_hrs") %>%
  mutate(
    `95% CI Lower` = ci_slr[2, 1],
    `95% CI Upper` = ci_slr[2, 2]
  ) %>%
  select(term, estimate, std.error, statistic, p.value, `95% CI Lower`, `95% CI Upper`) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Hypothesis Test for Sleep Coefficient",
        col.names = c("Term", "Estimate", "SE", "t-statistic", "p-value", "CI Lower", "CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Hypothesis Test for Sleep Coefficient
Term	Estimate	SE	t-statistic	p-value	CI Lower	CI Upper
sleep_hrs	-0.8042	0.0802	-10.0218	0	-0.9616	-0.6469

Test Statistic and p‑value From the output: - t-statistic: -10.02 - p-value: < 2.2e-16

Degrees of Freedom df = n-2 = 4998

Conclusion: Because the p-value is below 0.05, we reject the null hypothesis. There is strong statistical evidence that sleep hours are associated with mentally unhealthy days.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

# Model 1: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model 2: Add sleep
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model 3: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr)

##Examining Model A B and C Output

summary(mA)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

summary (mB)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.306 -3.980 -2.515 -0.307 32.360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.76563    0.64447  18.256  < 2e-16 ***
## sleep_hrs   -0.73391    0.07932  -9.253  < 2e-16 ***
## age         -0.06648    0.00622 -10.688  < 2e-16 ***
## sexFemale    1.53992    0.21339   7.217 6.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared:  0.05036,    Adjusted R-squared:  0.04979 
## F-statistic: 88.32 on 3 and 4996 DF,  p-value: < 2.2e-16

summary (mC)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

tidy(mC, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "physhlth_days" = "Physical health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "income_cat"    = "Income (ordinal 1-8)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")

Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707
Sex: Female (ref = Male)	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417
Physical health days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194
Exercise: Yes (ref = No)	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

b <- round(coef(mC), 3)
ci <- round(confint(mC), 3)

# Compare the sleep coefficient across models
tribble(
  ~Model, ~`Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
  "mA (unadjusted)", round(coef(mA)[2], 3),
    paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
    round(summary(mA)$adj.r.squared, 3),
  "mB (+sleep)",     round(coef(mB)[2], 3),
    paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
    round(summary(mB)$adj.r.squared, 3),
  "mC (full)",       round(coef(mC)[2], 3),
    paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
    round(summary(mC)$adj.r.squared, 3)
) %>%
  kable(caption = "Table 4. Sleep Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 4. Sleep Coefficient Across Sequential Models
Model	Sleep β	\|95% CI	Adj. R
mA (unadjusted)	-0.804	(-0.962, -0.647)	0.020
mB (+sleep)	-0.734	(-0.889, -0.578)	0.050
mC (full)	-0.509	(-0.657, -0.361)	0.156

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

####The Fitted Equation

\[\widehat{\text{Mental Health Days}} = 12.475 + -0.509(\text{Sleep}) + -0.082(\text{age}) + 1.245(\text{Female}) + 0.292(\text{Phys. Health Days}) + -0.321(\text{income_cat}) + -0.343(\text{Exercise: Yes})\] menthlth_days = 12.4755 - 0.5092sleep_hrs - 0.0823age + 1.2451sexFemale + 0.2917physhlth_days - 0.3213income_cat - 0.3427exerciseYes

5.3 Interpreting Each Term

**Intercept ($\hat{\beta}$ = 12.475) Only mathematical starting point. If a person slept 0 hours, was age 0, male, did not exercise, had zero physically unhealthy days, and was in the lowest income category, the model predicts about 12.5 mentally unhealthy days. Not very meaningful.

Sleep hours ($\hat{\beta}$ = -0.509): For each additional hour of sleep per night, people report about half a day (-0.5) fewer mentally unhealthy days per month, holding all other factors constant.

Age ($\hat{\beta}$ = -0.082): For every additional year of age, people report slightly fewer mentally unhealthy days.(-0.082) Younger adults report more mental distress than older adults, even after accounting for sleep, income, physical health, and exercise. The negative sign indicates a protective association.

Sex : Female ($\hat{\beta}$ = 1.245): Compared to males (the reference group), females report an estimated of 1.25 more mentally unhealthy days per month , after adjusting for all other variables.

Income category ($\hat{\beta}$ = 0.292): Each one-unit increase in the income category (on the 1–8 ordinal scale) is associated with 0.32 fewer mentally unhealthy days on average, consistent with the well-established socioeconomic gradient in mental health.

Physically Unhealthy days ($\hat{\beta}$ = -0.321): Each additional physically unhealthy day is associated with about 0.29 more mentally unhealthy days. People experiencing more physical health problems also tend to report more mental distress.

Exercise: Yes ($\hat{\beta}$ = -0.343): People who engaged in any physical activity in the past 30 days report an estimated 0.34 fewer mentally unhealthy days** compared to those who did not exercise, adjusting for all other covariates.

> Epidemiological insight: In Model C, mentally unhealthy days were predicted by sleep, age, sex, physical health, income, and exercise. The model suggests that sleep, age, sex, physical health, and income are meaningful predictors of mental health burden.

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

anova_mA <- anova (mA)
anova_mB <- anova (mB)
anova_mC <- anova(mC)
print(anova_mA)

## Analysis of Variance Table
## 
## Response: menthlth_days
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## sleep_hrs    1   5865  5864.8  100.44 < 2.2e-16 ***
## Residuals 4998 291847    58.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(anova_mB)

## Analysis of Variance Table
## 
## Response: menthlth_days
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## sleep_hrs    1   5865  5864.8 103.638 < 2.2e-16 ***
## age          1   6182  6182.2 109.249 < 2.2e-16 ***
## sex          1   2947  2947.1  52.079 6.131e-13 ***
## Residuals 4996 282717    56.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(anova_mC)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "mA: Sleep only",      1, round(summary(mA)$r.squared, 4), round(summary(mA)$adj.r.squared, 4),
  "mB: + Sleep + age + sex",                   2, round(summary(mB)$r.squared, 4), round(summary(mB)$adj.r.squared, 4),
  "mC: Full", 6,
    round(summary(mC)$r.squared, 4), round(summary(mC)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 7. R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
mA: Sleep only	1	0.0197	0.0195
mB: + Sleep + age + sex	2	0.0504	0.0498
mC: Full	6	0.1569	0.1559

Interpretation: Model C explains the most variance in mentally unhealthy days (about 15.7%). As expected, adding more meaningful predictors improves explanatory power.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

glance(mC) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary — Model C") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6. Overall Model Summary — Model C
Statistic	Value
R²	0.1569
Adjusted R²	0.1559
Residual Std. Error (Root MSE)	7.0901
F-statistic	154.8953
p-value (overall F-test)	0.0000
Model df (p)	6.0000
Residual df (n − p − 1)	4993.0000
n (observations)	5000.0000

rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model 3):", rmse_mC, "mentally unhealthy days\n")

## Root MSE (Model 3): 7.09 mentally unhealthy days

##Interpretation On average, the model’s predictions are off by about 7 mentally unhealthy days per month. Even with six predictors, the model still has substantial prediction error — people’s mental health days vary widely, and the model captures only part of that variability.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

# Get ANOVA table for Model C
anova_C <- anova(mC)
print(anova_C)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Calculate total SS and df manually
ss_model <- sum(anova_C$`Sum Sq`[1:6])  # Sum of all predictor SS
ss_residual <- anova_C$`Sum Sq`[7]      # Residual SS
ss_total <- ss_model + ss_residual

df_model <- sum(anova_C$Df[1:6])        # Sum of all predictor df
df_residual <- anova_C$Df[7]
df_total <- df_model + df_residual

# Calculate Mean Squares
ms_model <- ss_model / df_model
ms_residual <- ss_residual / df_residual

# Calculate F-statistic
f_statistic <- ms_model / ms_residual

# Get p-value
f_pvalue <- glance(mC)$p.value

# Create formatted table
tibble(
  Source = c("Model (Regression)", "Residual (Error)", "Total"),
  df = c(df_model, df_residual, df_total),
  SS = c(ss_model, ss_residual, ss_total),
  MS = c(ms_model, ms_residual, NA),
  F = c(f_statistic, NA, NA)
) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kable(caption = "Table 5. ANOVA Decomposition for Model C",
        align = c("l", "r", "r", "r", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#EBF5FB")

Table 5. ANOVA Decomposition for Model C
Source	df	SS	MS	F
Model (Regression)	6	46718.54	7786.42	154.9
Residual (Error)	4993	250992.80	50.27	NA
Total	4999	297711.34	NA	NA

Source	df	SS	MS	F
Model	6	46718.54	7786.42	154.9
Residual	4993	250992.80	50.27	NA
Total	4999	297711.34	NA	NA

State the null hypothesis for the overall F-test and your conclusion.

$$H_0: _j = 0

H_A: _j $$

(given all other predictors are in the model)

The null hypothesis states that none of the predictors in Model C improve prediction of mentally unhealthy days. The alternative says that at least one predictor contributes meaningful explanatory power.

The test statistic is:

\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\]

where $n$ is sample size and $p$ is the number of predictors. We compare this to a $t$-distribution with $n - p - 1$ degrees of freedom.

n=5000 observations
p=6 predictors
Degrees of freedom: 4993

A predictor may be significant alone (in a simple regression), but not significant in the multivariable model because its effect overlaps with other predictors (confounding or shared variance).

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(mC, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Standard Residual Diagnostic Plots

par(mfrow = c(1, 1))

How to read each plot:

Plot	What to Look For	Assumption Tested
Residuals vs. Fitted	Random scatter around 0; no pattern	Linearity + Homoscedasticity
Normal Q-Q	Points follow diagonal line	Normality of residuals
Scale-Location (√\|e\|)	Flat red line; constant spread	Homoscedasticity
Cook’s Distance / Residuals vs. Leverage	No points beyond 0.5 or 1.0 Cook’s D contour	Influential observations

augment(mC) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_point(alpha = 0.07, size = 0.8, color = "steelblue") +
  geom_smooth(method = "loess", color = "darkblue", se = FALSE) +
  labs(
    title = "Residuals vs. Fitted Values",
    subtitle = "Ideally: random scatter around zero, no systematic pattern",
    x = "Fitted Values",
    y = "Residuals (Y − Ŷ)"
  ) +
  theme_minimal(base_size = 13)

Residuals vs. Fitted (ggplot2)

##Interpretations

Residuals vs. Fitted We can see clear curvature rather than a flat horizontal band, Residuals are not centered evenly around 0 and variance increases slightly at higher fitted values. This means linearity is violated with the relationship between predictors and mental health days not being perfectly linear. Homoscedasticity is also questionable since residual spread is not constant.
Normal Q–Q Plot We can see strong deviations from the diagonal line, especially in both tails, Heavy right tail (large positive residuals). Residuals are not normally distributed. This is expected because the outcome is bounded (0–30) and right‑skewed.
Scale–Location Plot We can see that red line slopes upward, Spread of root of residuals increases with fitted values. Homoscedasticity is violated and Variability in mental health days increases for people with higher predicted values.
Cook’s Distance We can see few points (e.g., 830, 1439, 1532) stand out but none exceed the common cutoff of Cook’s D > 1. THis could mean that there is no highly influential observations and a few moderately influential points exist, but nothing that violates model stability.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

Because the outcome is bounded (0–30) and heavily right‑skewed, the following assumptions are most problematic: - Normality of residuals → violated - Homoscedasticity → violated - Linearity → little violated

However, it does not invalidate the analysis. With - n ≈ 5000, the linear model is good to non‑normality. Coefficient estimates remain unbiased. Standard errors may be slightly off, but the very large sample size stabilizes inference. The model is still useful for describing associations, which is the goal in most public‑health analyses.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

None of the observations exceed 1. Few points are moderately high but still below the threshold. In real analysis, they dont have to be removed. we should check for any data entry errors.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

##Answer: People reported more mentally unhealthy days when they also had more physically unhealthy days, were younger, slept less, had lower incomes, or were women. The strongest pattern we found was that people experiencing more physical health problems also tended to report more days of poor mental health, with smaller but meaningful differences linked to sleep, age, and income. Exercise showed only a very small difference in mental health days once the other factors were taken into account. Because these data were collected at one point in time, the results describe patterns in the population but cannot show whether any of these factors directly cause changes in mental health. —

End of Lab Activity

Symbol	Meaning
\(Y_i\)	Outcome for subject \(i\) (mentally unhealthy days)
\(\beta_0\)	Intercept — expected value of \(Y\) when all \(X\)s = 0
\(\beta_j\)	Partial regression coefficient for predictor \(j\)
\(X_{ji}\)	Value of predictor \(j\) for subject \(i\)
\(p\)	Number of predictors in the model
\(\varepsilon_i\)	Random error for subject \(i\), \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)\)

Multiple Linear Regression- Lab 07

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Suruchi Shahi

March 3, 5, and 10, 2026