Introduction

In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.

Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:

Adjust for confounding — estimate the association between an exposure and outcome after accounting for other covariates
Improve prediction — use multiple pieces of information to better forecast an outcome
Model effect modification — examine whether the effect of one variable changes across levels of another
Gain precision — by explaining additional variance in the outcome, we can reduce uncertainty in our estimates

In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

The BRFSS 2020 Dataset

We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 — a nationally representative telephone survey of U.S. adults conducted by the CDC. This dataset contains over 400,000 respondents with detailed information on health behaviors, chronic conditions, and social determinants of health.

Research question for today:

What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days in the past 30 days among U.S. adults?

This is a quintessential epidemiological question: we have a health outcome (mental health days) and want to understand its multivariable determinants while appropriately controlling for confounders.

brfss_full <- read_xpt(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_2020"
) %>%
  clean_names()

# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Descriptive Statistics

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

Note: The outcome is right-skewed — most respondents report zero mentally unhealthy days, with a long tail toward 30. This is common in count outcomes and is worth keeping in mind when we evaluate model assumptions later.

Part 1: Guided Practice — Multiple Linear Regression

1. From Simple to Multiple: The Big Picture

1.1 Extending the Model

Recall the SLR model:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Multiple linear regression simply adds more predictors:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i\]

Or equivalently, using summation notation:

\[Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ji} + \varepsilon_i\]

Symbol	Meaning
$Y_i$	Outcome for subject $i$ (mentally unhealthy days)
$\beta_0$	Intercept — expected value of $Y$ when all $X$s = 0
$\beta_j$	Partial regression coefficient for predictor $j$
$X_{ji}$	Value of predictor $j$ for subject $i$
$p$	Number of predictors in the model
$\varepsilon_i$	Random error for subject $i$, $\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)$

The word partial is critical. Each $\beta_j$ represents the estimated change in $E(Y)$ for a one-unit increase in $X_j$, holding all other predictors constant. This “holding constant” is the mathematical mechanism by which we adjust for confounding.

1.2 Three Goals of Regression in Epidemiology

There are three analytic goals. Understanding which goal is driving your analysis determines how you build and interpret your model:

Goal	Question	Example
Descriptive	How do the $X$s describe $Y$ in this sample?	What variables are correlated with mental health days in BRFSS 2020?
Predictive	Do the $X$s help predict $Y$ statistically?	Can we build a model to identify people likely to report many mentally unhealthy days?
Associative (Etiologic)	What is the effect of a specific $X$ on $Y$, controlling for confounders?	What is the effect of sleep on mental health days, after adjusting for age, income, and physical health?

In epidemiology, we are most often doing associative analysis — estimating causal or causal-adjacent effects after controlling for confounding.

1.3 Why MLR is More Challenging than SLR

Model selection is non-trivial — there may be many candidate predictors, and the “right” model depends on the scientific question.
You cannot visualize the relationship in $p$-dimensional space. We work with 2D projections (partial regression plots, predicted vs. observed plots).
Interpretation requires discipline — the “holding all else constant” framing must always accompany coefficient interpretation.
Coefficients are computed numerically — unlike SLR, there are no simple closed-form formulas we can compute by hand for $p > 2$.

2. Assumptions of Multiple Linear Regression

The assumptions are a direct extension of the SLR assumptions — often summarized with the LINE mnemonic:

	Assumption	In MLR
L	Linearity	$E(Y)$ is a linear function of the $\beta_j$s. Non-linear functions of $X$ (e.g., $X^2$, $\log X$) are allowed, as long as the model is linear in the coefficients.
I	Independence	Observations $Y_i$ are independent of each other.
N	Normality	Errors $\varepsilon_i \sim N(0, \sigma^2)$. Required for inference (t-tests, CIs). Not required for OLS estimation itself.
E	Equal variance	$\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$ (homoscedasticity).

Important clarifications:

The linearity assumption refers to linearity in the parameters ($\beta$s), not in the predictors. Including $X^2$ as a predictor doesn’t violate linearity.
Normality is only needed for valid inference. With large samples (like BRFSS), the Central Limit Theorem means mild violations cause minimal bias.
Homoscedasticity matters most for standard errors and inference. Mild departures are tolerable; severe heteroscedasticity requires correction (robust SEs, WLS).

# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Visualizing Key Predictors vs. Mental Health Days

3. Missing Data

Before fitting any model, we must address missing data. Let’s expand on it with our BRFSS context.

3.1 Why Missing Data Matters

Statistical software performs complete case analysis by default: observations with any missing value on any variable in the model are dropped. This has two consequences:

Loss of sample size — reduces power
Potential bias — if the missing cases differ systematically from observed cases

The direction and magnitude of bias depends on the mechanism of missingness.

3.2 The Three Mechanisms

Mechanism	Definition	Bias?	Example in BRFSS
MCAR — Missing Completely at Random	P(missing) is unrelated to observed or unobserved data	No	Lab equipment randomly fails; a beaker is dropped
MAR — Missing at Random	P(missing) depends on observed data, not on the missing value itself	Potentially, but adjustable	Men are more likely to skip the income question; once you account for sex, the missingness is independent of income
MNAR — Missing Not at Random	P(missing) depends on the unobserved value itself	Yes, difficult to correct	People with very low income refuse to report income because of the stigma

MCAR is rarely a safe assumption with human subjects data. In BRFSS, income non-response is strongly socioeconomically patterned — a classic MNAR concern.

3.3 Implications for Practice

For MCAR and MAR, multiple imputation (using the mice package in R) can recover unbiased estimates.
For MNAR, there is no general statistical fix. You must acknowledge the limitation and reason about the direction of potential bias.
In this course, we will use complete case analysis but will always note the amount and likely pattern of missingness.

# Assess missingness in our analytic variables before the sample was taken
brfss_full %>%
  select(menthlth, physhlth, sleptim1, age80, income2, sexvar, exerany2, bmi5) %>%
  summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
  mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
  arrange(desc(Pct_Missing_or_DK)) %>%
  kable(
    caption = "Table 2. Missing or Don't Know/Refused (%) — BRFSS 2020 Full Sample",
    col.names = c("Variable", "% Missing / DK / Refused")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2. Missing or Don’t Know/Refused (%) — BRFSS 2020 Full Sample
Variable	% Missing / DK / Refused
physhlth	71.5
menthlth	65.6
income2	19.9
bmi5	10.3
age80	1.4
sleptim1	1.2
sexvar	0.0
exerany2	0.0

4. Fitting the MLR Model in R

4.1 Model Specification

We will build our model in stages — a common and pedagogically powerful approach that mirrors what you would do in a real epidemiological analysis.

Model 1 — Unadjusted (SLR baseline): Mentally unhealthy days ~ Physical health days

Model 2 — Sleep added: + Sleep hours

Model 3 — Full model: + Age + Income + Sex + Exercise

This sequential approach lets us see how each block of variables changes the coefficients and model fit.

# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ physhlth_days, data = brfss_mlr)

# Model 2: Add sleep
m2 <- lm(menthlth_days ~ physhlth_days + sleep_hrs, data = brfss_mlr)

# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + income_cat + sex + exercise,
         data = brfss_mlr)

4.2 The `lm()` Formula Interface

In R, the lm() formula has the structure:

lm(outcome ~ predictor1 + predictor2 + predictor3, data = dataset)

The outcome is to the left of the ~
Predictors are on the right, separated by +
R automatically creates dummy variables for factors
The intercept is included by default (suppress with - 1 if needed)

4.3 Examining Model 3 Output

summary(m3)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + sleep_hrs + age + 
##     income_cat + sex + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

tidy(m3, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "physhlth_days" = "Physical health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "income_cat"    = "Income (ordinal 1-8)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")  # highlight key predictors

Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate (β̂	Std. Erro
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810
Physical health days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194
Sex: Female (ref = Male)	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417
Exercise: Yes (ref = No)	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536

5. Interpreting Coefficients

This is the most important and most commonly mis-stated skill in applied regression. Let’s do it carefully.

5.1 The General Rule

For predictor $X_j$: $\hat{\beta}_j$ is the estimated change in the mean of $Y$ for a one-unit increase in $X_j$, holding all other predictors in the model constant.

The phrase “holding all other predictors constant” (also called ceteris paribus or partial effect) is what distinguishes MLR from SLR. It is the mathematical expression of confounder adjustment.

b <- round(coef(m3), 3)
ci <- round(confint(m3), 3)

5.2 The Fitted Equation

\[\widehat{\text{Mental Health Days}} = 12.475 + 0.292(\text{Phys. Health Days}) + -0.509(\text{Sleep}) + -0.082(\text{Age}) + -0.321(\text{Income}) + 1.245(\text{Female}) + -0.343(\text{Exercise: Yes})\]

5.3 Interpreting Each Term

Physical health days ($\hat{\beta}$ = 0.292): Each additional day of poor physical health is associated with an estimated 0.292 additional mentally unhealthy day on average, holding sleep, age, income, sex, and exercise constant (95% CI: 0.265 to 0.318). This is the strongest predictor in the model and is consistent with the bidirectional mind-body connection documented in the literature.

Sleep hours ($\hat{\beta}$ = -0.509): Each additional hour of sleep per night is associated with an estimated 0.509 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: -0.657 to -0.361). The negative sign indicates a protective association.

Age ($\hat{\beta}$ = -0.082): Each additional year of age is associated with 0.082 fewer mentally unhealthy days on average (holding all else constant). This counterintuitive finding is well-documented — older adults often report fewer mental health difficulties, possibly due to better emotion regulation, survivor bias, or cohort effects.

Income category ($\hat{\beta}$ = -0.321): Each one-unit increase in the income category (on the 1–8 ordinal scale) is associated with 0.321 fewer mentally unhealthy days on average, consistent with the well-established socioeconomic gradient in mental health.

Sex: Female ($\hat{\beta}$ = 1.245): Compared to males (the reference group), females report an estimated 1.245 more mentally unhealthy days on average, holding all other variables constant.

Exercise: Yes ($\hat{\beta}$ = -0.343): People who engaged in any physical activity in the past 30 days report an estimated 0.343 fewer mentally unhealthy days compared to those who did not exercise, adjusting for all other covariates.

Intercept ($\hat{\beta}_0$ = 12.475): The estimated mean mental health days for a person with zero physically unhealthy days, zero sleep hours, age = 0, income category = 0, who is male and does not exercise. This is a mathematical artifact — not a meaningful quantity in context.

Epidemiological insight: Notice that the physical health coefficient changes when we add other covariates. Compare $\hat{\beta}$ from Model 1 vs. Model 3. This change represents the confounding that age, income, sleep, sex, and exercise collectively introduce into the crude physical health–mental health association.

# Compare the physhlth_days coefficient across models
tribble(
  ~Model, ~`Physical Health Days β̂`, ~`95% CI`, ~`Adj. R²`,
  "M1 (unadjusted)", round(coef(m1)[2], 3),
    paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
    round(summary(m1)$adj.r.squared, 3),
  "M2 (+sleep)",     round(coef(m2)[2], 3),
    paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
    round(summary(m2)$adj.r.squared, 3),
  "M3 (full)",       round(coef(m3)[2], 3),
    paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
    round(summary(m3)$adj.r.squared, 3)
) %>%
  kable(caption = "Table 4. Physical Health Days Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 4. Physical Health Days Coefficient Across Sequential Models
Model	Physical Health Days β	\|95% CI	Adj. R
M1 (unadjusted)	0.312	(0.286, 0.338)	0.099
M2 (+sleep)	0.300	(0.274, 0.326)	0.110
M3 (full)	0.292	(0.265, 0.318)	0.156

6. Hypothesis Testing for Individual Coefficients

6.1 The t-Test for Each $\hat{\beta}_j$

For each predictor $X_j$, we test:

\[H_0: \beta_j = 0 \quad \text{vs.} \quad H_A: \beta_j \neq 0\]

(given all other predictors are in the model)

The test statistic is:

\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\]

where $n$ is sample size and $p$ is the number of predictors. We compare this to a $t$-distribution with $n - p - 1$ degrees of freedom.

Critically: This tests whether $X_j$ has a statistically significant partial association with $Y$, given the other predictors already in the model. A predictor might be significant alone (unadjusted) but non-significant in the multivariable model due to overlap with other predictors (multicollinearity or confounding).

6.2 Interpreting the p-Values

tidy(m3, conf.int = TRUE) %>%
  mutate(
    Significance = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ "ns"
    ),
    term = dplyr::recode(term,
      "(Intercept)"    = "Intercept",
      "physhlth_days"  = "Physical health days",
      "sleep_hrs"      = "Sleep (hours/night)",
      "age"            = "Age (years)",
      "income_cat"     = "Income (ordinal 1-8)",
      "sexFemale"      = "Sex: Female",
      "exerciseYes"    = "Exercise: Yes"
    )
  ) %>%
  select(term, estimate, std.error, statistic, p.value, conf.low, conf.high, Significance) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 5. Hypothesis Tests for Each Partial Coefficient (Model 3)",
    col.names = c("Predictor", "β̂", "SE", "t", "p-value", "CI Lower", "CI Upper", "Sig.")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5. Hypothesis Tests for Each Partial Coefficient (Model 3)
Predictor	β	S
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810	***
Physical health days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183	***
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614	***
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707	***
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194	***
Sex: Female	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417	***
Exercise: Yes	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536	ns

tidy(m3, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = dplyr::recode(term,
      "physhlth_days" = "Physical health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "income_cat"    = "Income (1-8)",
      "sexFemale"     = "Sex: Female",
      "exerciseYes"   = "Exercise: Yes"
    ),
    term = fct_reorder(term, estimate),
    sig  = ifelse(p.value < 0.05, "Significant (p < 0.05)", "Non-significant")
  ) %>%
  ggplot(aes(x = estimate, y = term, color = sig)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray60") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 0.9) +
  geom_point(size = 3.5) +
  scale_color_manual(values = c("Significant (p < 0.05)" = "steelblue",
                                "Non-significant" = "tomato")) +
  labs(
    title    = "Partial Regression Coefficients with 95% Confidence Intervals",
    subtitle = "Outcome: Mentally Unhealthy Days (BRFSS 2020, n = 5,000)",
    x        = "Estimated Change in Mental Health Days (β̂)",
    y        = NULL,
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Coefficient Plot: Partial Associations with Mental Health Days

7. The Overall F-Test and ANOVA Decomposition

7.1 Partitioning Variability

Just as in SLR, MLR partitions total variability in $Y$:

\[\underbrace{SSY}_{\text{Total}} = \underbrace{SSR}_{\text{Model explains}} + \underbrace{SSE}_{\text{Model doesn't explain}}\]

Source	Sum of Squares	df	Mean Square	F
Model (Regression)	$SSR$	$p$	$MSR = SSR/p$	$F = MSR/MSE$
Error (Residual)	$SSE$	$n-p-1$	$MSE = SSE/(n-p-1)$
Total	$SSY$	$n-1$

With $p$ predictors, the model uses $p$ degrees of freedom (one for each $\hat{\beta}_j$), leaving $n - p - 1$ residual degrees of freedom.

7.2 The Overall F-Test

The omnibus F-test asks:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\] \[H_A: \text{At least one } \beta_j \neq 0\]

\[F = \frac{MSR}{MSE} \sim F_{p,\; n-p-1} \quad \text{under } H_0\]

A large F-value (small p-value) means the predictors collectively explain more variation than expected by chance.

anova_m3 <- anova(m3)
print(anova_m3)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## physhlth_days    1  29475 29474.7 586.3411 < 2.2e-16 ***
## sleep_hrs        1   3323  3323.0  66.1052 5.345e-16 ***
## age              1   9261  9261.5 184.2385 < 2.2e-16 ***
## income_cat       1   2649  2648.8  52.6918 4.503e-13 ***
## sex              1   1918  1918.4  38.1626 7.025e-10 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

glance(m3) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6. Overall Model Summary — Model 3
Statistic	Value
R²	0.1569
Adjusted R²	0.1559
Residual Std. Error (Root MSE)	7.0901
F-statistic	154.8953
p-value (overall F-test)	0.0000
Model df (p)	6.0000
Residual df (n − p − 1)	4993.0000
n (observations)	5000.0000

8. Coefficient of Determination: $R^2$ and Adjusted $R^2$

8.1 $R^2$

\[R^2 = \frac{SSR}{SSY} = 1 - \frac{SSE}{SSY}\]

$R^2$ is the proportion of total variance in $Y$ explained by the model. It ranges from 0 to 1.

Critical property: $R^2$ always increases (or stays the same) when you add a predictor, even if that predictor is random noise. It cannot be used to compare models with different numbers of predictors.

8.2 Adjusted $R^2$

\[R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]

Adjusted $R^2$ penalizes for the number of predictors $p$. It only increases when a new predictor improves the model by more than chance alone.

$R^2_{adj} \leq R^2$ always
$R^2_{adj}$ can decrease if you add a useless predictor
Use $R^2_{adj}$ when comparing models with different numbers of predictors

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "M1: Physical health only",      1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
  "M2: + Sleep",                   2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
  "M3: Full (+ age, income, sex, exercise)", 6,
    round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 7. R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
M1: Physical health only	1	0.0990	0.0988
M2: + Sleep	2	0.1102	0.1098
M3: Full (+ age, income, sex, exercise)	6	0.1569	0.1559

Interpretation: R² for mental health outcomes in population surveys is typically low (5–20%), and that’s expected. Mental health is shaped by hundreds of unmeasured factors — genetics, life events, social support, trauma history. A low R² does not mean the model is “wrong” or “useless” for estimating associations. Statistical significance (the t-tests) and the magnitude and direction of coefficients matter more for etiologic questions.

8.3 Root MSE

\[RMSE = \sqrt{MSE} = s_{Y|X}\]

The Root MSE (also called the Residual Standard Error in R output) is the estimated standard deviation of the errors. It tells you, on average, how many days the model’s prediction is off by.

rmse_m3 <- round(summary(m3)$sigma, 2)
cat("Root MSE (Model 3):", rmse_m3, "mentally unhealthy days\n")

## Root MSE (Model 3): 7.09 mentally unhealthy days

9. Visualizing Predicted Values and Effects

9.1 Predicted vs. Observed

augment(m3) %>%
  ggplot(aes(x = .fitted, y = menthlth_days)) +
  geom_point(alpha = 0.08, size = 0.8, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1, linetype = "dashed") +
  geom_smooth(method = "loess", color = "darkblue", linewidth = 1, se = FALSE) +
  labs(
    title    = "Predicted vs. Observed Mentally Unhealthy Days",
    subtitle = "Dashed = perfect prediction; Blue = LOESS smoother",
    x        = "Fitted (Predicted) Values (Ŷ)",
    y        = "Observed Mentally Unhealthy Days (Y)"
  ) +
  theme_minimal(base_size = 13)

Predicted vs. Observed Mental Health Days (Model 3)

9.2 Marginal Effects Plots (`ggeffects`)

The ggeffects package computes adjusted predictions — the predicted mean of $Y$ across values of one predictor, with all other predictors held at their means (or reference categories). This is a powerful visualization of what MLR is actually estimating.

ggpredict(m3, terms = "physhlth_days") %>%
  plot() +
  labs(
    title    = "Marginal Effect of Physical Health Days on Mentally Unhealthy Days",
    subtitle = "Other covariates held at their means/modes",
    x        = "Number of Physically Unhealthy Days (past 30)",
    y        = "Predicted Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Adjusted Predictions: Physical Health Days → Mental Health Days

ggpredict(m3, terms = "sleep_hrs") %>%
  plot() +
  labs(
    title    = "Marginal Effect of Sleep on Mentally Unhealthy Days",
    subtitle = "Other covariates held at their means/modes",
    x        = "Sleep Hours per Night",
    y        = "Predicted Mentally Unhealthy Days"
  ) +
  theme_minimal(base_size = 13)

Adjusted Predictions: Sleep → Mental Health Days

ggpredict(m3, terms = c("exercise", "sex")) %>%
  plot() +
  labs(
    title    = "Predicted Mental Health Days by Exercise Status and Sex",
    subtitle = "Adjusted for physical health, sleep, age, income",
    x        = "Any Physical Activity in Past 30 Days",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex", fill = "Sex"
  ) +
  theme_minimal(base_size = 13)

Adjusted Predictions by Sex and Exercise

10. Residual Diagnostics

We assess the LINE assumptions through residual plots. Use plot(model) in base R or augment() + ggplot2.

par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Standard Residual Diagnostic Plots

par(mfrow = c(1, 1))

How to read each plot:

Plot	What to Look For	Assumption Tested
Residuals vs. Fitted	Random scatter around 0; no pattern	Linearity + Homoscedasticity
Normal Q-Q	Points follow diagonal line	Normality of residuals
Scale-Location (√\|e\|)	Flat red line; constant spread	Homoscedasticity
Cook’s Distance / Residuals vs. Leverage	No points beyond 0.5 or 1.0 Cook’s D contour	Influential observations

augment(m3) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_point(alpha = 0.07, size = 0.8, color = "steelblue") +
  geom_smooth(method = "loess", color = "darkblue", se = FALSE) +
  labs(
    title = "Residuals vs. Fitted Values",
    subtitle = "Ideally: random scatter around zero, no systematic pattern",
    x = "Fitted Values",
    y = "Residuals (Y − Ŷ)"
  ) +
  theme_minimal(base_size = 13)

$Residuals vs. Fitted (ggplot2)$

Residuals vs. Fitted (ggplot2)

Note on our outcome: Because mental health days is bounded at 0 and 30 and is heavily right-skewed, perfect normality and homoscedasticity are not achievable. Our large sample size (n = 5,000) provides robustness. In practice, you might consider a log transformation or Poisson/negative binomial regression as alternatives — topics we’ll cover later in the course.

11. Main Effects and Interaction Terms

11.1 Higher-Order Terms

The MLR framework accommodates higher-order terms (polynomials, interactions) while remaining “linear in the parameters.” Some key rules:

Quadratic terms: If you include $X^2$, you must also include $X$ itself.
Interaction terms: If you include $X_1 \times X_2$, you must include both $X_1$ and $X_2$ as main effects.

These are called the hierarchical principle or marginality rule in regression modeling.

11.2 Testing for an Interaction

Does the effect of sleep on mental health differ between males and females?

# Model with sleep × sex interaction
m4_interact <- lm(menthlth_days ~ physhlth_days + sleep_hrs * sex + age + income_cat + exercise,
                  data = brfss_mlr)

tidy(m4_interact, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 8. Model with Sleep × Sex Interaction",
    col.names = c("Term", "β̂", "SE", "t", "p-value", "CI Low", "CI High")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 8. Model with Sleep × Sex Interaction
Term	β	S
(Intercept)	12.3812	0.9203	13.4537	0.0000	10.5770	14.1853
physhlth_days	0.2917	0.0136	21.4763	0.0000	0.2651	0.3183
sleep_hrs	-0.4960	0.1101	-4.5072	0.0000	-0.7118	-0.2803
sexFemale	1.4178	1.0757	1.3180	0.1876	-0.6911	3.5267
age	-0.0823	0.0059	-13.8720	0.0000	-0.0940	-0.0707
income_cat	-0.3210	0.0521	-6.1673	0.0000	-0.4231	-0.2190
exerciseYes	-0.3422	0.2532	-1.3518	0.1765	-0.8386	0.1541
sleep_hrs:sexFemale	-0.0244	0.1495	-0.1635	0.8701	-0.3174	0.2686

ggpredict(m4_interact, terms = c("sleep_hrs", "sex")) %>%
  plot() +
  labs(
    title    = "Predicted Mental Health Days by Sleep Hours and Sex",
    subtitle = "Interaction model: slopes are allowed to differ by sex",
    x        = "Sleep Hours per Night",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex", fill = "Sex"
  ) +
  theme_minimal(base_size = 13)

Interaction: Effect of Sleep on Mental Health by Sex

# Likelihood ratio / F-test comparing models with and without interaction
anova(m3, m4_interact) %>%
  kable(caption = "Table 9. F-Test Comparing Models With and Without Interaction") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 9. F-Test Comparing Models With and Without Interaction
Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
4993	250992.8	NA	NA	NA	NA
4992	250991.5	1	1.344174	0.0267344	0.8701261

Interpretation: If the interaction term is statistically significant, the slopes (effect of sleep) differ between males and females. If not significant, a model with only main effects is preferred (parsimony). We’ll cover interaction/effect modification formally in a dedicated lecture.

12. The Multivariate Normal Distribution (Conceptual Bridge)

Just as SLR is linked to the bivariate normal distribution, MLR is linked to the multivariate normal distribution. When all continuous variables follow a multivariate normal distribution, the conditional distribution of $Y$ given any combination of $X$s is:

\[Y \mid X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \; \sigma^2_{Y|X})\]

This is the population model that our sample estimates are trying to recover. The bivariate normal from the correlation lecture is simply a special case with $p = 1$.

13. Summary of Key Takeaways

Concept	Key Point
MLR formula	$Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ji} + \varepsilon_i$
Partial coefficient	$\hat{\beta}_j$ = change in $E(Y)$ per 1-unit increase in $X_j$, holding all else constant
Confounder adjustment	Achieved mathematically by including confounders as additional predictors
Assumptions	LINE: Linearity, Independence, Normality, Equal variance
Missing data	MCAR, MAR, MNAR — mechanism determines bias risk
F-test	Tests $H_0: \beta_1 = \cdots = \beta_p = 0$ (omnibus test)
$R^2$ vs. Adj. $R^2$	Use Adj. $R^2$ to compare models of different sizes
Root MSE	Average prediction error in outcome units

Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026

Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)

brfss_mlr <- readRDS(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_mlr_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

Descriptive Statistics

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

The histogram appears to right-skewed. Since the data are count outcomes and zero is a common reported answer and can extend all the way to 30. This suggests that the residuals may not be normally distributed and the variance may not be constant and is worth keeping track of as we continue with the models later.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Overall, none of the correlations are strong. Physical health days and mental health days shows a moderate positive correlation of 0.315. Sleep in hours and mental health days demonstrate a weak negative correlation of -0.140. Sleep in hours and physical health days show a weak negative correlation of -0.112. Age and mental health days demonstrate a weak negative correlation of -0.156. Age and physical health days show a weak positive correlation of 0.093. Finally, age and sleep in hours demonstrates a weak positive correlation of 0.089.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

# Model A: Unadjusted mental health and sleep hours
modela <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

Write the fitted regression equation in the form

$\widehat{Mental Health Days} = 9.4743 + (-.8042) \times\ Hours of Sleep$.

tidy(modela, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "sleep_hrs"     = "Sleep (hours/night)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table 2. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE)

Table 2. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep (BRFSS 2020, n = 5,000)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
Intercept	9.4743	0.5771	16.4165	0	8.3429	10.6057
Sleep (hours/night)	-0.8042	0.0802	-10.0218	0	-0.9616	-0.6469

# Anova to see degrees of freedom 
anova_modela <- anova(modela)

anova_modela %>% 
  kable(
    caption = "ANOVA Table: Phys_days ~ BMI", 
    digits = 3, 
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Phys_days ~ BMI
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
sleep_hrs	1	5864.765	5864.765	100.437	0
Residuals	4998	291846.575	58.393	NA	NA

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

Each additional hour of sleep per night (𝛽̂ = -0.8042) is associated with an estimated 0.8042 fewer mentally unhealthy days on average (95% CI: -0.9616 to -0.6469).

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

Null Hypothesis (Ho): No association between mental health days and sleep hours Alternative Hypothesis (H1): There is an association between mental health days and sleep hours. The t-statistic is -10.0218 and the p-value is < .001 suggesting there is enough evidence to reject the null hypothesis and consider the evidence that there is an association between mental health and sleep hours. The degress of freedom for this test is one in the numerator and 4998 in the denominator.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

# Model A: Unadjusted
modela <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model B: Add sleep
modelb <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model C: Full multivariable model
modelc <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr)

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

tidy(modelb, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "sexFemale"     = "Sex: Female (ref = Male)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep, Age, Sex (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE)

Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep, Age, Sex (BRFSS 2020, n = 5,000)
Term	Estimate	Std. Error	t-statistic	95% CI Lower	95% CI Upper
Intercept	11.7656	0.6445	18.2563	10.5022	13.0291
Sleep (hours/night)	-0.7339	0.0793	-9.2528	-0.8894	-0.5784
Age (years)	-0.0665	0.0062	-10.6877	-0.0787	-0.0543
Sex: Female (ref = Male)	1.5399	0.2134	7.2166	1.1216	1.9583

tidy(modelc, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "physhlth_days" = "Physical health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "income_cat"    = "Income (ordinal 1-8)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table 4. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE)

Table 4. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707
Sex: Female (ref = Male)	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417
Physical health days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194
Exercise: Yes (ref = No)	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536

# Compare the sleep_hrs coefficient across models
tribble(
  ~Model, ~`sleep_hrs β̂`, ~`95% CI`, ~`Adj. R²`,
  "Model A (unadjusted sleep)", round(coef(modela)[2], 3),
    paste0("(", round(confint(modela)[2,1],3), ", ", round(confint(modela)[2,2],3), ")"),
    round(summary(modela)$adj.r.squared, 3),
  "Model B (+ age, sex)",     round(coef(modelb)[2], 3),
    paste0("(", round(confint(modelb)[2,1],3), ", ", round(confint(modelb)[2,2],3), ")"),
    round(summary(modelb)$adj.r.squared, 3),
  "Model C (full)",       round(coef(modelc)[2], 3),
    paste0("(", round(confint(modelc)[2,1],3), ", ", round(confint(modelc)[2,2],3), ")"),
    round(summary(modelc)$adj.r.squared, 3)
) %>%
  kable(caption = "Table 5. Sleep Hours Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 5. Sleep Hours Coefficient Across Sequential Models
Model	sleep_hrs β	\|95% CI	Adj. R
Model A (unadjusted sleep)	-0.804	(-0.962, -0.647)	0.020
Model B (+ age, sex)	-0.734	(-0.889, -0.578)	0.050
Model C (full)	-0.509	(-0.657, -0.361)	0.156

The sleep coefficient does change when we add more covariates. In Model A with only sleep, the coefficient is -0.804. When we added age and sex as covariates the coefficient gets higher -0.734. Finally when we have the full model, the coefficient is its highest at -0.509. This suggests that Model A was overestimating the association due to confounding because when those other covariates are controlled for such as in the full Model C, the coefficient is closer to 0.

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

$\widehat{Mental Health Days} = 12.4755 + -0.5092(Sleep) + -0.0823(Age) + 1.2451(Female) + 0.2917(Physical Health Days) + -0.3213(Income) + -0.3427(Exercise:Yes)$.

Sleep hours (𝛽̂ = -0.5092): Each additional hour of sleep per night is associated with an estimated 0.5092 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: -0.6569 to -0.3614). The negative sign suggests that sleep hours could cause less bad mental health days.

Age (𝛽̂ = -0.0823): Each additional year of age is associated with 0.082 fewer mentally unhealthy days on average (holding all else constant). This suggests that as individuals get older, they possibly experience less mentally unhealthy days on average.

Income category (𝛽̂ = -0.3213): Each one-unit increase in the income category (on the 1–8 ordinal scale) is associated with 0.321 fewer mentally unhealthy days on average. This findings seems correct when considering that individuals who are more financially comfortable may not experience as many mental health days.

Sex: Female (𝛽̂ = 1.2451): Compared to males (the reference group), females report an estimated 1.245 more mentally unhealthy days on average, holding all other variables constant.

Exercise: Yes (𝛽̂ = -0.3427): People who engaged in any physical activity in the past 30 days report an estimated 0.343 fewer mentally unhealthy days compared to those who did not exercise, when adjusting for all other covariates.

Intercept (𝛽̂ 0 = 12.4755): The estimated mean mental health days for a person with zero physically unhealthy days, zero sleep hours, age = 0, income category = 0, who is male and does not exercise. This is not a meaningful quantity in the current context because these qualifiers are nearly impossible to meet. —

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "Model A: Sleep Hours Only",      1, round(summary(modela)$r.squared, 4), round(summary(modela)$adj.r.squared, 4),
  "Model B: + Age + Sex",                   2, round(summary(modelb)$r.squared, 4), round(summary(modelb)$adj.r.squared, 4),
  "Model C: Full (+ physical health, income, exercise)", 6,
    round(summary(modelc)$r.squared, 4), round(summary(modelc)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 6. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 6. R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
Model A: Sleep Hours Only	1	0.0197	0.0195
Model B: + Age + Sex	2	0.0504	0.0498
Model C: Full (+ physical health, income, exercise)	6	0.1569	0.1559

Model C explains the most variance in mental health days with a R² of 0.1569 and an adjusted R² of 0.1559 whereas Model B had a R² of 0.0504 and adjusted R² of 0.0498 and finally Model A had an R² of 0.0197 and an adjusted R² of 0.0195.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

rmse_modelc <- round(summary(modelc)$sigma, 2)
cat("Root MSE (Model C):", rmse_modelc, "mentally unhealthy days\n")

## Root MSE (Model C): 7.09 mentally unhealthy days

The root MSE for Model C is 7.09. This suggests in practical terms that Model C’s prediction is off, on average, by 7.09 mentally unhealthy days.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

# (a) Display the ANOVA table
anova_modelc <- anova(modelc)

anova_modelc %>% 
  kable(
    caption = "ANOVA Table: Model C ", 
    digits = 3, 
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Model C
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
sleep_hrs	1	5864.765	5864.765	116.668	0.000
age	1	6182.237	6182.237	122.983	0.000
sex	1	2947.099	2947.099	58.627	0.000
physhlth_days	1	29455.513	29455.513	585.959	0.000
income_cat	1	2176.799	2176.799	43.303	0.000
exercise	1	92.124	92.124	1.833	0.176
Residuals	4993	250992.804	50.269	NA	NA

g <- glance(modelc)

# Extract values
k  <- g$df
df_res <- g$df.residual
sigma <- g$sigma
R2 <- g$r.squared
F_val <- g$statistic
p_val <- g$p.value

# Compute components
MSE <- sigma^2
SSE <- MSE * df_res
SST <- SSE / (1 - R2)
SSR <- SST - SSE
MSR <- SSR / k
F_manual <- MSR / MSE

# Print ANOVA table
anova_table <- data.frame(
  Source = c("Regression", "Residuals", "Total"),
  DF = c(k, df_res, k + df_res),
  SS = c(SSR, SSE, SST),
  MS = c(MSR, MSE, NA),
  F = c(F_manual, NA, NA)
)

anova_table

##       Source   DF        SS         MS        F
## 1 Regression    6  46718.54 7786.42287 154.8953
## 2  Residuals 4993 250992.80   50.26894       NA
## 3      Total 4999 297711.34         NA       NA

# View model summary to see the F-Test results 
summary(modelc)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

Source	df	SS	MS	F
Model	6	46718.54	7786.42	154.90
Residual	4993	250992.80	50.27	NA
Total	4999	297711.34	NA	NA

State the null hypothesis for the overall F-test and your conclusion.

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\] \[H_A: \text{At least one } \beta_j \neq 0\] The F-statistic for Model C is 154.9 on 6 and 4993 degrees of freedom with a p-value < .001. This suggests that there is significant evidence to reject the null hypothesis and to consider the evidence that at least one predictor has an association and the high f-test and low p-value suggest this is unlikely to chance.

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(modelc, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Standard Residual Diagnostic Plots

par(mfrow = c(1, 1))

Residuals vs. Fitted: Relative random scatter around 0 for the first half of the graph with a slight negative slope that plateaus toward the second half of the graph. This may suggest issues with homoscedasticity.

Q-Q Residuals: Demonstrates a curve resembling an S instead of following the reference line. This suggests that the residuals are not normally distributed.

Scale-Location: Should depict a flat red line, with constant spread, instead we see a positive sloped line suggesting issues with homoscedasticity.

Cook’s Distance: There are no points beyond 0.5 or 1.0 of Cook’s D contour suggesting no influential observations.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

Equal variance and normal distribution of residuals are most likely violated. Independence may also be violated because some BRFSS collections are from the same family. This does not necessary invalidate the analysis because we have such a large sample size allowing central limit theorem to state that mild violations will cause minimal bias. Mild departures from variance are acceptable.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

augmented <- augment(modelc) %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(modelc),
    influential = ifelse(cooks_d > 1, "Influential (Cook's D > 1)", "Not influential")
  )

# Count how many observations exceed Cook's D > 1
n_influential <- sum(augmented$cooks_d > 1)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 1, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Influential (Cook's D > 1)" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = Cook's D > 1 | ",
                      n_influential, " influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

There are no observations identified by Cook’s Distance. If there were an outliers, we could center them, utilize imputation, or cap them at 1.5 IQR.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

We created a model with six factors and performed a multiple linear regression. Sleep (hours at night), age(years), sex(female), physical health days, and income were all associated with mental health days. Sex and sleep were both important associations in which females compared to men were associated with 1.245 more mentally unhealthy days on average, holding all other variables constant and each additional hours of sleep was associated with a suggested 0.509 fewer mentally unhealthy days on average adjusting for all other factors. It is important to note that these associations are not causal because BRFSS is a cross-sectional survey and can not establish temporal relationships only associations.

End of Lab Activity

Symbol	Meaning
\(Y_i\)	Outcome for subject \(i\) (mentally unhealthy days)
\(\beta_0\)	Intercept — expected value of \(Y\) when all \(X\)s = 0
\(\beta_j\)	Partial regression coefficient for predictor \(j\)
\(X_{ji}\)	Value of predictor \(j\) for subject \(i\)
\(p\)	Number of predictors in the model
\(\varepsilon_i\)	Random error for subject \(i\), \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)\)

Source	Sum of Squares	df	Mean Square	F
Model (Regression)	\(SSR\)	\(p\)	\(MSR = SSR/p\)	\(F = MSR/MSE\)
Error (Residual)	\(SSE\)	\(n-p-1\)	\(MSE = SSE/(n-p-1)\)
Total	\(SSY\)	\(n-1\)

Goal	Question	Example
Descriptive	How do the \(X\)s describe \(Y\) in this sample?	What variables are correlated with mental health days in BRFSS 2020?
Predictive	Do the \(X\)s help predict \(Y\) statistically?	Can we build a model to identify people likely to report many mentally unhealthy days?
Associative (Etiologic)	What is the effect of a specific \(X\) on \(Y\), controlling for confounders?	What is the effect of sleep on mental health days, after adjusting for age, income, and physical health?

	Assumption	In MLR
L	Linearity	\(E(Y)\) is a linear function of the \(\beta_j\)s. Non-linear functions of \(X\) (e.g., \(X^2\), \(\log X\)) are allowed, as long as the model is linear in the coefficients.
I	Independence	Observations \(Y_i\) are independent of each other.
N	Normality	Errors \(\varepsilon_i \sim N(0, \sigma^2)\). Required for inference (t-tests, CIs). Not required for OLS estimation itself.
E	Equal variance	\(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\) (homoscedasticity).

Concept	Key Point
MLR formula	\(Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ji} + \varepsilon_i\)
Partial coefficient	\(\hat{\beta}_j\) = change in \(E(Y)\) per 1-unit increase in \(X_j\), holding all else constant
Confounder adjustment	Achieved mathematically by including confounders as additional predictors
Assumptions	LINE: Linearity, Independence, Normality, Equal variance
Missing data	MCAR, MAR, MNAR — mechanism determines bias risk
F-test	Tests \(H_0: \beta_1 = \cdots = \beta_p = 0\) (omnibus test)
\(R^2\) vs. Adj. \(R^2\)	Use Adj. \(R^2\) to compare models of different sizes
Root MSE	Average prediction error in outcome units

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Zoya Hayes

March 3, 5, and 10, 2026

Introduction

Setup and Data

The BRFSS 2020 Dataset

Descriptive Statistics

Part 1: Guided Practice — Multiple Linear Regression

1. From Simple to Multiple: The Big Picture

1.1 Extending the Model

1.2 Three Goals of Regression in Epidemiology

1.3 Why MLR is More Challenging than SLR

2. Assumptions of Multiple Linear Regression

3. Missing Data

3.1 Why Missing Data Matters

3.2 The Three Mechanisms

3.3 Implications for Practice

4. Fitting the MLR Model in R

4.1 Model Specification

4.2 The lm() Formula Interface

4.3 Examining Model 3 Output

5. Interpreting Coefficients

5.1 The General Rule

5.2 The Fitted Equation

5.3 Interpreting Each Term

6. Hypothesis Testing for Individual Coefficients

6.1 The t-Test for Each \(\hat{\beta}_j\)

6.2 Interpreting the p-Values

7. The Overall F-Test and ANOVA Decomposition

7.1 Partitioning Variability

7.2 The Overall F-Test

8. Coefficient of Determination: \(R^2\) and Adjusted \(R^2\)

8.1 \(R^2\)

8.2 Adjusted \(R^2\)

8.3 Root MSE

9. Visualizing Predicted Values and Effects

9.1 Predicted vs. Observed

9.2 Marginal Effects Plots (ggeffects)

10. Residual Diagnostics

11. Main Effects and Interaction Terms

11.1 Higher-Order Terms

11.2 Testing for an Interaction

12. The Multivariate Normal Distribution (Conceptual Bridge)

13. Summary of Key Takeaways

Part 2: In-Class Lab Activity

Instructions

Data for the Lab

Task 1: Exploratory Data Analysis (15 points)

Descriptive Statistics

Task 2: Unadjusted (Simple) Linear Regression (15 points)

Task 3: Building the Multivariable Model (25 points)

Task 4: Model Fit and ANOVA (20 points)

Task 5: Residual Diagnostics (15 points)

Task 6: Interpretation for a Public Health Audience (10 points)

4.2 The `lm()` Formula Interface

9.2 Marginal Effects Plots (`ggeffects`)