Introduction
In the previous lectures, we used Simple Linear Regression (SLR) to
model a continuous outcome as a function of a single
predictor. But real-world health phenomena are never driven by one
variable alone. A person’s mental health is shaped by how much they
sleep, how physically ill they are, how old they are, their income,
their sex, and much more — simultaneously.
Multiple Linear Regression (MLR) extends SLR to
accommodate this complexity. It allows us to:
- Adjust for confounding — estimate the association
between an exposure and outcome after accounting for other
covariates
- Improve prediction — use multiple pieces of
information to better forecast an outcome
- Model effect modification — examine whether the
effect of one variable changes across levels of another
- Gain precision — by explaining additional variance
in the outcome, we can reduce uncertainty in our estimates
In epidemiology, MLR is our primary tool for multivariable analysis
of continuous outcomes.
Part 1: Guided Practice — Multiple Linear Regression
1. From Simple to Multiple: The Big Picture
1.1 Extending the Model
Recall the SLR model:
\[Y_i = \beta_0 + \beta_1 X_i +
\varepsilon_i\]
Multiple linear regression simply adds more predictors:
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2
X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i\]
Or equivalently, using summation notation:
\[Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j
X_{ji} + \varepsilon_i\]
| \(Y_i\) |
Outcome for subject \(i\) (mentally
unhealthy days) |
| \(\beta_0\) |
Intercept — expected value of \(Y\)
when all \(X\)s = 0 |
| \(\beta_j\) |
Partial regression coefficient for predictor \(j\) |
| \(X_{ji}\) |
Value of predictor \(j\) for
subject \(i\) |
| \(p\) |
Number of predictors in the model |
| \(\varepsilon_i\) |
Random error for subject \(i\),
\(\varepsilon_i \overset{iid}{\sim}
N(0,\sigma^2)\) |
The word partial is critical. Each \(\beta_j\) represents the estimated change
in \(E(Y)\) for a one-unit increase in
\(X_j\), holding all other
predictors constant. This “holding constant” is the
mathematical mechanism by which we adjust for confounding.
1.2 Three Goals of Regression in Epidemiology
There are three analytic goals. Understanding which goal is driving
your analysis determines how you build and interpret your model:
| Descriptive |
How do the \(X\)s describe \(Y\) in this sample? |
What variables are correlated with mental health days in BRFSS
2020? |
| Predictive |
Do the \(X\)s help predict \(Y\) statistically? |
Can we build a model to identify people likely to report many
mentally unhealthy days? |
| Associative (Etiologic) |
What is the effect of a specific \(X\) on \(Y\), controlling for confounders? |
What is the effect of sleep on mental health days, after adjusting
for age, income, and physical health? |
In epidemiology, we are most often doing associative
analysis — estimating causal or causal-adjacent effects after
controlling for confounding.
1.3 Why MLR is More Challenging than SLR
- Model selection is non-trivial — there may be many
candidate predictors, and the “right” model depends on the scientific
question.
- You cannot visualize the relationship in \(p\)-dimensional space. We work with 2D
projections (partial regression plots, predicted vs. observed
plots).
- Interpretation requires discipline — the “holding
all else constant” framing must always accompany coefficient
interpretation.
- Coefficients are computed numerically — unlike SLR,
there are no simple closed-form formulas we can compute by hand for
\(p > 2\).
2. Assumptions of Multiple Linear Regression
The assumptions are a direct extension of the SLR assumptions — often
summarized with the LINE mnemonic:
| L |
Linearity |
\(E(Y)\) is a linear
function of the \(\beta_j\)s.
Non-linear functions of \(X\) (e.g.,
\(X^2\), \(\log X\)) are allowed, as long as the model
is linear in the coefficients. |
| I |
Independence |
Observations \(Y_i\) are
independent of each other. |
| N |
Normality |
Errors \(\varepsilon_i \sim N(0,
\sigma^2)\). Required for inference (t-tests, CIs). Not required
for OLS estimation itself. |
| E |
Equal variance |
\(\text{Var}(\varepsilon_i) =
\sigma^2\) for all \(i\)
(homoscedasticity). |
Important clarifications:
- The linearity assumption refers to linearity in the
parameters (\(\beta\)s), not
in the predictors. Including \(X^2\) as
a predictor doesn’t violate linearity.
- Normality is only needed for valid inference. With
large samples (like BRFSS), the Central Limit Theorem means mild
violations cause minimal bias.
- Homoscedasticity matters most for standard errors
and inference. Mild departures are tolerable; severe heteroscedasticity
requires correction (robust SEs, WLS).
# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
3. Missing Data
Before fitting any model, we must address missing data. Let’s expand
on it with our BRFSS context.
3.1 Why Missing Data Matters
Statistical software performs complete case analysis
by default: observations with any missing value on any variable in the
model are dropped. This has two consequences:
- Loss of sample size — reduces power
- Potential bias — if the missing cases differ
systematically from observed cases
The direction and magnitude of bias depends on the
mechanism of missingness.
3.2 The Three Mechanisms
| MCAR — Missing Completely at Random |
P(missing) is unrelated to observed or unobserved data |
No |
Lab equipment randomly fails; a beaker is dropped |
| MAR — Missing at Random |
P(missing) depends on observed data, not on the missing value
itself |
Potentially, but adjustable |
Men are more likely to skip the income question; once you account
for sex, the missingness is independent of income |
| MNAR — Missing Not at Random |
P(missing) depends on the unobserved value itself |
Yes, difficult to correct |
People with very low income refuse to report income because of the
stigma |
MCAR is rarely a safe assumption with human subjects
data. In BRFSS, income non-response is strongly
socioeconomically patterned — a classic MNAR concern.
3.3 Implications for Practice
- For MCAR and MAR, multiple imputation (using the
mice package in R) can recover unbiased estimates.
- For MNAR, there is no general statistical fix. You must
acknowledge the limitation and reason about the
direction of potential bias.
- In this course, we will use complete case analysis
but will always note the amount and likely pattern of missingness.
# Assess missingness in our analytic variables before the sample was taken
brfss_full %>%
select(menthlth, physhlth, sleptim1, age80, income2, sexvar, exerany2, bmi5) %>%
summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
arrange(desc(Pct_Missing_or_DK)) %>%
kable(
caption = "Table 2. Missing or Don't Know/Refused (%) — BRFSS 2020 Full Sample",
col.names = c("Variable", "% Missing / DK / Refused")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2. Missing or Don’t Know/Refused (%) — BRFSS 2020 Full Sample
|
Variable
|
% Missing / DK / Refused
|
|
physhlth
|
71.5
|
|
menthlth
|
65.6
|
|
income2
|
19.9
|
|
bmi5
|
10.3
|
|
age80
|
1.4
|
|
sleptim1
|
1.2
|
|
sexvar
|
0.0
|
|
exerany2
|
0.0
|
4. Fitting the MLR Model in R
4.1 Model Specification
We will build our model in stages — a common and pedagogically
powerful approach that mirrors what you would do in a real
epidemiological analysis.
Model 1 — Unadjusted (SLR baseline): Mentally
unhealthy days ~ Physical health days
Model 2 — Sleep added: + Sleep hours
Model 3 — Full model: + Age + Income + Sex +
Exercise
This sequential approach lets us see how each block of variables
changes the coefficients and model fit.
# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ physhlth_days, data = brfss_mlr)
# Model 2: Add sleep
m2 <- lm(menthlth_days ~ physhlth_days + sleep_hrs, data = brfss_mlr)
# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + income_cat + sex + exercise,
data = brfss_mlr)
4.3 Examining Model 3 Output
##
## Call:
## lm(formula = menthlth_days ~ physhlth_days + sleep_hrs + age +
## income_cat + sex + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
tidy(m3, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB") # highlight key predictors
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple
Predictors (BRFSS 2020, n = 5,000)
|
Term
|
Estimate (β̂
|
Std. Erro
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
|
Income (ordinal 1-8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
|
Sex: Female (ref = Male)
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
|
Exercise: Yes (ref = No)
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
5. Interpreting Coefficients
This is the most important and most commonly mis-stated skill in
applied regression. Let’s do it carefully.
5.1 The General Rule
For predictor \(X_j\): \(\hat{\beta}_j\) is the estimated change in
the mean of \(Y\) for
a one-unit increase in \(X_j\), holding all other predictors
in the model constant.
The phrase “holding all other predictors constant” (also called
ceteris paribus or partial effect) is what
distinguishes MLR from SLR. It is the mathematical expression of
confounder adjustment.
b <- round(coef(m3), 3)
ci <- round(confint(m3), 3)
5.2 The Fitted Equation
\[\widehat{\text{Mental Health Days}} =
12.475 + 0.292(\text{Phys. Health Days}) + -0.509(\text{Sleep}) +
-0.082(\text{Age}) + -0.321(\text{Income}) + 1.245(\text{Female}) +
-0.343(\text{Exercise: Yes})\]
5.3 Interpreting Each Term
Physical health days (\(\hat{\beta}\) = 0.292): Each
additional day of poor physical health is associated with an estimated
0.292 additional mentally unhealthy day on average,
holding sleep, age, income, sex, and exercise constant (95% CI: 0.265 to
0.318). This is the strongest predictor in the model and is consistent
with the bidirectional mind-body connection documented in the
literature.
Sleep hours (\(\hat{\beta}\)
= -0.509): Each additional hour of sleep per night is
associated with an estimated 0.509 fewer mentally unhealthy
days on average, adjusting for all other covariates (95% CI:
-0.657 to -0.361). The negative sign indicates a protective
association.
Age (\(\hat{\beta}\) =
-0.082): Each additional year of age is associated with
0.082 fewer mentally unhealthy days on average (holding
all else constant). This counterintuitive finding is well-documented —
older adults often report fewer mental health difficulties, possibly due
to better emotion regulation, survivor bias, or cohort effects.
Income category (\(\hat{\beta}\) = -0.321): Each
one-unit increase in the income category (on the 1–8 ordinal scale) is
associated with 0.321 fewer mentally unhealthy days on
average, consistent with the well-established socioeconomic gradient in
mental health.
Sex: Female (\(\hat{\beta}\)
= 1.245): Compared to males (the reference group), females
report an estimated 1.245 more mentally unhealthy days
on average, holding all other variables constant.
Exercise: Yes (\(\hat{\beta}\) = -0.343): People
who engaged in any physical activity in the past 30 days report an
estimated 0.343 fewer mentally unhealthy days compared
to those who did not exercise, adjusting for all other covariates.
Intercept (\(\hat{\beta}_0\)
= 12.475): The estimated mean mental health days for a person
with zero physically unhealthy days, zero sleep hours, age = 0, income
category = 0, who is male and does not exercise. This is a
mathematical artifact — not a meaningful quantity in
context.
Epidemiological insight: Notice that the physical
health coefficient changes when we add other covariates. Compare \(\hat{\beta}\) from Model 1 vs. Model 3.
This change represents the confounding that age,
income, sleep, sex, and exercise collectively introduce into the crude
physical health–mental health association.
# Compare the physhlth_days coefficient across models
tribble(
~Model, ~`Physical Health Days β̂`, ~`95% CI`, ~`Adj. R²`,
"M1 (unadjusted)", round(coef(m1)[2], 3),
paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (+sleep)", round(coef(m2)[2], 3),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (full)", round(coef(m3)[2], 3),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Physical Health Days Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 4. Physical Health Days Coefficient Across Sequential Models
|
Model
|
Physical Health Days β
|
|95% CI
|
Adj. R
|
|
M1 (unadjusted)
|
0.312
|
(0.286, 0.338)
|
0.099
|
|
M2 (+sleep)
|
0.300
|
(0.274, 0.326)
|
0.110
|
|
M3 (full)
|
0.292
|
(0.265, 0.318)
|
0.156
|
6. Hypothesis Testing for Individual Coefficients
6.1 The t-Test for Each \(\hat{\beta}_j\)
For each predictor \(X_j\), we
test:
\[H_0: \beta_j = 0 \quad \text{vs.} \quad
H_A: \beta_j \neq 0\]
(given all other predictors are in the model)
The test statistic is:
\[t =
\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\]
where \(n\) is sample size and \(p\) is the number of predictors. We compare
this to a \(t\)-distribution with \(n - p - 1\) degrees of freedom.
Critically: This tests whether \(X_j\) has a statistically significant
partial association with \(Y\), given the other predictors already in
the model. A predictor might be significant alone (unadjusted) but
non-significant in the multivariable model due to overlap with other
predictors (multicollinearity or confounding).
6.2 Interpreting the p-Values
tidy(m3, conf.int = TRUE) %>%
mutate(
Significance = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ "ns"
),
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female",
"exerciseYes" = "Exercise: Yes"
)
) %>%
select(term, estimate, std.error, statistic, p.value, conf.low, conf.high, Significance) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 5. Hypothesis Tests for Each Partial Coefficient (Model 3)",
col.names = c("Predictor", "β̂", "SE", "t", "p-value", "CI Lower", "CI Upper", "Sig.")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 5. Hypothesis Tests for Each Partial Coefficient (Model 3)
|
Predictor
|
β
|
S
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
***
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
***
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
***
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
***
|
|
Income (ordinal 1-8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
***
|
|
Sex: Female
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
***
|
|
Exercise: Yes
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
ns
|
tidy(m3, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(
term = dplyr::recode(term,
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (1-8)",
"sexFemale" = "Sex: Female",
"exerciseYes" = "Exercise: Yes"
),
term = fct_reorder(term, estimate),
sig = ifelse(p.value < 0.05, "Significant (p < 0.05)", "Non-significant")
) %>%
ggplot(aes(x = estimate, y = term, color = sig)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray60") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 0.9) +
geom_point(size = 3.5) +
scale_color_manual(values = c("Significant (p < 0.05)" = "steelblue",
"Non-significant" = "tomato")) +
labs(
title = "Partial Regression Coefficients with 95% Confidence Intervals",
subtitle = "Outcome: Mentally Unhealthy Days (BRFSS 2020, n = 5,000)",
x = "Estimated Change in Mental Health Days (β̂)",
y = NULL,
color = NULL
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
7. The Overall F-Test and ANOVA Decomposition
7.1 Partitioning Variability
Just as in SLR, MLR partitions total variability in \(Y\):
\[\underbrace{SSY}_{\text{Total}} =
\underbrace{SSR}_{\text{Model explains}} + \underbrace{SSE}_{\text{Model
doesn't explain}}\]
| Model (Regression) |
\(SSR\) |
\(p\) |
\(MSR = SSR/p\) |
\(F = MSR/MSE\) |
| Error (Residual) |
\(SSE\) |
\(n-p-1\) |
\(MSE = SSE/(n-p-1)\) |
|
| Total |
\(SSY\) |
\(n-1\) |
|
|
With \(p\) predictors, the model
uses \(p\) degrees of freedom (one for
each \(\hat{\beta}_j\)), leaving \(n - p - 1\) residual degrees of
freedom.
7.2 The Overall F-Test
The omnibus F-test asks:
\[H_0: \beta_1 = \beta_2 = \cdots =
\beta_p = 0 \quad \text{(none of the predictors matter)}\] \[H_A: \text{At least one } \beta_j \neq
0\]
\[F = \frac{MSR}{MSE} \sim F_{p,\; n-p-1}
\quad \text{under } H_0\]
A large F-value (small p-value) means the predictors collectively
explain more variation than expected by chance.
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## physhlth_days 1 29475 29474.7 586.3411 < 2.2e-16 ***
## sleep_hrs 1 3323 3323.0 66.1052 5.345e-16 ***
## age 1 9261 9261.5 184.2385 < 2.2e-16 ***
## income_cat 1 2649 2648.8 52.6918 4.503e-13 ***
## sex 1 1918 1918.4 38.1626 7.025e-10 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Overall Model Summary — Model 3
|
Statistic
|
Value
|
|
R²
|
0.1569
|
|
Adjusted R²
|
0.1559
|
|
Residual Std. Error (Root MSE)
|
7.0901
|
|
F-statistic
|
154.8953
|
|
p-value (overall F-test)
|
0.0000
|
|
Model df (p)
|
6.0000
|
|
Residual df (n − p − 1)
|
4993.0000
|
|
n (observations)
|
5000.0000
|
8. Coefficient of Determination: \(R^2\) and Adjusted \(R^2\)
8.1 \(R^2\)
\[R^2 = \frac{SSR}{SSY} = 1 -
\frac{SSE}{SSY}\]
\(R^2\) is the proportion of
total variance in \(Y\) explained by
the model. It ranges from 0 to 1.
Critical property: \(R^2\) always increases (or
stays the same) when you add a predictor, even if that predictor is
random noise. It cannot be used to compare models with different numbers
of predictors.
8.2 Adjusted \(R^2\)
\[R^2_{adj} = 1 - \frac{(1 - R^2)(n -
1)}{n - p - 1}\]
Adjusted \(R^2\) penalizes for the
number of predictors \(p\). It only
increases when a new predictor improves the model by more than chance
alone.
- \(R^2_{adj} \leq R^2\) always
- \(R^2_{adj}\) can
decrease if you add a useless predictor
- Use \(R^2_{adj}\) when comparing
models with different numbers of predictors
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Physical health only", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + Sleep", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: Full (+ age, income, sex, exercise)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 7. R² and Adjusted R² Across Sequential Models
|
Model
|
Predictors
|
R2
|
Adj. R²
|
|
M1: Physical health only
|
1
|
0.0990
|
0.0988
|
|
M2: + Sleep
|
2
|
0.1102
|
0.1098
|
|
M3: Full (+ age, income, sex, exercise)
|
6
|
0.1569
|
0.1559
|
Interpretation: R² for mental health outcomes in
population surveys is typically low (5–20%), and that’s expected. Mental
health is shaped by hundreds of unmeasured factors — genetics, life
events, social support, trauma history. A low R² does not mean the model
is “wrong” or “useless” for estimating associations. Statistical
significance (the t-tests) and the magnitude and direction of
coefficients matter more for etiologic questions.
8.3 Root MSE
\[RMSE = \sqrt{MSE} = s_{Y|X}\]
The Root MSE (also called the Residual Standard
Error in R output) is the estimated standard deviation of the
errors. It tells you, on average, how many days the
model’s prediction is off by.
rmse_m3 <- round(summary(m3)$sigma, 2)
cat("Root MSE (Model 3):", rmse_m3, "mentally unhealthy days\n")
## Root MSE (Model 3): 7.09 mentally unhealthy days
9. Visualizing Predicted Values and Effects
9.1 Predicted vs. Observed
augment(m3) %>%
ggplot(aes(x = .fitted, y = menthlth_days)) +
geom_point(alpha = 0.08, size = 0.8, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1, linetype = "dashed") +
geom_smooth(method = "loess", color = "darkblue", linewidth = 1, se = FALSE) +
labs(
title = "Predicted vs. Observed Mentally Unhealthy Days",
subtitle = "Dashed = perfect prediction; Blue = LOESS smoother",
x = "Fitted (Predicted) Values (Ŷ)",
y = "Observed Mentally Unhealthy Days (Y)"
) +
theme_minimal(base_size = 13)
9.2 Marginal Effects Plots (ggeffects)
The ggeffects package computes adjusted
predictions — the predicted mean of \(Y\) across values of one predictor, with
all other predictors held at their means (or reference categories). This
is a powerful visualization of what MLR is actually estimating.
ggpredict(m3, terms = "physhlth_days") %>%
plot() +
labs(
title = "Marginal Effect of Physical Health Days on Mentally Unhealthy Days",
subtitle = "Other covariates held at their means/modes",
x = "Number of Physically Unhealthy Days (past 30)",
y = "Predicted Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)
ggpredict(m3, terms = "sleep_hrs") %>%
plot() +
labs(
title = "Marginal Effect of Sleep on Mentally Unhealthy Days",
subtitle = "Other covariates held at their means/modes",
x = "Sleep Hours per Night",
y = "Predicted Mentally Unhealthy Days"
) +
theme_minimal(base_size = 13)
ggpredict(m3, terms = c("exercise", "sex")) %>%
plot() +
labs(
title = "Predicted Mental Health Days by Exercise Status and Sex",
subtitle = "Adjusted for physical health, sleep, age, income",
x = "Any Physical Activity in Past 30 Days",
y = "Predicted Mentally Unhealthy Days",
color = "Sex", fill = "Sex"
) +
theme_minimal(base_size = 13)
10. Residual Diagnostics
We assess the LINE assumptions through residual plots. Use
plot(model) in base R or augment() +
ggplot2.
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
How to read each plot:
| Residuals vs. Fitted |
Random scatter around 0; no pattern |
Linearity + Homoscedasticity |
| Normal Q-Q |
Points follow diagonal line |
Normality of residuals |
| Scale-Location (√|e|) |
Flat red line; constant spread |
Homoscedasticity |
| Cook’s Distance / Residuals vs. Leverage |
No points beyond 0.5 or 1.0 Cook’s D contour |
Influential observations |
augment(m3) %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_point(alpha = 0.07, size = 0.8, color = "steelblue") +
geom_smooth(method = "loess", color = "darkblue", se = FALSE) +
labs(
title = "Residuals vs. Fitted Values",
subtitle = "Ideally: random scatter around zero, no systematic pattern",
x = "Fitted Values",
y = "Residuals (Y − Ŷ)"
) +
theme_minimal(base_size = 13)
Note on our outcome: Because mental health days is
bounded at 0 and 30 and is heavily right-skewed, perfect normality and
homoscedasticity are not achievable. Our large sample size (n = 5,000)
provides robustness. In practice, you might consider a log
transformation or Poisson/negative binomial regression as alternatives —
topics we’ll cover later in the course.
11. Main Effects and Interaction Terms
11.1 Higher-Order Terms
The MLR framework accommodates higher-order terms
(polynomials, interactions) while remaining “linear in the parameters.”
Some key rules:
- Quadratic terms: If you include \(X^2\), you must also include \(X\) itself.
- Interaction terms: If you include \(X_1 \times X_2\), you must include both
\(X_1\) and \(X_2\) as main effects.
These are called the hierarchical principle or
marginality rule in regression modeling.
11.2 Testing for an Interaction
Does the effect of sleep on mental health differ between males and
females?
# Model with sleep × sex interaction
m4_interact <- lm(menthlth_days ~ physhlth_days + sleep_hrs * sex + age + income_cat + exercise,
data = brfss_mlr)
tidy(m4_interact, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 8. Model with Sleep × Sex Interaction",
col.names = c("Term", "β̂", "SE", "t", "p-value", "CI Low", "CI High")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 8. Model with Sleep × Sex Interaction
|
Term
|
β
|
S
|
|
(Intercept)
|
12.3812
|
0.9203
|
13.4537
|
0.0000
|
10.5770
|
14.1853
|
|
physhlth_days
|
0.2917
|
0.0136
|
21.4763
|
0.0000
|
0.2651
|
0.3183
|
|
sleep_hrs
|
-0.4960
|
0.1101
|
-4.5072
|
0.0000
|
-0.7118
|
-0.2803
|
|
sexFemale
|
1.4178
|
1.0757
|
1.3180
|
0.1876
|
-0.6911
|
3.5267
|
|
age
|
-0.0823
|
0.0059
|
-13.8720
|
0.0000
|
-0.0940
|
-0.0707
|
|
income_cat
|
-0.3210
|
0.0521
|
-6.1673
|
0.0000
|
-0.4231
|
-0.2190
|
|
exerciseYes
|
-0.3422
|
0.2532
|
-1.3518
|
0.1765
|
-0.8386
|
0.1541
|
|
sleep_hrs:sexFemale
|
-0.0244
|
0.1495
|
-0.1635
|
0.8701
|
-0.3174
|
0.2686
|
ggpredict(m4_interact, terms = c("sleep_hrs", "sex")) %>%
plot() +
labs(
title = "Predicted Mental Health Days by Sleep Hours and Sex",
subtitle = "Interaction model: slopes are allowed to differ by sex",
x = "Sleep Hours per Night",
y = "Predicted Mentally Unhealthy Days",
color = "Sex", fill = "Sex"
) +
theme_minimal(base_size = 13)
# Likelihood ratio / F-test comparing models with and without interaction
anova(m3, m4_interact) %>%
kable(caption = "Table 9. F-Test Comparing Models With and Without Interaction") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 9. F-Test Comparing Models With and Without Interaction
|
Res.Df
|
RSS
|
Df
|
Sum of Sq
|
F
|
Pr(>F)
|
|
4993
|
250992.8
|
NA
|
NA
|
NA
|
NA
|
|
4992
|
250991.5
|
1
|
1.344174
|
0.0267344
|
0.8701261
|
Interpretation: If the interaction term is
statistically significant, the slopes (effect of sleep) differ between
males and females. If not significant, a model with only main effects is
preferred (parsimony). We’ll cover interaction/effect modification
formally in a dedicated lecture.
12. The Multivariate Normal Distribution (Conceptual Bridge)
Just as SLR is linked to the bivariate normal
distribution, MLR is linked to the multivariate normal
distribution. When all continuous variables follow a
multivariate normal distribution, the conditional distribution of \(Y\) given any combination of \(X\)s is:
\[Y \mid X_1, X_2, \ldots, X_p \sim
N(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \;
\sigma^2_{Y|X})\]
This is the population model that our sample estimates are trying to
recover. The bivariate normal from the correlation lecture is simply a
special case with \(p = 1\).
13. Summary of Key Takeaways
| MLR formula |
\(Y_i = \beta_0 + \sum_{j=1}^p \beta_j
X_{ji} + \varepsilon_i\) |
| Partial coefficient |
\(\hat{\beta}_j\) = change in \(E(Y)\) per 1-unit increase in \(X_j\), holding all else
constant |
| Confounder adjustment |
Achieved mathematically by including confounders as additional
predictors |
| Assumptions |
LINE: Linearity, Independence, Normality, Equal variance |
| Missing data |
MCAR, MAR, MNAR — mechanism determines bias risk |
| F-test |
Tests \(H_0: \beta_1 = \cdots = \beta_p =
0\) (omnibus test) |
| \(R^2\) vs. Adj. \(R^2\) |
Use Adj. \(R^2\) to compare models
of different sizes |
| Root MSE |
Average prediction error in outcome units |
Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
#brfss_mlr <- readRDS(brfss_mlr_2020.rds)
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") %>%
as_flex_table()
**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**Characteristic | N | N = 5,0001 |
|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
Sleep (hours/night) | 5,000 | 7.1 (1.3) |
Age (years) | 5,000 | 54.3 (17.2) |
Household income | 5,000 |
|
<$10k |
| 190 (3.8%) |
$10-15k |
| 169 (3.4%) |
$15-20k |
| 312 (6.2%) |
$20-25k |
| 434 (8.7%) |
$25-35k |
| 489 (9.8%) |
$35-50k |
| 683 (14%) |
$50-75k |
| 841 (17%) |
>$75k |
| 1,882 (38%) |
Sex | 5,000 |
|
Male |
| 2,331 (47%) |
Female |
| 2,669 (53%) |
Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
1Mean (SD); n (%) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "coral", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_classic(base_size = 13)
plot(p_hist)
Interpretation The number of mentally unhealthy days
(0-30) reported is heavily right skewed, with a large number of
respondents reporting 0 mentally unhealthy days, and a long tail. Given
that this is count data, with hard boundaries, transformations or other
model structures may make more sense, but we will investigate the linear
model to determine how robust it is.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
p_mat <- ggpairs(brfss_mlr, columns = 1:4,
columnLabels = c("Poor Mental Health Days", "Poor Physical Health Days", "Hours of Sleep", "Age"))
print(p_mat)
The strongest correlation appears to be between Poor Mental Health Days
and Poor Physical Health Days (0.315). Age (-0.156) and Hours of Sleep
(-0.140) each have slighter negative correlations with the outcome.
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
##fit slr
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Summary output
summary(model_slr)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)
\[\widehat{\text{Poor Mental Health Days}}
= 9.474 + -0.8042 \times \text{Hours of Sleep}\]
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon): For each additional hour of sleep night, a person is expected
to have about 0.8 fewer poor mental health days (out of 30).
# Extract and report: t-statistic, p-value
n <- nrow(brfss_mlr)
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "sleep_hrs")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
t-Test for the Slope (H₀: β₁ = 0)
|
Quantity
|
Value
|
|
Slope (b₁)
|
-0.8042
|
|
SE(b₁)
|
0.0802
|
|
t-statistic
|
-10.022
|
|
Degrees of freedom
|
4998
|
|
p-value
|
<2e-16
|
|
95% CI Lower
|
-0.9616
|
|
95% CI Upper
|
-0.6469
|
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
\[H_0: \beta_1 = 0 \quad \text{(no linear
relationship between X and Y)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a
linear relationship)}\]
Decision: With p < 0.001, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is
statistically significant evidence of a linear association between hours
of sleep and poor physical health days.
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: Add sleep
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
# Compare the physhlth_days coefficient across models
b <- round(coef(m3), 3)
ci <- round(confint(m3), 3)
tribble(
~Model, ~`Sleep Hours β̂`, ~`95% CI`, ~`Adj. R²`,
"M1 (unadjusted)", round(coef(m1)[2], 3),
paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (+age and sex)", round(coef(m2)[2], 3),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (full)", round(coef(m3)[2], 3),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Sleep Hours Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 4. Sleep Hours Coefficient Across Sequential Models
|
Model
|
Sleep Hours β
|
|95% CI
|
Adj. R
|
|
M1 (unadjusted)
|
-0.804
|
(-0.962, -0.647)
|
0.020
|
|
M2 (+age and sex)
|
-0.734
|
(-0.889, -0.578)
|
0.050
|
|
M3 (full)
|
-0.509
|
(-0.657, -0.361)
|
0.156
|
The sleep hours coefficient changes substantially when more
covariates are added, down to -0.5 in the full model from -0.8 in the
SLR - indicating that there is confounding.
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
5.2 The Fitted Equation
Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
sleep_hrs + income_cat + exercise, data = brfss_mlr)
\[\widehat{\text{Mental Health Days}} =
12.475 + -0.509(\text{Hours of Sleep}) + -0.082(\text{Age}) +
1.245(\text{Sex}) + 0.292(\text{Physical Health Days}) +
-0.321(\text{Income}) + -0.343(\text{Exercise: Yes})\]
5.3 Interpreting Each Term
Hours of Sleep (\(\hat{\beta}\) = -0.509): Each
additional hours of extra sleep is associated with an estimated
-0.509 fewer mentally unhealthy day on average, holding
Age, Sex, Physical Health Days, Income, and Exercise constant (95% CI:
-0.657 to -0.361), indicating a protective association.
Age (\(\hat{\beta}\) =
-0.082): Each additional year of age is associated with
0.082 fewer mentally unhealthy days on average (holding
all else constant). This counterintuitive finding is well-documented —
older adults often report fewer mental health difficulties, possibly due
to better emotion regulation, survivor bias, or cohort effects.
Sex: Female (\(\hat{\beta}\)
= 1.245): Compared to males (the reference group), females
report an estimated 1.245 more mentally unhealthy days
on average, holding all other variables constant.
Physical health days (\(\hat{\beta}\) = 0.292): Each
additional day of poor physical health is associated with an estimated
0.292 additional mentally unhealthy day on average.
This is consistent with the bidirectional mind-body connection
documented in the literature.
Income category (\(\hat{\beta}\) = -0.321): Each
one-unit increase in the income category (on the 1–8 ordinal scale) is
associated with 0.292 fewer mentally unhealthy days on
average, consistent with the well-established socioeconomic gradient in
mental health.
Exercise: Yes (\(\hat{\beta}\) = -0.343): People
who engaged in any physical activity in the past 30 days report an
estimated 0.343 fewer mentally unhealthy days compared
to those who did not exercise, adjusting for all other covariates.
Intercept (\(\hat{\beta}_0\)
= 12.475): The estimated mean mental health days for a person
with zero physically unhealthy days, zero sleep hours, age = 0, income
category = 0, who is male and does not exercise. This is a
mathematical artifact — not a meaningful quantity in
context.
Task 4: Model Fit and ANOVA (20 points)
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
tibble(
Model = c("Mental Health Days ~ Sleep Hours", "Mental Health Days ~ Sleep Hours + Age + Sex", "Mental Health Days ~ Sleep Hours + Age + Sex + Phys Health Days + Income + Exercise"),
R_squared = c(
round(summary(m1)$r.squared, 4),
round(summary(m2)$r.squared, 4),
round(summary(m3)$r.squared, 4)
),
Adj_R2 = c(
round(summary(m1)$adj.r.squared, 4),
round(summary(m2)$adj.r.squared, 4),
round(summary(m3)$adj.r.squared, 4)
),
AIC = c(round(AIC(m1), 1),
round(AIC(m2), 1),
round(AIC(m3), 1))
) %>%
kable(
caption = "Model Comparison: Poor Mental Health Days vs. Hours of Sleep",
col.names = c("Model", "R^2", "Adj. R^2", "AIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.min(c(AIC(m1), AIC(m2), AIC(m3))),
bold = TRUE, background = "#d4edda")
Model Comparison: Poor Mental Health Days vs. Hours of Sleep
|
Model
|
R^2
|
Adj. R^2
|
AIC
|
|
Mental Health Days ~ Sleep Hours
|
0.0197
|
0.0195
|
34529.3
|
|
Mental Health Days ~ Sleep Hours + Age + Sex
|
0.0504
|
0.0498
|
34374.4
|
|
Mental Health Days ~ Sleep Hours + Age + Sex + Phys Health Days + Income
+ Exercise
|
0.1569
|
0.1559
|
33785.3
|
M3, the full multivariate model explains the most variance across the
three different models.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
augmented <- augment(m3)
n <- nrow(brfss_mlr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (sigma^2)", "Residual Std. Error (sigma)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
Units = c("", "", "Mentally Unhealthy Days")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Error Estimates
|
Quantity
|
Value
|
Units
|
|
SS Residual
|
250992.800
|
|
|
MSE (sigma^2)
|
50.219
|
|
|
Residual Std. Error (sigma)
|
7.087
|
Mentally Unhealthy Days
|
The Root MSE (also called the Residual Standard
Error in R output) is the estimated standard deviation of the
errors. It tells you, on average, how many days the
model’s prediction is off by. In this case, the MSE is off by about 7
mental health days typically.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R^2",
"adj.r.squared" = "Adjusted R^2",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n - p - 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Overall Model Summary - Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Overall Model Summary - Model 3
|
Statistic
|
Value
|
|
R^2
|
0.1569
|
|
Adjusted R^2
|
0.1559
|
|
Residual Std. Error (Root MSE)
|
7.0901
|
|
F-statistic
|
154.8953
|
|
p-value (overall F-test)
|
0.0000
|
|
Model df (p)
|
6.0000
|
|
Residual df (n - p - 1)
|
4993.0000
|
|
n (observations)
|
5000.0000
|
| Model (Regression) |
\(SSR\) |
\(p\) |
\(MSR = SSR/p\) |
\(F = MSR/MSE\) |
| Error (Residual) |
\(SSE\) |
\(n-p-1\) |
\(MSE = SSE/(n-p-1)\) |
|
| Total |
\(SSY\) |
\(n-1\) |
|
|
Calculations in excel: | Source | df | SS | MS | F |
|——–|—-|——|—-|—-| | Model | 6 | 46719 | 7786.59 | 154.9 | | Residual |
4993 | 250992 | 50.27 | | | Total | 5000 | 297711 | | |
State the null hypothesis for the overall F-test and your
conclusion.
\[H_0: \beta_1 = \beta_2 = \cdots =
\beta_p = 0 \quad \text{(none of the predictors matter)}\]
\[H_A: \text{At least one } \beta_j \neq
0\]
A large F-value (small p-value) means the predictors collectively
explain more variation than expected by chance - in this case, the
F-value is large, indicating that at least one of the predictors and
potentially all of them collectively do indicate that we can reject the
null.
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
The Residuals vs. Fitted test tests assumptions around linearity and
homoscedasticity - in this case, we do see that there are violations of
assumptions, in that the red line dips quite far from 0 and we see
something of a pattern.
Normal Q-Q tests for normality of the residuals - these depart from
the diagonal, likely a result of the heavy right skew and bounded (0-30)
nature of the outcome.
Scale-Location looks for a flat red line to indicate that there is no
homoscedasticity - in this case, we see that the line is very angled,
indicating there is quite a lot of heteroscedaticity.
Cook’s distance - it appears that there are 3 potentially influential
outliers labeled on the chart.
| Residuals vs. Fitted |
Random scatter around 0; no pattern |
Linearity + Homoscedasticity |
| Normal Q-Q |
Points follow diagonal line |
Normality of residuals |
| Scale-Location (√|e|) |
Flat red line; constant spread |
Homoscedasticity |
| Cook’s Distance / Residuals vs. Leverage |
No points beyond 0.5 or 1.0 Cook’s D contour |
Influential observations |
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
Given the heavy right skew with so many respondents reporting 0, and
the outcome being bounded between 0-30, the assumptions of normality of
residuals (because it’s bounded) and heteroscedascity (from the right
skew) are impossible. However, because we have such a high sample size
(5,000), this does not necessarily invalidate it. Mental health days are
count, so we could potentially use poisson (although it is bounded, so
this may not be ideal), or negative binomial regression as alternatives,
to see if the fits are better there.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
There are three observations with Cook’s Distance > 1 - labeled on
the plot. It would make sense to conduct a sensitivity analysis with
these observations - comparing the models to models with these
observations excluded, to see the real influence that they have.
Task 6: Interpretation for a Public Health Audience (10 points)
Suppose you are writing the results section of a public health paper.
Write a 3–4 sentence paragraph summarizing the findings
from Model C for a non-statistical audience. Your paragraph should:
- Identify which predictors were significantly associated with mental
health days
- State the direction and approximate magnitude of the most important
associations
- Appropriately caveat the cross-sectional nature of the data (no
causal language)
- Not use any statistical jargon (no “significant,” “coefficient,”
“p-value”)
This analysis of BRFSS survey data investigated the association
between the number of reported poor mental health days (out of 30) and
other factors - hours of sleep per night, age, sex, number of poor
reported physical health days (out of 30), income bracket, and exercise
(in the last 30 days) - this data is all taken at the same time, so no
causation can be inferred. We found that nightly hours of sleep, age,
income, and exercise all seem to be potential protective factor. Hours
of sleep a night are the most impactful protective factor, with our
finding that with all other elements held the same, for each additional
hour of sleep a night, respondents would report about half a day fewer
of poor mental health days (out of 30), and both exercise and a higher
level of income reflected about 1/3 of a day fewer poor mental health
days. Being female and reporting poor physical health days had increased
association with poor mental health days, with reported poor physical
health days having the largest effect - where a poor physical health day
(out of 30) would be associated with about 1/3 of an additional day of
reported poor mental health.
End of Lab Activity