In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.
Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:
In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)We continue using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 — a nationally representative telephone survey of U.S. adults conducted by the CDC. This dataset contains over 400,000 respondents with detailed information on health behaviors, chronic conditions, and social determinants of health.
Research question for today:
What individual, behavioral, and socioeconomic factors are associated with the number of mentally unhealthy days in the past 30 days among U.S. adults?
This is a quintessential epidemiological question: we have a health outcome (mental health days) and want to understand its multivariable determinants while appropriately controlling for confounders.
# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_mlr,
"C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_mlr_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026
In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
# add_overall() %>%
bold_labels() %>%
#italicize_levels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Household income | 5,000 | |
| <$10k | 190 (3.8%) | |
| $10-15k | 169 (3.4%) | |
| $15-20k | 312 (6.2%) | |
| $20-25k | 434 (8.7%) | |
| $25-35k | 489 (9.8%) | |
| $35-50k | 683 (14%) | |
| $50-75k | 841 (17%) | |
| >$75k | 1,882 (38%) | |
| Sex | 5,000 | |
| Male | 2,331 (47%) | |
| Female | 2,669 (53%) | |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)Distribution of Mentally Unhealthy Days (BRFSS 2020)
Note: The outcome is right-skewed — most respondents report zero mentally unhealthy days, with a tail toward 30. The majority of respondents (the mode at 0) report zero mentally unhealthy days in the past month, with progressively fewer people reporting higher values. There is a small secondary peak at 30 days, representing individuals who experienced poor mental health throughout the entire month.
Implications for regression modeling:
Normality assumption: The residuals from a linear model are unlikely to be perfectly normal because the outcome itself is highly non-normal. However, with our large sample size (n = 5,000), the Central Limit Theorem provides robustness— our coefficient estimates and standard errors will still be approximately valid.
Homoscedasticity concern: The bounded nature of the outcome (0-30 days) and the clustering at 0 may lead to heteroscedasticity (unequal variance of residuals). We should check residual plots carefully.
Alternative models: For count outcomes with excess zeros, zero-inflated models, negative binomial regression, or Poisson regression might be more appropriate. However, linear regression remains interpretable and is often used as a starting point.
Study Note: Right-skewed outcomes are common in health data (hospital days, sick days, healthcare costs). Linear regression can still provide valid estimates of associations, especially with large samples, though predictions at extreme values should be interpreted cautiously.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
# Create scatterplot matrix for continuous variables
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
"Mentally unhealthy days (past 30)" = menthlth_days,
"Physically unhealthy days (past 30)" = physhlth_days,
"Sleep (hours/night)" = sleep_hrs,
"Age (years)"= age ,
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)Mental Health Days vs. Physical Health Days (r ≈
0.315):
This is the strongest correlation in the matrix. People with more mental
distress also report more physical distress.
Mental Health Days vs. Sleep Hours (r ≈
-0.14):
Very weak negative relation. Less sleep is associated with slightly more
mentally unhealthy days, but the relationship is modest. Sleep may be
both a cause (poor sleep worsens mood) and consequence of poor mental
health.
Mental Health Days vs. Age (r ≈ -0.156):
Also very weak negative relation. Older adults report slightly fewer
mentally unhealthy days on average.
These are unadjusted (crude) correlations. In multivariable regression, these relationships may change after accounting for confounding.
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# Fit simple linear regression: menthlth_days ~ sleep_hrs
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Display model summary
summary(model_slr)##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
Recall the SLR model:
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]
\[\widehat{\text{Mental Health Days}} = \beta_0 + \beta_1 \times (\text{Sleep Hours}) + \varepsilon_i\]
\[\widehat{Y} = 9.47 - 0.804 \times \text{Sleep} + \varepsilon_i\]
| Symbol | Meaning |
|---|---|
| \(Y_i\) | Outcome for subject \(i\) (mentally unhealthy days) |
| \(\beta_0\) | Intercept — expected value of \(Y\) when all \(X\)s = 0 |
| \(\beta_j\) | Partial regression coefficient for predictor \(j\) |
| \(X_{ji}\) | Value of predictor \(j\) for subject \(i\) |
| \(p\) | Number of predictors in the model |
| \(\varepsilon_i\) | Random error for subject \(i\), \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)\) |
The word partial is critical. Each \(\beta_j\) represents the estimated change in \(E(Y)\) for a one-unit increase in \(X_j\), holding all other predictors constant. This “holding constant” is the mathematical mechanism by which we adjust for confounding.
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
Interpretation: The negative slope (-0.332) means that for every additional hour of sleep per night, people report about 0.8 fewer mentally unhealthy days per month on average. The intercept (9.47) would be the predicted mental health days for someone with 0 hours of sleep, which is biologically impossible and thus not meaningful.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
Null and Alternative Hypotheses
\[H_0: \beta_1 = 0 \quad \text{(Sleep is not associated with mentally unhealthy days)}\] \[H_A: \beta_1 \neq 0 \quad \text{(Sleep is associated with mentally unhealthy days)}\] ### Test Results:
# Get confidence interval
ci_slr <- confint(model_slr)
# Extract test statistics
tidy_slr <- tidy(model_slr)
# Display results
tidy_slr %>%
filter(term == "sleep_hrs") %>%
mutate(
`95% CI Lower` = ci_slr[2, 1],
`95% CI Upper` = ci_slr[2, 2]
) %>%
select(term, estimate, std.error, statistic, p.value, `95% CI Lower`, `95% CI Upper`) %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
kable(caption = "Hypothesis Test for Sleep Coefficient",
col.names = c("Term", "Estimate", "SE", "t-statistic", "p-value", "CI Lower", "CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t-statistic | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| sleep_hrs | -0.8042 | 0.0802 | -10.0218 | 0 | -0.9616 | -0.6469 |
Test Statistic and p‑value From the output: - t-statistic: -10.02 - p-value: < 2.2e-16
Degrees of Freedom df = n-2 = 4998
Conclusion: Because the p-value is below 0.05, we reject the null hypothesis. There is strong statistical evidence that sleep hours are associated with mentally unhealthy days.
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise# Model 1: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: Add sleep
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)##Examining Model A B and C Output
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.306 -3.980 -2.515 -0.307 32.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.76563 0.64447 18.256 < 2e-16 ***
## sleep_hrs -0.73391 0.07932 -9.253 < 2e-16 ***
## age -0.06648 0.00622 -10.688 < 2e-16 ***
## sexFemale 1.53992 0.21339 7.217 6.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared: 0.05036, Adjusted R-squared: 0.04979
## F-statistic: 88.32 on 3 and 4996 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
tidy(mC, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB")| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Intercept | 12.4755 | 0.7170 | 17.4006 | 0.0000 | 11.0699 | 13.8810 |
| Sleep (hours/night) | -0.5092 | 0.0753 | -6.7574 | 0.0000 | -0.6569 | -0.3614 |
| Age (years) | -0.0823 | 0.0059 | -13.8724 | 0.0000 | -0.0939 | -0.0707 |
| Sex: Female (ref = Male) | 1.2451 | 0.2023 | 6.1535 | 0.0000 | 0.8484 | 1.6417 |
| Physical health days | 0.2917 | 0.0136 | 21.4779 | 0.0000 | 0.2650 | 0.3183 |
| Income (ordinal 1-8) | -0.3213 | 0.0520 | -6.1778 | 0.0000 | -0.4233 | -0.2194 |
| Exercise: Yes (ref = No) | -0.3427 | 0.2531 | -1.3537 | 0.1759 | -0.8389 | 0.1536 |
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
# Compare the sleep coefficient across models
tribble(
~Model, ~`Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
"mA (unadjusted)", round(coef(mA)[2], 3),
paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
round(summary(mA)$adj.r.squared, 3),
"mB (+sleep)", round(coef(mB)[2], 3),
paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
round(summary(mB)$adj.r.squared, 3),
"mC (full)", round(coef(mC)[2], 3),
paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
round(summary(mC)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Model | Sleep β | |95% CI |
Adj. R
|
|---|---|---|---|
| mA (unadjusted) | -0.804 | (-0.962, -0.647) | 0.020 |
| mB (+sleep) | -0.734 | (-0.889, -0.578) | 0.050 |
| mC (full) | -0.509 | (-0.657, -0.361) | 0.156 |
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.
####The Fitted Equation
\[\widehat{\text{Mental Health Days}} = 12.475 + -0.509(\text{Sleep}) + -0.082(\text{age}) + 1.245(\text{Female}) + 0.292(\text{Phys. Health Days}) + -0.321(\text{income_cat}) + -0.343(\text{Exercise: Yes})\] menthlth_days = 12.4755 - 0.5092sleep_hrs - 0.0823age + 1.2451sexFemale + 0.2917physhlth_days - 0.3213income_cat - 0.3427exerciseYes
**Intercept (\(\hat{\beta}\) = 12.475) Only mathematical starting point. If a person slept 0 hours, was age 0, male, did not exercise, had zero physically unhealthy days, and was in the lowest income category, the model predicts about 12.5 mentally unhealthy days. Not very meaningful.
Sleep hours (\(\hat{\beta}\) = -0.509): For each additional hour of sleep per night, people report about half a day (-0.5) fewer mentally unhealthy days per month, holding all other factors constant.
Age (\(\hat{\beta}\) = -0.082): For every additional year of age, people report slightly fewer mentally unhealthy days.(-0.082) Younger adults report more mental distress than older adults, even after accounting for sleep, income, physical health, and exercise. The negative sign indicates a protective association.
Sex : Female (\(\hat{\beta}\) = 1.245): Compared to males (the reference group), females report an estimated of 1.25 more mentally unhealthy days per month , after adjusting for all other variables.
Income category (\(\hat{\beta}\) = 0.292): Each one-unit increase in the income category (on the 1–8 ordinal scale) is associated with 0.32 fewer mentally unhealthy days on average, consistent with the well-established socioeconomic gradient in mental health.
Physically Unhealthy days (\(\hat{\beta}\) = -0.321): Each additional physically unhealthy day is associated with about 0.29 more mentally unhealthy days. People experiencing more physical health problems also tend to report more mental distress.
Exercise: Yes (\(\hat{\beta}\) = -0.343): People who engaged in any physical activity in the past 30 days report an estimated 0.34 fewer mentally unhealthy days** compared to those who did not exercise, adjusting for all other covariates.
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 100.44 < 2.2e-16 ***
## Residuals 4998 291847 58.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 103.638 < 2.2e-16 ***
## age 1 6182 6182.2 109.249 < 2.2e-16 ***
## sex 1 2947 2947.1 52.079 6.131e-13 ***
## Residuals 4996 282717 56.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"mA: Sleep only", 1, round(summary(mA)$r.squared, 4), round(summary(mA)$adj.r.squared, 4),
"mB: + Sleep + age + sex", 2, round(summary(mB)$r.squared, 4), round(summary(mB)$adj.r.squared, 4),
"mC: Full", 6,
round(summary(mC)$r.squared, 4), round(summary(mC)$adj.r.squared, 4)
) %>%
kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| mA: Sleep only | 1 | 0.0197 | 0.0195 |
| mB: + Sleep + age + sex | 2 | 0.0504 | 0.0498 |
| mC: Full | 6 | 0.1569 | 0.1559 |
Interpretation: Model C explains the most variance in mentally unhealthy days (about 15.7%). As expected, adding more meaningful predictors improves explanatory power.
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
glance(mC) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model C") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| R² | 0.1569 |
| Adjusted R² | 0.1559 |
| Residual Std. Error (Root MSE) | 7.0901 |
| F-statistic | 154.8953 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n − p − 1) | 4993.0000 |
| n (observations) | 5000.0000 |
rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model 3):", rmse_mC, "mentally unhealthy days\n")## Root MSE (Model 3): 7.09 mentally unhealthy days
##Interpretation On average, the model’s predictions are off by about 7 mentally unhealthy days per month. Even with six predictors, the model still has substantial prediction error — people’s mental health days vary widely, and the model captures only part of that variability.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate total SS and df manually
ss_model <- sum(anova_C$`Sum Sq`[1:6]) # Sum of all predictor SS
ss_residual <- anova_C$`Sum Sq`[7] # Residual SS
ss_total <- ss_model + ss_residual
df_model <- sum(anova_C$Df[1:6]) # Sum of all predictor df
df_residual <- anova_C$Df[7]
df_total <- df_model + df_residual
# Calculate Mean Squares
ms_model <- ss_model / df_model
ms_residual <- ss_residual / df_residual
# Calculate F-statistic
f_statistic <- ms_model / ms_residual
# Get p-value
f_pvalue <- glance(mC)$p.value# Create formatted table
tibble(
Source = c("Model (Regression)", "Residual (Error)", "Total"),
df = c(df_model, df_residual, df_total),
SS = c(ss_model, ss_residual, ss_total),
MS = c(ms_model, ms_residual, NA),
F = c(f_statistic, NA, NA)
) %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
kable(caption = "Table 5. ANOVA Decomposition for Model C",
align = c("l", "r", "r", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#EBF5FB")| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model (Regression) | 6 | 46718.54 | 7786.42 | 154.9 |
| Residual (Error) | 4993 | 250992.80 | 50.27 | NA |
| Total | 4999 | 297711.34 | NA | NA |
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 6 | 46718.54 | 7786.42 | 154.9 |
| Residual | 4993 | 250992.80 | 50.27 | NA |
| Total | 4999 | 297711.34 | NA | NA |
State the null hypothesis for the overall F-test and your conclusion.
$$H_0: _j = 0
H_A: _j $$
(given all other predictors are in the model)
The null hypothesis states that none of the predictors in Model C improve prediction of mentally unhealthy days. The alternative says that at least one predictor contributes meaningful explanatory power.
The test statistic is:
\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\]
where \(n\) is sample size and \(p\) is the number of predictors. We compare this to a \(t\)-distribution with \(n - p - 1\) degrees of freedom.
A predictor may be significant alone (in a simple regression), but not significant in the multivariable model because its effect overlaps with other predictors (confounding or shared variance).
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
Standard Residual Diagnostic Plots
How to read each plot:
| Plot | What to Look For | Assumption Tested |
|---|---|---|
| Residuals vs. Fitted | Random scatter around 0; no pattern | Linearity + Homoscedasticity |
| Normal Q-Q | Points follow diagonal line | Normality of residuals |
| Scale-Location (√|e|) | Flat red line; constant spread | Homoscedasticity |
| Cook’s Distance / Residuals vs. Leverage | No points beyond 0.5 or 1.0 Cook’s D contour | Influential observations |
augment(mC) %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_point(alpha = 0.07, size = 0.8, color = "steelblue") +
geom_smooth(method = "loess", color = "darkblue", se = FALSE) +
labs(
title = "Residuals vs. Fitted Values",
subtitle = "Ideally: random scatter around zero, no systematic pattern",
x = "Fitted Values",
y = "Residuals (Y − Ŷ)"
) +
theme_minimal(base_size = 13)Residuals vs. Fitted (ggplot2)
##Interpretations
Residuals vs. Fitted We can see clear curvature rather than a flat horizontal band, Residuals are not centered evenly around 0 and variance increases slightly at higher fitted values. This means linearity is violated with the relationship between predictors and mental health days not being perfectly linear. Homoscedasticity is also questionable since residual spread is not constant.
Normal Q–Q Plot We can see strong deviations from the diagonal line, especially in both tails, Heavy right tail (large positive residuals). Residuals are not normally distributed. This is expected because the outcome is bounded (0–30) and right‑skewed.
Scale–Location Plot We can see that red line slopes upward, Spread of root of residuals increases with fitted values. Homoscedasticity is violated and Variability in mental health days increases for people with higher predicted values.
Cook’s Distance We can see few points (e.g., 830, 1439, 1532) stand out but none exceed the common cutoff of Cook’s D > 1. THis could mean that there is no highly influential observations and a few moderately influential points exist, but nothing that violates model stability.
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.
Because the outcome is bounded (0–30) and heavily right‑skewed, the following assumptions are most problematic: - Normality of residuals → violated - Homoscedasticity → violated - Linearity → little violated
However, it does not invalidate the analysis. With - n ≈ 5000, the linear model is good to non‑normality. Coefficient estimates remain unbiased. Standard errors may be slightly off, but the very large sample size stabilizes inference. The model is still useful for describing associations, which is the goal in most public‑health analyses.
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
##Answer: People reported more mentally unhealthy days when they also had more physically unhealthy days, were younger, slept less, had lower incomes, or were women. The strongest pattern we found was that people experiencing more physical health problems also tended to report more days of poor mental health, with smaller but meaningful differences linked to sleep, age, and income. Exercise showed only a very small difference in mental health days once the other factors were taken into account. Because these data were collected at one point in time, the results describe patterns in the population but cannot show whether any of these factors directly cause changes in mental health. —
End of Lab Activity