library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_mlr,
"data/brfss_mlr_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026
In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") %>%
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
Sleep (hours/night) | 5,000 | 7.1 (1.3) |
Age (years) | 5,000 | 54.3 (17.2) |
Household income | 5,000 | |
<$10k | 190 (3.8%) | |
$10-15k | 169 (3.4%) | |
$15-20k | 312 (6.2%) | |
$20-25k | 434 (8.7%) | |
$25-35k | 489 (9.8%) | |
$35-50k | 683 (14%) | |
$50-75k | 841 (17%) | |
>$75k | 1,882 (38%) | |
Sex | 5,000 | |
Male | 2,331 (47%) | |
Female | 2,669 (53%) | |
Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 10, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)Distribution of Mentally Unhealthy Days (BRFSS 2020)
The outcome is right-skewed with most respondents reporting no mentally unhealthy days, with a long tail toward 30. This means that outliers may skew our results.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)Mental health days and Physical health days (r= 0.315) shows a moderate positive correlation Mental health days and sleep (r= -0.140) shows a weak negative correlation Mental health days and age (r= -0.156) shows a weak negative correlation Physical health days and sleep (r= -0.112) shows a weak negative correlation Physical health days and age (r= 0.093) shows a weak positive correlation Sleep and age (r= 0.089) shows a weak positive correlation —
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
Mental Health Days = 9.474 + -0.804 (Sleep Hours)
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon). The estimated change in the mean number of days with poor mental health is -0.804 for a one-unit increase in sleep hours, holding all other predictors in the model constant.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? The null hypothesis is that there is no linear association between the number of days with poor health and BMI, whereas the alternative hypothesis is that there is a linear relationship. The null hypothesis can be rejected (p< 0.05) suggesting there is a linear relationship. The test-statistic is 16.42 with 4998 degrees of freedom.
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: Add sleep
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + physhlth_days + age + income_cat + sex + exercise, data = brfss_mlr)
summary(m3)##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + physhlth_days + age +
## income_cat + sex + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
tribble(
~Model, ~`Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
"M1 (unadjusted)", round(coef(m1)[2], 3),
paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (+ age and sex)", round(coef(m2)[2], 3),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (full)", round(coef(m3)[2], 3),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Model | Sleep β | |95% CI |
Adj. R
|
|---|---|---|---|
| M1 (unadjusted) | -0.804 | (-0.962, -0.647) | 0.020 |
| M2 (+ age and sex) | -0.734 | (-0.889, -0.578) | 0.050 |
| M3 (full) | -0.509 | (-0.657, -0.361) | 0.156 |
The sleep coefficient appears to decrease with increasing model covariates. This suggests that the covariates added to the model were confounding the relationship between sleep and days with poor mental health.
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. Mental Health Days = 12.475 + -0.509 (Sleep Hours) + 0.292 (Physhealh_days) + -0.082 (age) + -0.321 (income_cat) + 1.245 (sex: Female) + -0.343 (exercise: Yes) This fitted regression equation describes the association between covariates and mental health days: 1) The estimated change in the mean number of days with poor mental health is -0.509 for a one-unit increase in sleep hours, holding all other predictors in the model constant. 2) The estimated change in the mean number of days with poor mental health is 0.292 for a one-unit increase in days with poor physical health, holding all other predictors in the model constant. 3) The estimated change in the mean number of days with poor mental health is -0.082 for a one-unit increase in age, holding all other predictors in the model constant. 4) The estimated change in the mean number of days with poor mental health is -0.321 for a one-unit increase in income category, holding all other predictors in the model constant. 5) The estimated change in the mean number of days with poor mental health is 1.245 for females compared to males, holding all other predictors in the model constant. 6) The estimated change in the mean number of days with poor mental health is -0.343 for those who exercise compared to those who do not, holding all other predictors in the model constant. —
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Sleep only", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + age and sex", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: Full (sleep_hrs + age + sex + physhlth_days + income_cat + exercise)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| M1: Sleep only | 1 | 0.0197 | 0.0195 |
| M2: + age and sex | 2 | 0.0504 | 0.0498 |
| M3: Full (sleep_hrs + age + sex + physhlth_days + income_cat + exercise) | 6 | 0.1569 | 0.1559 |
The full model explains the most variance in mental health days as evidenced by the largest r2 and adjusted r2 values (0.157 and 0.156).
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy? rmse_m3 <- round(summary(m3)$sigma, 2) cat(“Root MSE (Model 3):”, rmse_m3, “mentally unhealthy days”) The root MSE is 7.09, which suggests that the model is off by 7.09 mentally unhealthy days.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## physhlth_days 1 26933 26933.0 535.7786 < 2.2e-16 ***
## age 1 9261 9261.5 184.2385 < 2.2e-16 ***
## income_cat 1 2649 2648.8 52.6918 4.503e-13 ***
## sex 1 1918 1918.4 38.1626 7.025e-10 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Overall Model Summary — Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| R² | 0.1569 |
| Adjusted R² | 0.1559 |
| Residual Std. Error (Root MSE) | 7.0901 |
| F-statistic | 154.8953 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n − p − 1) | 4993.0000 |
| n (observations) | 5000.0000 |
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model C | 6 | 46718 | 46717.3 | 154.90 |
| Residual | 4993 | 250993 | 50.3 | 154.90 |
| Total | 5000 | 297711 | 46767.6 | 154.90 |
**State the null hypothesis for the overall F-test and your conclusion. The null hypothesis is that none of the covariates have a linear relationship with mental health days. The overall F-test is 15.90 with a p-value of <0.05, whcih means that at least one of these covariatres does have a linear relationship with mental health days so we can reject the null.
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
cooks_d <- cooks.distance(m3)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)Residual vs. fitted: Checks for linearity and equal variance. We want a horizontal red line and random scatter with no pattern. This graphs shows the spread increasing with fitted values indicating heteroscedasticity.
Normal Q-Q: Checks for normality of residuals. Points should fall approximately along the 45° reference line. The heavy tail suggests non-normality.
Scale-Location: Another check for equal variance or homoscedasticity. A flat line indicates constant variance. The graph shows increases along fitted values suggesting heteroscedasticity.
Cook’s Distance: Check for influential observations using Cook’s distance. Points above a Cook’s distance of 0.5 or 1 are considered to have high influence.
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. Normality and equal variance are likely to be violated in this analysis. This does not invalidate the analysis because the sample size is large (n=5000) and the covariate group sizes are relatively balanced.
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? There are no observations with Cook’s distance >1. If there were I would remove them as they would be highly influential. —
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
This analysis aimed to study the associations between poor mental health and sleep, physical health, age, income, sex, and exercise. The results show that all of these factors play a role in poor mental health days, except for exercise which was found to have no association. Higher sleep, age, and income all reduced the average number of poor mental health days, while a lower number of days with poor physical health and being female reduced poor mental health days. It’s important to note that since this analysis only took a snapshot of a sample group of people, we cannot determine whther these variables are impacting mental health or if mental health is simply assoicated with them. Further studies that follow-up with their study participants can help determine this association.