Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
# Load the dataset
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
brfss_full <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/LLCP2020.XPT") %>%
clean_names()
# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise, bmi) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)",
bmi ~ "Body Mass Index (kg/m^2)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
| Characteristic |
N |
N = 5,000 |
| Mentally unhealthy days (past 30) |
5,000 |
3.8 (7.7) |
| Physically unhealthy days (past 30) |
5,000 |
3.3 (7.8) |
| Sleep (hours/night) |
5,000 |
7.1 (1.3) |
| Age (years) |
5,000 |
54.3 (17.2) |
| Household income |
5,000 |
|
| <$10k |
|
190 (3.8%) |
| $10-15k |
|
169 (3.4%) |
| $15-20k |
|
312 (6.2%) |
| $20-25k |
|
434 (8.7%) |
| $25-35k |
|
489 (9.8%) |
| $35-50k |
|
683 (14%) |
| $50-75k |
|
841 (17%) |
| >$75k |
|
1,882 (38%) |
| Sex |
5,000 |
|
| Male |
|
2,331 (47%) |
| Female |
|
2,669 (53%) |
| Any physical activity (past 30 days) |
5,000 |
3,874 (77%) |
| Body Mass Index (kg/m^2) |
4,706 |
28.4 (6.4) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)
This distribution is right skewed, most individuals had 0 mentally
unhealthy days. This might affect the multiple linear regression model,
which relies on the assumption of normality.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)

Mental health days is positively correlated with physical health
days, with an r of 0.315 suggesting a moderate correlation. Mental
health days is negatively correlated with sleep, with an r of -0.140
suggesting a weak correlation. Mental health days is negatively
correlated with age as well, with an r of -0.156 suggesting a weak
correlation. The other pairwise correlations were statistically
significant, but with low magnitude suggesting weak correlations.
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# Fit simple linear regression: BMI ~ Age
SLR <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Summary output
summary(SLR)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
menthlth_days = (-0.80424 * sleep_hrs) + 9.47429
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon).
For every 1 hr increase in sleep, there is a reduction of 0.80 poor
mental health days per month.
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
H0: β = 0; there is no linear relationship between menthlth_days and
sleep_hrs H1: β ≠ 0; there is a linear relationship between
menthlth_days and sleep_hrs
T = -10.02 p < 2e16 df = 4998
I reject the null hypothesis and conclude there is a linear
relationship between menthlth_days and sleep_hrs.
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# Model 1: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: Add sleep
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
tribble(
~Model, ~`Sleep Hours β̂`, ~`95% CI`, ~`Adj. R²`,
"MA (unadjusted)", round(coef(mA)[2], 3),
paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
round(summary(mA)$adj.r.squared, 3),
"MB (+sleep)", round(coef(mB)[2], 3),
paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
round(summary(mB)$adj.r.squared, 3),
"MC (full)", round(coef(mC)[2], 3),
paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
round(summary(mC)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Sleep Hours Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 4. Sleep Hours Coefficient Across Sequential Models
|
Model
|
Sleep Hours β
|
|95% CI
|
Adj. R
|
|
MA (unadjusted)
|
-0.804
|
(-0.962, -0.647)
|
0.020
|
|
MB (+sleep)
|
-0.734
|
(-0.889, -0.578)
|
0.050
|
|
MC (full)
|
-0.509
|
(-0.657, -0.361)
|
0.156
|
The sleep coefficient does change substantially when more covariates
are added, suggesting that these variables confound the relationship
between sleep and poor mental health days, bringing it further from the
null.
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.306 -3.980 -2.515 -0.307 32.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.76563 0.64447 18.256 < 2e-16 ***
## sleep_hrs -0.73391 0.07932 -9.253 < 2e-16 ***
## age -0.06648 0.00622 -10.688 < 2e-16 ***
## sexFemale 1.53992 0.21339 7.217 6.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared: 0.05036, Adjusted R-squared: 0.04979
## F-statistic: 88.32 on 3 and 4996 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
anova_mC <- anova(mC)
print(anova_mC)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(mC) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model C") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Overall Model Summary — Model C
|
Statistic
|
Value
|
|
R²
|
0.1569
|
|
Adjusted R²
|
0.1559
|
|
Residual Std. Error (Root MSE)
|
7.0901
|
|
F-statistic
|
154.8953
|
|
p-value (overall F-test)
|
0.0000
|
|
Model df (p)
|
6.0000
|
|
Residual df (n − p − 1)
|
4993.0000
|
|
n (observations)
|
5000.0000
|
menthlth_days = 12.475489 + (-0.509160 * sleep_hrs) + (-0.082307 *
age) + (1.245053 * sexFemale) + (0.291657 * physhlth_days) + (-0.321323
* income_cat) + (-0.342685 * exerciseYes)
For every 1 hr increase in sleep, there is a reduction of 0.51 poor
mental health days per month.
For every 1 yr increase in age, there is a reduction of 0.08 poor
mental health days per month.
Being female is associated with 1.25 more poor mental health days per
month compared to males.
For every 1 day increase in the number of poor physical health days
per month, there is an increase of 0.29 poor mental health days per
month.
For every 1 income category increase, there is a reduction of 0.32
poor mental health days per month.
Exercising is associated with a reduction of 0.34 poor mental health
days per month compared with not exercising.
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
R2 Model A: 0.0197 Model B: 0.05036 Model C: 0.1569
Adjusted R2 Model A: 0.0195 Model B: 0.04979 Model C: 0.1559
| Model A |
0.0197 |
0.0195 |
| Model B |
0.05036 |
0.04979 |
| Model C |
0.1569 |
0.1559 |
Model C explains the most variance in poor mental health days.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
The root MSE for model C is 7.09. Root MSE is the average of the
deviations from the regression line, so it tells you how well the
regression line fits the data. Thus, for any given sleep hours, the
average deviation from the regression line is 7.09 poor mental health
days per month, suggesting the regression line is not great for
prediction accuracy.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
| Model |
6 |
46747.40154 |
7791.23359 |
154.8953 |
| Residual |
4993 |
250993 |
50.3 |
|
| Total |
4999 |
297740.40154 |
|
|
State the null hypothesis for the overall F-test and your
conclusion.
H0: β1 = β2 = β3 = β4 = β5 = β6; none of the predictors matter
The p value for the overall F test was < 0.05, thus I reject the
null and conclude that the predictors explain more variation than
expected by chance.
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(mC, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)

Residuals vs. Fitted: we do not see random scatter around zero, but a
clear downward trend - non-equal variance of residuals.
Q-Q: the points deviate substantially from the line - non-normal
distribution of residuals.
Scale-Location: line is not flat - variance is not constant.
Cook’s distance: no points are greater than 0.5 or 1 Cook’s distance
- no influential outliers.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
GIven the nature of the outcome, normality and equal variance of
residuals are likely to violated. However, this does not invalidate the
analysis because our sample size is large at n=5000.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
augmented <- augment(mC)
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(mC),
influential = ifelse(cooks_d > 1, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 1)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 1, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooks

There are zero observations with Cook’s distance > 1. In a real
analysis, these would be removed.
Task 6: Interpretation for a Public Health Audience (10 points)
Suppose you are writing the results section of a public health paper.
Write a 3–4 sentence paragraph summarizing the findings
from Model C for a non-statistical audience. Your paragraph should:
We found that hours of sleep, age, sex, days with poor physical
health, and income category were associated with number of poor mental
health days. Specifically, we found that 1) For every 1 hr increase in
sleep, there is a reduction of 0.51 poor mental health days per month;
2) For every 1 yr increase in age, there is a reduction of 0.08 poor
mental health days per month; 3) Being female is associated with 1.25
more poor mental health days per month compared to males, 4) For every 1
day increase in the number of poor physical health days per month, there
is an increase of 0.29 poor mental health days per month, and 5) For
every 1 income category increase, there is a reduction of 0.32 poor
mental health days per month. It is important to note that the data was
obtained from BRFSS, a cross-sectional study. This means that data was
collected at a single point in time, so these results do indicate
temporality or causality (that is, we cannot say that any of these
variables are CAUSING a change in poor mental health days).
- Identify which predictors were significantly associated with mental
health days
- State the direction and approximate magnitude of the most important
associations
- Appropriately caveat the cross-sectional nature of the data (no
causal language)
- Not use any statistical jargon (no “significant,” “coefficient,”
“p-value”)
End of Lab Activity