Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
brfss_mlr <- readRDS(
"C:/Users/Alyss/OneDrive/Desktop/epi553/data/brfss_mlr_2020.rds"
)
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age, income_cat, sex, exercise) %>%
# Create a summary table
tbl_summary(
# Label variables in table
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_cat ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
# If continuous, report mean and sd
# If categorical, report n and percent
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
# Round continuous variables to nearest tenth
digits = all_continuous() ~ 1,
# Do not display missing values row
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics - BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics - BRFSS 2020 Analytic Sample (n = 5,000)
| Characteristic |
N |
N = 5,000 |
| Mentally unhealthy days (past 30) |
5,000 |
3.8 (7.7) |
| Physically unhealthy days (past 30) |
5,000 |
3.3 (7.8) |
| Sleep (hours/night) |
5,000 |
7.1 (1.3) |
| Age (years) |
5,000 |
54.3 (17.2) |
| Household income |
5,000 |
|
| 1 |
|
190 (3.8%) |
| 2 |
|
169 (3.4%) |
| 3 |
|
312 (6.2%) |
| 4 |
|
434 (8.7%) |
| 5 |
|
489 (9.8%) |
| 6 |
|
683 (14%) |
| 7 |
|
841 (17%) |
| 8 |
|
1,882 (38%) |
| Sex |
5,000 |
|
| Male |
|
2,331 (47%) |
| Female |
|
2,669 (53%) |
| Any physical activity (past 30 days) |
5,000 |
3,874 (77%) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
# Create histogram
menth_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "forestgreen", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(menth_hist)
Describe the shape of the distribution. Is it symmetric,
right-skewed, or left-skewed?
The shape of the distribution is right-skewed with most of the
responses on mental unhealthiness being 0.
What are the implications of this shape for regression modeling?
The implications of this skewed shape shows it is does not follow
normality; the skewedness and amount of outliers shows that regular
regression modeling may not provide the best model of the data.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
# Create scatterplot matrix from ggpairs()
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "darkred", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Scatter Pairwise Relationship among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)

Comment on the direction and strength of each pairwise correlation
with the outcome.
Physical Health Days and Mental Health Days: There is a moderate
positive correlation between physical health days and the outcome.
Sleep per night and Mental Health Days: There is a weak negative
correlation between sleep per night and the outcome.
Age and Mental Health Days: There is a slightly greater but still
weak negative correlation between age and the outcome compared to sleep
per night and the outcome.
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# Fit linear regression model
lr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Show summary of linear regression
summary(lr)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
# Create tidy table to display results
tidy(lr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Mental Health Days ~ Sleep per night (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Simple Linear Regression: Mental Health Days ~ Sleep per night (BRFSS
2020)
|
Term
|
Estimate
|
Std. Error
|
t-statistic
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
(Intercept)
|
9.4743
|
0.5771
|
16.4165
|
0
|
8.3429
|
10.6057
|
|
sleep_hrs
|
-0.8042
|
0.0802
|
-10.0218
|
0
|
-0.9616
|
-0.6469
|
Fitted Regression Equation: \[Mental
Health Days = 9.474 + -0.804*Sleep Per Night \]
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon).
For each 1-hour increase in sleep per night, poor mental health days
decreases by 0.804, on average, when everything else is held
constant.
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
\[ H_0:\beta_1=0 \] (No linear
relationship between X and Y)
\[
H_1=\beta_1\neq0
\] (There is a linear relationship)
With a p-value of <2.2e-16 and a t-statistic of -10.022, we reject
H0 at the alpha = 0.05 level. There is statistically significant
evidence of a linear association between sleep per night and mental
health days. The degree of freedom is 4998.
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
#Model A: menthlth_days ~ sleep_hrs
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B: menthlth_days ~ sleep_hrs + age + sex
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise, data = brfss_mlr)
# Table for m3
tidy(m3, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"sex" = "Sex",
"physhlth_days" = "Physical health days",
"income_cat" = "Income (ordinal 1-8)",
"exercise" = "Exercise"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB")
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple
Predictors (BRFSS 2020, n = 5,000)
|
Term
|
Estimate
|
Std. Error
|
t-statistic
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
|
sexFemale
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
|
Income (ordinal 1-8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
|
exerciseYes
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
# Compare the sleep_hrs coefficient across models.
tribble(
~Model, ~`Sleep Per Night β`, ~`95% CI`, ~`Adj. R²`, "M1 (unadjusted)", round(coef(m1)[2], 3), paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (+ age and sex)", round(coef(m2)[2], 3),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (full)", round(coef(m3)[2], 3),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Table 2. Sleep per night Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 2. Sleep per night Coefficient Across Sequential Models
|
Model
|
Sleep Per Night β
|
95% CI
|
Adj. R²
|
|
M1 (unadjusted)
|
-0.804
|
(-0.962, -0.647)
|
0.020
|
|
M2 (+ age and sex)
|
-0.734
|
(-0.889, -0.578)
|
0.050
|
|
M3 (full)
|
-0.509
|
(-0.657, -0.361)
|
0.156
|
Does the sleep coefficient change substantially when you add more
covariates? What does this suggest about confounding?
As more covariates are added, the sleep per night coefficient
changed. This suggests that the original regression model was biased and
the true relationship was distorted by the confounders.
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
\[\widehat{\text{Mental Health Days}} =
12.476 + -0.509(\text{Sleep per night}) + -0.082(\text{age}) +
1.245(\text{sex}) + 0.291(\text{Phys. Health Days}) +
-0.321(\text{Income}) + -0.343(\text{Exercise})\]
Sleep per night (β = -0.509): For every 1-hour increase in sleep,
days with poor mental health decrease by 0.509 on average, holding age,
sex, physical health days, income, and exercise constant (95% CI: -0.657
to -0.361). Getting better sleep reduces poor mental health days.
Age (β = -0.082): For every year increase in age, days with poor
mental health decrease by 0.082 on average, holding all other covariates
constant (95% CI: -0.094 to -0.071). Older adults have less bad mental
health days due to potential better emotional regulation or survivor
bias.
Sex (β = 1.245): Compared to males, females have on average 1.245
more bad mental health days (95% CI: 0.848 to 1.64).
Physical Health Days (β = 0.291): For every 1 day increase in poor
physical health, poor mental health increases by 0.291 days. (95% CI:
0.265 to 0.318).
Income (β = -0.321): For every 1 category increase in income, poor
mental health decreases by 0.321 days. (95% CI: -0.423 to -0.219).
Higher Income level may make individuals feel less financial stress so
poor mental health days are less often.
Exercise (β = -0.343): Adults who engage in any type of activity
compared to those who have done no exercise had on average 0.343 less
poor mental health days. (95% CI: -0.839 to -0.154). Exercise has been
proven to improve mood and would impact mental health.
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Sleep per night only", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + Age, Sex", 3, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: Full (+ phys health, income, exercise)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
R² and Adjusted R² Across Sequential Models
|
Model
|
Predictors
|
R2
|
Adj. R²
|
|
M1: Sleep per night only
|
1
|
0.0197
|
0.0195
|
|
M2: + Age, Sex
|
3
|
0.0504
|
0.0498
|
|
M3: Full (+ phys health, income, exercise)
|
6
|
0.1569
|
0.1559
|
Which model explains the most variance in mental health days?
Model 3 explains the most variance in mental health days.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
root_mse_m3 <- round(summary(m3)$sigma, 2)
cat("Root MSE (Model 3):", root_mse_m3, "mentally unhealthy days\n")
## Root MSE (Model 3): 7.09 mentally unhealthy days
The root mse tells us how “off” the model’s prediction.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.157 0.156 7.09 155. 6.48e-181 6 -16885. 33785. 33837.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
| Model |
6 |
297,712 |
46,768.8 |
929.7972 |
| Residual |
4993 |
250,993 |
50.3 |
|
| Total |
4999 |
548.705 |
|
|
State the null hypothesis for the overall F-test and your
conclusion.
Null Hypothesis(H0): All regression coefficients are equal to zero
(No predictors are significant). The p-values of sleep per night, age,
sex, physical health days, and income are significant while exercise is
not, meaning we can reject the null and say at least one of the
predictors is significant.
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("green", alpha.f = 0.3), pch = 16)

Residuals vs. Fitted: There looks to be pattern so linearity can not
be assumed but large sample n = 5,000 gives robustness.
Normal Q-Q plot: Points follow the diagonal line but deviate on the
tails. Normality may be challenged but can still be assumed to sample
size.
Scale-Location: The red line is not flat and points show pattern.
Homoscedasticity can not be assumed.
Cooks Distance: There are some points that appear to be beyond the
0.5 or 1.0 contour. We can not assume there are not influential
outliers.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
The assumptions that are most likely to be violated are linearity,
normality, and influential outliers because the skewness of the data.
However, the sample size is large, it provides robustness to the
data.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
There are three observations with Cook’s Distance > 1. In a real
analysis, I would first check to see if the data is valid or recorded
correctly. Once validity is confirmed I would then see what removing the
data would do or changing it would do and report it.
Task 6: Interpretation for a Public Health Audience (10 points)
Suppose you are writing the results section of a public health paper.
Write a 3–4 sentence paragraph summarizing the findings
from Model C for a non-statistical audience. Your paragraph should:
- Identify which predictors were significantly associated with mental
health days
- State the direction and approximate magnitude of the most important
associations
- Appropriately caveat the cross-sectional nature of the data (no
causal language)
- Not use any statistical jargon (no “significant,” “coefficient,”
“p-value”)
The variables that were most associated with poor mental health days
were hours of sleep per night, age, sex, poor physical health days, and
household income. Sleep per night had an inverse relationship with poor
mental health days; when someone slept more during the night, they had
less days feeling mentally unhealthy. The data used was cross-sectional
and was taken all at the same time; we cannot say what came first or
last.
End of Lab Activity