Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
library(dplyr)
library(gtsummary)
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep hours per night",
age ~ "Age (years)",
income_cat ~ "Household income (1 = <$10k to 8 = >$75k)",
sex ~ "Sex",
exercise ~ "Any physical activity in past 30 days"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
| Characteristic |
N |
N = 5,000 |
| Mentally unhealthy days (past 30) |
5,000 |
3.8 (7.7) |
| Physically unhealthy days (past 30) |
5,000 |
3.3 (7.8) |
| Sleep hours per night |
5,000 |
7.1 (1.3) |
| Age (years) |
5,000 |
54.3 (17.2) |
| Household income (1 = <$10k to 8 = >$75k) |
5,000 |
|
| 1 |
|
190 (3.8%) |
| 2 |
|
169 (3.4%) |
| 3 |
|
312 (6.2%) |
| 4 |
|
434 (8.7%) |
| 5 |
|
489 (9.8%) |
| 6 |
|
683 (14%) |
| 7 |
|
841 (17%) |
| 8 |
|
1,882 (38%) |
| Sex |
5,000 |
|
| Male |
|
2,331 (47%) |
| Female |
|
2,669 (53%) |
| Any physical activity in past 30 days |
5,000 |
3,874 (77%) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
library(ggplot2)
library(plotly)
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)
Interpretation:
The histogram shows a right-skewed distribution, with many
respondents reporting 0 mentally unhealthy days and fewer reporting
higher values. This indicates the outcome is not normally distributed,
but with a large sample (n = 5,000) linear regression is generally
robust.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
library(GGally)
library(dplyr)
# Scatterplot matrix of continuous variables
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)

Interpretation:
There is a moderate positive correlation between physically unhealthy
days and mentally unhealthy days. Sleep hours show a weak negative
correlation with mentally unhealthy days, while age appears to have
little association with the outcome.
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# Simple linear regression
model_sleep <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_sleep)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
Fitted regression equation
menthlth_days=9.47−0.80(sleep_hrs)
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon).
Each additional hour of sleep per night is associated with about 0.8
fewer mentally unhealthy days in the past 30 days, on average.
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test? Null
hypothesis (H₀): β₁ = 0 (sleep hours are not associated with mentally
unhealthy days) Alternative hypothesis (H₁): β₁ ≠ 0 t-statistic: −10.02
p-value:2.2e-16 (<0.001) Degrees of freedom: 4998 Conclusion: We
reject the null hypothesis and conclude that sleep hours are
significantly associated with mentally unhealthy days. —
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# Model A
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C
mC <- lm(menthlth_days ~ sleep_hrs + age + sex +
physhlth_days + income_cat + exercise,
data = brfss_mlr)
summary(mA)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.306 -3.980 -2.515 -0.307 32.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.76563 0.64447 18.256 < 2e-16 ***
## sleep_hrs -0.73391 0.07932 -9.253 < 2e-16 ***
## age -0.06648 0.00622 -10.688 < 2e-16 ***
## sexFemale 1.53992 0.21339 7.217 6.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared: 0.05036, Adjusted R-squared: 0.04979
## F-statistic: 88.32 on 3 and 4996 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
library(dplyr)
library(broom)
library(knitr)
library(kableExtra)
tribble(
~Model, ~`Sleep (hours/night) β̂`, ~`95% CI`, ~`Adj. R²`,
"Model A (unadjusted)", round(coef(mA)[2], 3),
paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
round(summary(mA)$adj.r.squared, 3),
"Model B (+ age, sex)", round(coef(mB)[2], 3),
paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
round(summary(mB)$adj.r.squared, 3),
"Model C (full)", round(coef(mC)[2], 3),
paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
round(summary(mC)$adj.r.squared, 3)
) %>%
kable(caption = "Table 2. Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 2. Sleep Coefficient Across Sequential Models
|
Model
|
Sleep (hours/night) β
|
|95% CI
|
Adj. R
|
|
Model A (unadjusted)
|
-0.804
|
(-0.962, -0.647)
|
0.020
|
|
Model B (+ age, sex)
|
-0.734
|
(-0.889, -0.578)
|
0.050
|
|
Model C (full)
|
-0.509
|
(-0.657, -0.361)
|
0.156
|
Interpretation The sleep coefficient remains
negative across Models A, B, and C, indicating that more sleep is
associated with fewer mentally unhealthy days. If the coefficient
changes only slightly after adding covariates, this suggests little
confounding. Larger changes would suggest that age, sex, physical
health, income, or exercise partially confound the relationship between
sleep and mental health.
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
tidy(mC, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"sexFemale" = "Sex: Female (ref = Male)",
"physhlth_days"= "Physical health days",
"income_cat" = "Income (ordinal 1–8)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple
Predictors (BRFSS 2020, n = 5,000)
|
Term
|
Estimate (β̂
|
Std. Erro
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
|
Sex: Female (ref = Male)
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
|
Income (ordinal 1–8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
|
Exercise: Yes (ref = No)
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
Interpretation Mental Health
Daysˆ=12.475+0.292(Phys. Health
Days)+−0.509(Sleep)+−0.082(Age)+−0.321(Income)+1.245(Female)+−0.343(Exercise:
Yes) Physical health days (β̂ = 0.292): Each additional physically
unhealthy day is associated with 0.292 more mentally unhealthy days on
average, holding sleep, age, sex, and other variables constant (95% CI:
0.265 to 0.318). This indicates a strong positive relationship between
physical and mental health.
Sleep hours (β̂ = −0.509): Each additional hour of sleep per night is
associated with 0.509 fewer mentally unhealthy days on average,
adjusting for other variables (95% CI: −0.657 to −0.361). This suggests
that more sleep is associated with better mental health.
Age (β̂ = −0.082): Each additional year of age is associated with
0.082 fewer mentally unhealthy days on average, holding other variables
constant (95% CI: −0.094 to −0.071).
Income category (β^= -0.321): Each one-unit increase in the income
category (on the 1–8 ordinal scale) is associated with 0.321 fewer
mentally unhealthy days on average, consistent with the well-established
socioeconomic gradient in mental health.
Sex: Female (β̂ = 1.245): Compared with males (reference group),
females report 1.245 more mentally unhealthy days on average, adjusting
for other predictors (95% CI: 0.848 to 1.642).
Exercise: Yes (β^= -0.343): People who engaged in any physical
activity in the past 30 days report an estimated 0.343 fewer mentally
unhealthy days compared to those who did not exercise, adjusting for all
other covariates.
Intercept (β̂₀ = 12.476): The intercept represents the estimated
number of mentally unhealthy days for a person with 0 sleep hours, age
0, and 0 physically unhealthy days who is male. This value serves mainly
as a mathematical baseline and is not meaningful in practice.
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"Model A: Sleep only", 1,
round(summary(mA)$r.squared, 4),
round(summary(mA)$adj.r.squared, 4),
"Model B: + Age, Sex", 3,
round(summary(mB)$r.squared, 4),
round(summary(mB)$adj.r.squared, 4),
"Model C: Full (+ physical health, income, exercise)", 6,
round(summary(mC)$r.squared, 4),
round(summary(mC)$adj.r.squared, 4)
) %>%
kable(caption = "Table 4. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 4. R² and Adjusted R² Across Sequential Models
|
Model
|
Predictors
|
R2
|
Adj. R²
|
|
Model A: Sleep only
|
1
|
0.0197
|
0.0195
|
|
Model B: + Age, Sex
|
3
|
0.0504
|
0.0498
|
|
Model C: Full (+ physical health, income, exercise)
|
6
|
0.1569
|
0.1559
|
Interpretation Model A shows that sleep alone
explains about 2.0% of the variation in mentally unhealthy days. When
age and sex are added in Model B, the explained variance increases to
about 5.0%, suggesting these variables contribute additional
information. Model C, which includes physical health, income, and
exercise, explains about 15.7% of the variance, indicating substantially
improved model fit. Therefore, Model C explains the most variance in
mental health days.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
rmse_m3 <- round(summary(m3)$sigma, 2)
cat("Root MSE (Model 3):", rmse_m3, "mentally unhealthy days\n")
## Root MSE (Model 3): 7.09 mentally unhealthy days
Interpretation The Root MSE for Model C is 7.09
days, meaning that the predicted number of mentally unhealthy days
typically differs from the observed value by about 7 days on average.
This reflects the typical prediction error of the model when estimating
mentally unhealthy days for individuals.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
| Model |
6 |
46,719 |
7,786.5 |
154.8 |
| Residual |
4,993 |
250,993 |
50.3 |
|
| Total |
4,999 |
297,712 |
|
|
State the null hypothesis for the overall F-test and your
conclusion.
Null hypothesis (H₀): All regression coefficients are equal to zero.
β1=β2=β3=β4=β5=β6=0
Alternative hypothesis (H₁): At least one predictor has a non-zero
coefficient.
Conclusion Because the overall F statistic is very
large (F ≈ 154.8) and the p-value < 0.001, we reject the null
hypothesis and conclude that the predictors in Model C jointly explain
variation in mentally unhealthy days.
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Interpretation 1. Residuals vs Fitted (Linearity
& Homoscedasticity) The residuals are generally scattered around
zero, suggesting the linear relationship assumption is mostly
reasonable. However, there is a slight pattern and funnel shape,
indicating some heteroscedasticity (non-constant variance) as fitted
values increase.
Normal Q–Q Plot (Normality of residuals) Most points follow the
diagonal line, but the upper tail deviates noticeably, indicating the
residuals are not perfectly normally distributed. This is expected
because the outcome variable is bounded (0–30) and
right-skewed.
Scale–Location Plot (Equal variance) The upward trend of the red
line suggests that the spread of residuals increases with fitted values,
providing additional evidence of mild heteroscedasticity.
Cook’s Distance (Influential observations) Most Cook’s distance
values are very small and well below 1, indicating no highly influential
observations. A few observations stand out slightly but are not extreme
enough to strongly influence the model.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
Because the outcome is bounded (0–30) and right-skewed, the normality
and homoscedasticity assumptions are most likely violated. However, this
does not invalidate the analysis because the large sample size (n =
5,000) makes linear regression relatively robust to these violations.
The model can still provide useful estimates of associations, although
alternative models (e.g., Poisson or negative binomial regression) could
also be considered for count-like outcomes.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
There are 0 observations with Cook’s Distance > 1. In practice,
influential observations would be checked for data errors and evaluated
with sensitivity analyses, but not removed without justification.
Task 6: Interpretation for a Public Health Audience (10 points)
Suppose you are writing the results section of a public health paper.
Write a 3–4 sentence paragraph summarizing the findings
from Model C for a non-statistical audience. Your paragraph should:
- Identify which predictors were significantly associated with mental
health days
- State the direction and approximate magnitude of the most important
associations
- Appropriately caveat the cross-sectional nature of the data (no
causal language)
- Not use any statistical jargon (no “significant,” “coefficient,”
“p-value”)
In this study of 5,000 adults, several factors were related to the
number of mentally unhealthy days reported in the past month. People who
reported more physically unhealthy days also reported substantially more
mentally unhealthy days, while those who slept longer and those with
higher household income tended to report fewer mentally unhealthy days.
Older adults generally reported slightly fewer mentally unhealthy days,
whereas women reported about one more mentally unhealthy day per month
than men. Because the data are cross-sectional, these findings describe
associations only and cannot determine cause-and-effect
relationships.
End of Lab Activity