Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age, income_f, exercise, sex) %>%
tbl_summary(
label = list(
menthlth_days ~"Mentally unhealthy days (past 30)",
physhlth_days ~"Physically unhealthy days (past 30)",
sleep_hrs ~"Sleep (hours/night)",
age ~"Age (years)",
income_f ~ "Household income",
sex ~"Sex",
exercise ~"Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits= all_continuous() ~ 1,
missing= "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1.Descriptive Statistics - BRFSS 2020 Analytic Sample (n=5,000)**")
Table 1.Descriptive Statistics - BRFSS 2020 Analytic Sample (n=5,000)
| Characteristic |
N |
N = 5,000 |
| Mentally unhealthy days (past 30) |
5,000 |
3.8 (7.7) |
| Physically unhealthy days (past 30) |
5,000 |
3.3 (7.8) |
| Sleep (hours/night) |
5,000 |
7.1 (1.3) |
| Age (years) |
5,000 |
54.3 (17.2) |
| Household income |
5,000 |
|
| <$10k |
|
190 (3.8%) |
| $10-15k |
|
169 (3.4%) |
| $15-20k |
|
312 (6.2%) |
| $20-25k |
|
434 (8.7%) |
| $25-35k |
|
489 (9.8%) |
| $35-50k |
|
683 (14%) |
| $50-75k |
|
841 (17%) |
| >$75k |
|
1,882 (38%) |
| Any physical activity (past 30 days) |
5,000 |
3,874 (77%) |
| Sex |
5,000 |
|
| Male |
|
2,331 (47%) |
| Female |
|
2,669 (53%) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x= menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha =0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n=5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
)+
theme_minimal(base_size = 13)
ggplotly(p_hist)
** The outcome is right-skewed. Most respondents report zero mentally
unhealthy days, with a long tail toward 30.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
**Mental health day and physical health days are negatively correlated
with sleep duration, indicating more sleep is associated fewer days or
poor physical health and mental health. Mental health days are slightly
negatively correlated with age, while physical health days and sleep
duration are positively correlated with age. —
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
#fitted regression equation
model_mlr <- lm(menthlth_days ~ sleep_hrs, data=brfss_mlr)
#summary output
summary(model_mlr)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
#coefficient table
tidy(model_mlr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.,4))) %>%
kable(
caption = "Simple Linear Regression: menthlth_days ~ sleep_hrs(BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistics", "p-value", "95% CI Lower",
"95% CI Upper"),
align = "lrrrrrrr"
)%>%
kable_styling(bootstrap_options = c ("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Simple Linear Regression: menthlth_days ~ sleep_hrs(BRFSS 2020)
|
Term
|
Estimate
|
Std. Error
|
t-statistics
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
(Intercept)
|
9.4743
|
0.5771
|
16.4165
|
0
|
8.3429
|
10.6057
|
|
sleep_hrs
|
-0.8042
|
0.0802
|
-10.0218
|
0
|
-0.9616
|
-0.6469
|
** hat{menthlth_days} = 9.4743 -0.8042x(sleep_hrs)
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon). For each additional hour of sleep per night, the average number
of poor mental health days in the past 30 days decreases by about 0.8
days.
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
H0=𝛽1 = 0 (There is no linear relationship between poor mental health
days and sleep hours) HA=𝛽1 ≠ 0 (There is a linear relationship between
poor mental health days and sleep hours) t-statistic: -10.0218
p-value< 0.05 reject the null hypothesis This suggest that sleep
hours is associated with the poor mental health days. Specifically,
people with more sleeping hours tend to report fewer poor mental health
days, about 0.8 fewer days for each additional an hour of sleep on
average. df= 4998
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
m3<- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise, data=brfss_mlr)
summary(m3)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
tidy(m3, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB") # highlight key predictors
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple
Predictors (BRFSS 2020, n = 5,000)
|
Term
|
Estimate (β̂
|
Std. Erro
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
|
Sex: Female (ref = Male)
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
|
Income (ordinal 1-8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
|
Exercise: Yes (ref = No)
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
**Yes, the sleep coefficient decreased from -0.8042 in the simple
regression model to -0.5092 in the multiple regression model after
adjusting for covariates. In the simple regression model, each
additional hour of sleep was associated with 0.8 fewer poor mental
health days. After controlling for age, sex, physical unhealthy days,
income and exercise in the multiple regression model, each additional
hour of sleep was associated with about 0.51 fewer poor mental health
days. This change suggest that some confounding was present. The
association remains statistically significant (p <0.05).
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health report. **
hat{menthlth_days} = 12.4755 -0.5092x(sleep_hrs)
-0.0823x(age)-1.2451x(sex)- 0.2917x(physical health days)-
0.3213x(income)- 0.3427x(exericise)
Sleep: For each additional hour of sleep per night, the expected
number of poor mental health days in the past 30 days decreases by
0.5092 days holding all other variable constant.
Age: For each additional year of age, the expected number of poor
mental health days decreases by 0.0823 days.
Sex: Female reported an estimated 1.25 more poor mental health days
compared to males.
Physical health days: Each additional poor physical health days is
associated with an additionally 0.2917 poor mental health days.
Income: For each one category increase in the income level, the
expected number of poor mental health days decreased by 0.32 days.
Exercise: People who engage in physical activity in the past 30 days
reported about 0.3427 fewer poor mental health days compared to those
who did not exercise. **
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Unadjusted", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + Sleep", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: (Full)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "Table 4. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 4. R² and Adjusted R² Across Sequential Models
|
Model
|
Predictors
|
R2
|
Adj. R²
|
|
M1: Unadjusted
|
1
|
0.0197
|
0.0195
|
|
M2: + Sleep
|
2
|
0.0504
|
0.0498
|
|
M3: (Full)
|
6
|
0.1569
|
0.1559
|
**Model 3 Full explains the most variance in mental health days.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
rmse_m3 <- round(summary(m3)$sigma, 2)
cat("Root MSE (Model 3):", rmse_m3, "mentally unhealthy days\n")
## Root MSE (Model 3): 7.09 mentally unhealthy days
**The Root MSE for model c is 7.09 mentally unhealthy days. This
indicates that model’s predicted number of mentally unhealthy days
differs from the observed value by 7 days.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
| Model |
? |
? |
? |
? |
| Residual |
? |
? |
? |
|
| Total |
? |
? |
|
|
State the null hypothesis for the overall F-test and your
conclusion.
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Overall Model Summary — Model 3
|
Statistic
|
Value
|
|
R²
|
0.1569
|
|
Adjusted R²
|
0.1559
|
|
Residual Std. Error (Root MSE)
|
7.0901
|
|
F-statistic
|
154.8953
|
|
p-value (overall F-test)
|
0.0000
|
|
Model df (p)
|
6.0000
|
|
Residual df (n − p − 1)
|
4993.0000
|
|
n (observations)
|
5000.0000
|
**
| Model |
6 |
46,719 |
7786.5 |
154.8953 |
| Residual |
4,993 |
250,993 |
50.3 |
|
| Total |
4,499 |
297,712 |
|
|
** The null hypothesis for the overall F-test is no predictors
matters: H0: beta1=beta2=beta3…=0. F = 154.8953 and p<0.05
statistically significant. Therefore, we reject the null hypothesis
since F value is large, indicating the predictors explain a significant
portion of the variation in poor mental health days. —
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
**The red line in Residuals vs Fitted plot is not horizontal, suggesting
potential non-linearity.The Q-Q Residuals plot shows that the residuals
deviate from the reference line, indicating they are not perfectly
normally distributed and may be right-skewed. The scale location plot is
not flat suggesting heteroscedasticity. The cook’s distance plot shows
three potential influential observations that could have a strong impact
on the model fit.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain. **Given the outcome, the assumptions most likely to
be violated are the linearity and normality of residuals assumptions.
The relationship between predictors and the outcome may not be perfectly
linear. The residuals may not normally distributed. With a large sample
size n=5,000, linear regression can still provide unbiased estimate of
the association.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis? ** There are zero observations with Cook’s Distance >
1, indicating no extremely influential points. In the real analysis, I
would keep all observations in the model, but I might also compare the
results with and without potentially influential points to see if they
meaningfully affect the estimates. —
Task 6: Interpretation for a Public Health Audience (10 points)
Suppose you are writing the results section of a public health paper.
Write a 3–4 sentence paragraph summarizing the findings
from Model C for a non-statistical audience. Your paragraph should:
- Identify which predictors were significantly associated with mental
health days
- State the direction and approximate magnitude of the most important
associations
- Appropriately caveat the cross-sectional nature of the data (no
causal language)
- Not use any statistical jargon (no “significant,” “coefficient,”
“p-value”)
Based on the results of the full model, several factors were
associated with number of poor mental health days in the past
month.People who slept more each night or had higher income levels
tended to report fewer poor mental health days, while females and those
with more physically unhealthy days tend to report more. Engaging in
physical activity in the past 30 days were associated with a small
decrease in poor mental health days. These findings are based on cross
sectional data so they do not imply cause-and-effect relationships.
End of Lab Activity