EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026
In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
#recode for income_cat
brfss_mlr <- brfss_mlr %>%
mutate(
income_cat = factor(
income_cat,
levels = 1:8,
labels = c(
"<$10,000",
"$10,000–<15,000",
"$15,000–<20,000",
"$20,000–<25,000",
"$25,000–<35,000",
"$35,000–<50,000",
"$50,000–<75,000",
"$75,000+"
)
)
)
# Descriptive statistics table
brfss_mlr %>%
select(menthlth_days,physhlth_days, sleep_hrs,age, income_cat, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally Unhealthy Days",
physhlth_days ~ "Physically Unhealthy Days",
sleep_hrs ~ "Sleep (hours/night)",
income_cat ~ "Household Income",
sex ~ "Sex",
exercise ~ "Physical Activity"),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p})"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("Table 1: Descriptive Statistics (BRFSS 2020)")
| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally Unhealthy Days | 5,000 | 3.8 (7.7) |
| Physically Unhealthy Days | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 5,000 | 54.3 (17.2) |
| Household Income | 5,000 | |
| <$10,000 | 190 (3.8) | |
| $10,000–<15,000 | 169 (3.4) | |
| $15,000–<20,000 | 312 (6.2) | |
| $20,000–<25,000 | 434 (8.7) | |
| $25,000–<35,000 | 489 (9.8) | |
| $35,000–<50,000 | 683 (14) | |
| $50,000–<75,000 | 841 (17) | |
| $75,000+ | 1,882 (38) | |
| Sex | 5,000 | |
| Male | 2,331 (47) | |
| Female | 2,669 (53) | |
| Physical Activity | 5,000 | 3,874 (77) |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
#Histogram of menthlth_days
ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(
bins = 30,
color = "white", fill = "steelblue") +
labs(
title = "Histogram of Mentally Unhealthy Days (BRFSS 2020)",
x = "Mentally Unhealthy Days",
y = "Count")
Interpretation:
The distribution is right-skewed. Many respondents report 0 mentally unhealthy days. With the heavy concentration at 0, it’s difficult for the linear regression to satisfy assumptions of normality and equal-variance.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
Mental health days and physical helath days have a moderate positive correlation, meaning individuals reporting more physically unhealhty days tend to have more mentally unhealthy days.
Age and mental health days have a small negative correlation, indicating younger adults report more mentally unhealthy days than older adults.
Hours of sleep has a small negative association with mental health days, meaning individuals who get more hours of sleep tend to report fewer mentally unhealthy days
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
\(\hat{Y} = 9.474 + -0.804 * sleep\)
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_slr)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
Adults who sleep more hours a night tend to report fewer mentally unhealthy days, where each additional hour of sleep is associated with 0.8 fewer mentally unhealthy days.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
H0: b1 = 0 Ha: b1 != 0 Test statistic: t= -10.02 p-value: <2e-16 Degree of Freedom: 4998
The p-value is <2e-16, so we reject the null hypothesis. we can conclude that hours of sleep per night is significantly associated with mentally unhealthy days in this sample.
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercisemA <- lm(menthlth_days ~ sleep_hrs, data= brfss_mlr)
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data= brfss_mlr)
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise, data = brfss_mlr)
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
tribble(
~Model, ~ `Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
"M1 (unadjusted)", round(coef(mA)[2], 3),
paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
round(summary(mA)$adj.r.squared, 3),
"M2 (+age, +sex)", round(coef(mB)[2], 3),
paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
round(summary(mB)$adj.r.squared, 3),
"M3 (full)", round(coef(mC)[2], 3),
paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
round(summary(mC)$adj.r.squared, 3)
) %>%
kable(caption = "Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Model | Sleep β | |95% CI |
Adj. R
|
|---|---|---|---|
| M1 (unadjusted) | -0.804 | (-0.962, -0.647) | 0.020 |
| M2 (+age, +sex) | -0.734 | (-0.889, -0.578) | 0.050 |
| M3 (full) | -0.505 | (-0.653, -0.357) | 0.156 |
Interpretation:
Across the 3 models, the effect of sleep becomes slightly less negative as more covariates. This suggests that physical health days, income, exercise, age, and sex explain some of the relationship between sleep and mental health.
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.
summary(mC)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4299 -3.4508 -1.7554 0.3154 29.7449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.62902 0.81563 15.484 < 2e-16 ***
## sleep_hrs -0.50502 0.07537 -6.700 2.31e-11 ***
## age -0.08245 0.00596 -13.834 < 2e-16 ***
## sexFemale 1.24182 0.20236 6.137 9.09e-10 ***
## physhlth_days 0.29136 0.01358 21.451 < 2e-16 ***
## income_cat$10,000–<15,000 -0.04980 0.75075 -0.066 0.947111
## income_cat$15,000–<20,000 -1.47121 0.65316 -2.252 0.024337 *
## income_cat$20,000–<25,000 -1.80753 0.61835 -2.923 0.003481 **
## income_cat$25,000–<35,000 -2.04393 0.60934 -3.354 0.000801 ***
## income_cat$35,000–<50,000 -1.77637 0.58710 -3.026 0.002493 **
## income_cat$50,000–<75,000 -2.68254 0.57658 -4.653 3.36e-06 ***
## income_cat$75,000+ -2.61495 0.54904 -4.763 1.96e-06 ***
## exerciseYes -0.36319 0.25369 -1.432 0.152315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.088 on 4987 degrees of freedom
## Multiple R-squared: 0.1584, Adjusted R-squared: 0.1563
## F-statistic: 78.19 on 12 and 4987 DF, p-value: < 2.2e-16
$ = 12.629 + -0.505(sleep_hrs) + -0.082(age) + 1.242(sex-female) + 0.291(physhlth_days) + -0.0498(10k - <15k) + -1.471(15k - <20k)+ -1.808 (20k-<25k) + -2.044 (25k - <35k) + -1.776(35k-<50k) + -2.683(50k-<75k) + -2.615(75k+) + -0.363(exercise-yes) $
Intercept (12.629): The represents the expected number of mentally unhealthy days for someone in all reference groups who is age 0, sleeps 0 hours, and has 0 physically unhealthy days. The intercept is not meaningful on its own, since these are not realistic conditions.
Sleep (-0.505): For each additional hour of sleep per night, adults report about half a day fewer mentally unhealthy days.
Age(-0.082): Each additional year of age is associated with slightly fewer mentally unhealthy days.
Sex-Female (1.242): Women report about 1.2 more mentally unhealthy days per month than men.
Physically Unhealthy Days (0.291): Each additional physically unhealthy day is linked to about 0.291 more mentally unhealthy days.
Income: Adults in higher-income groups consistently report fewer mentally unhealthy days. $10k - <15k: -0.0498 $15 - <20k: -1.471 $20k - <25k: -1.808 $25k - <35k: -2.044 $35k - <50k: - 1.776 $50k - <75k: -2.683 $75k: -2.615
Exercise-yes (-0.363): Adults who report regular exercise have slightly fewer mentally unhealthy days.
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
r2_table <- bind_rows(
glance(mA) %>% mutate(Model = "Model A: Sleep only"),
glance(mB) %>% mutate(Model = "Model B: + Age + Sex"),
glance(mC) %>% mutate(Model = "Model C: Full model")
) %>%
select(Model, r.squared, adj.r.squared) %>%
mutate(across(where(is.numeric), ~ round(., 4)))
r2_table %>%
kable(
caption = "Table. R² and Adjusted R² for Models A, B, and C",
col.names = c("Model", "R²", "Adjusted R²")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Model | R² | Adjusted R² |
|---|---|---|
| Model A: Sleep only | 0.0197 | 0.0195 |
| Model B: + Age + Sex | 0.0504 | 0.0498 |
| Model C: Full model | 0.1584 | 0.1563 |
Model A: \(R^2\) = 0.0197, Adjusted \(R^2\) = 0.0195 Model B: \(R^2\) = 0.0504, Adjusted \(R^2\) = 0.0498 Model C: \(R^2\) = 0.1584, Adjusted \(R^2\) = 0.1563
Model C explains the most variance in mentally unhealthy days since it incorporates the strongest predictors.
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
The Root MSE for Model C is 7.088. This indicates that Model C’s predictions differ from actual mentally unhealthy days by about 7 days.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
anova(mC)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.7255 < 2.2e-16 ***
## age 1 6182 6182.2 123.0441 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6557 2.242e-14 ***
## physhlth_days 1 29456 29455.5 586.2487 < 2.2e-16 ***
## income_cat 7 2592 370.2 7.3687 7.870e-09 ***
## exercise 1 103 103.0 2.0495 0.1523
## Residuals 4987 250567 50.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 12 | 47145 | 3928.75 | 78.26 |
| Residual | 4987 | 250567 | 50.2 | |
| Total | 4999 | 297712 |
State the null hypothesis for the overall F-test and your conclusion.
H0: All regression coefficients are equal to zero. The overall F-statistic is 78.26 with a p-value of <2.2e-16, indicating we reject the null hypothesis. Model C explains significantly more variation in mentally unhealthy days than models A and B.
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
par(mfrow = c(2, 2))
plot(mC, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)
Interpretation: The Residuals vs Fitted plot shows a slight curve in the reference line. This indicates that the relationship between all predictors and mentally unhealthy days is not perfectly linear. The Q-Q plot shows clear deviations from the reference line, particularly in the upper tail. This reflects a skewed outcome. Scale-Location plot shows an increasing spread of residuals as fitted values increase, indicating heteroscedasticity. The Cook’s Distance plot shows no influential observations.
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.
Since mentally unhealthy days are bounded at 0 and 30 and strongly right skewed,the assumptions most likely to be strained are normality, equal variance, and linearity. They don’t invalidate the analysis as this is a large sample size so linear regression still provides estimates of association. The predictions at the high end of the scale are less precise.
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?
cooks_d <- cooks.distance(mC)
ggplot(mC, aes(x = seq_along(cooks_d), y= cooks_d)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance",
x = "Observation",
y = "Cook's Distance"
)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
## Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Interpretation: Based on Cook’s Distance, there are no influential observations. If present in an analysis, we would need to inspect for data entry errors, check to see whether they represent unusual but valid cases, and consider sensitivity analyses if they have meaning to the results.
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
End of Lab Activity