EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026
In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
brfss_mlr <- readRDS(
"C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_mlr_2020.rds"
)
# Create a descriptive statistics table
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Household income | 5,000 | |
| <$10k | 190 (3.8%) | |
| $10-15k | 169 (3.4%) | |
| $15-20k | 312 (6.2%) | |
| $20-25k | 434 (8.7%) | |
| $25-35k | 489 (9.8%) | |
| $35-50k | 683 (14%) | |
| $50-75k | 841 (17%) | |
| >$75k | 1,882 (38%) | |
| Sex | 5,000 | |
| Male | 2,331 (47%) | |
| Female | 2,669 (53%) | |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| 1 Mean (SD); n (%) | ||
# Create histogram of menthlth_days
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)
# Create a scatterplot matrix for continuous variables
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling? The histogram of bad mental health
days is heavily right skewed. This goes against the assumption of
normality for linear regression.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome. Physical
health Days and mental health days have a moderate positive correlation.
Sleep and mental health days have a weak negative correlation. Sleep and
physical health days have a weak negative correlation. Mental Health
days and age have a weak negative correlation. Physical health days and
age have a weak positive correlation. Sleep and age have a weak positive
correlation. —
# Fit a simple linear regression model
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
tidy(m1, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"menthlth_days" = "Mental health days",
"sleep_hrs" = "Sleep (hours/night)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Hours of Sleep (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate (β̂ |
Std. Erro
| ||||
|---|---|---|---|---|---|---|
| Intercept | 9.4743 | 0.5771 | 16.4165 | 0 | 8.3429 | 10.6057 |
| Sleep (hours/night) | -0.8042 | 0.0802 | -10.0218 | 0 | -0.9616 | -0.6469 |
# Conduct a slope test
slope_test <- tidy(m1, conf.int = TRUE) %>% filter(term == "sleep_hrs")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
4998,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| Slope (b₁) | -0.8042 |
| SE(b₁) | 0.0802 |
| t-statistic | -10.022 |
| Degrees of freedom | 4998 |
| p-value | <2e-16 |
| 95% CI Lower | -0.9616 |
| 95% CI Upper | -0.6469 |
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation. menthlth_days= 9.4743 -
0.8042(sleep_hrs) 2b. (5 pts) Interpret the slope for
sleep in a single sentence appropriate for a public health audience (no
statistical jargon). Each additional hour of sleep is associated with
0.8042 fewer bad mental health days.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? The null hypothesis is that there is no linear relationship between hours of sleep and number of bad mental health days (slope=0) and the null hypothesis is that there is a linear relationship between hours of sleep and number of bad mental health days (slope does not equal 0). The t-statistic is -10.022, the p-value is <2e-16 and the degrees of freedom are 4998. You reject the null hypothesis that there is no linear relationship between sleep and bad mental health days. The slope is statistically significant, and shows a negative association between sleep and bad mental health days.
# Fit three models
# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: Add sleep
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
# Create comparison table
tribble(
~Model, ~`Mental Health Days β̂`, ~`SE`, ~`p-value`, ~`95% CI`, ~`Adj. R²`,
"M1 (unadjusted)",
round(coef(m1)[2], 3),
round(summary(m1)$coefficients[2,2], 3),
round(summary(m1)$coefficients[2,4], 4),
paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (+age +sex)",
round(coef(m2)[2], 3),
round(summary(m2)$coefficients[2,2], 3),
round(summary(m2)$coefficients[2,4], 4),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (full)",
round(coef(m3)[2], 3),
round(summary(m3)$coefficients[2,2], 3),
round(summary(m3)$coefficients[2,4], 4),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Physical Health Days Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Model | Mental Health Days β |
S
| |||
|---|---|---|---|---|---|
| M1 (unadjusted) | -0.804 | 0.080 | 0 | (-0.962, -0.647) | 0.020 |
| M2 (+age +sex) | -0.734 | 0.079 | 0 | (-0.889, -0.578) | 0.050 |
| M3 (full) | -0.509 | 0.075 | 0 | (-0.657, -0.361) | 0.156 |
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding? The sleep coeffecient does change a good amount when other covariates are included. The coefficient decreases the number of fewer days 1 hour of additional sleep impacts, meaning that the other variables are confounders that enhance the relationship between sleep and bad mental health days. 3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. \(\hat{Mental Health Days}\)= 12.4755 -0.5092(sleep_hrs)-0.0823(age)+1.2451(sex)+0.2917(physhlth_days)-0.3213(income_cat)-0.3427(exercizse) Each additional hour of sleep is associated with 0.5092 fewer bad mental health days, holding everything else constant. Each year increase in age is associated with 0.0823 fewer bad mental health days, holding everything else constant. Compared to males, females experience 1.2451 more bad mental health days on average, holding everything else constant. Each additional physical health day is associated with 0.2917 more bad mental health days, holding everything else constant. Each one unit increase in income level is associated with 0.3213 fewer bad mental health days, holding everything else constant. Compared to those who do not exercise, those who do experience 0.3427 fewer bad mental health days on average, holding everything else constant. —
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Sleep only", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + Sleep + Age", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: Full (+ age, income, sex, exercise, physical days)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| M1: Sleep only | 1 | 0.0197 | 0.0195 |
| M2: + Sleep + Age | 2 | 0.0504 | 0.0498 |
| M3: Full (+ age, income, sex, exercise, physical days) | 6 | 0.1569 | 0.1559 |
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Statistic | Value |
|---|---|
| R² | 0.1569 |
| Adjusted R² | 0.1559 |
| Residual Std. Error (Root MSE) | 7.0901 |
| F-statistic | 154.8953 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n − p − 1) | 4993.0000 |
| n (observations) | 5000.0000 |
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days? The full model (model C) explains the most variance in mental health days.
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy? The root MSE is 7.0901. This tells us how many bad mental
health days the model is off on average. 4c. (10 pts)
Using the ANOVA output for Model C, fill in the following table manually
(i.e., compute the values using the output from anova() or
glance()):
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 6.0000 | 46719 | 7786.5 | 154.8953 |
| Residual | 4993.0000 | 250993 | 50.3 | |
| Total | 4999.0000 | 297712 |
State the null hypothesis for the overall F-test and your conclusion. The null hypothesis for the F-test is that all of the coefficients are equal to 0. There is statistically significant evidence that at least one predictor’s coefficient is not equal to 0, meaning it is associated with the number of poor mental health days. —
par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
par(mfrow = c(1, 1))
# Cook's Distance
augmented <- augment(m3)
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(m3),
influential = ifelse(cooks_d > 1, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 1)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 1, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 1 threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooks
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions. The residuals vs. fitted plot shows a somewhat random distribution of the points, but the red line isn’t exactly horizontal. This tells us that the linearity assumption could be violated.The Q-Q plot shows a deviation from the line, meaning the data are not normally distributed. The scale-location test shows a non-horizontal red line, meaning the equivalence of variances assumptions could be violated. There are no observations about Cook’s distance, meaning there are no influential observations. 5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. The normality, linearity, and equivalent variance assumptions will most likely be violated. The normality violation will unlikely invalidate the results, because it is such a large sample size, the central limit theorem applies. The linearity and equivalence of variance violations are more likely to invalidate the results. The results of these tests can still have give information regarding the relationship between the predictors and bad mental health days. 5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? There are no observations with Cook’s distance > 1. I would remove them in a real analysis. —
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
End of Lab Activity