In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.
Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:
In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.
In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
# Load packages
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
library(haven)
library(janitor)
library(plotly)
library(GGally)
library(here)
library(lmtest)
library(corrplot)
# Load dataset
lab_rds_path <- "~/R/site-library/Epi553/code/brfss_slr_2020_lab.rds"
if (file.exists(lab_rds_path)) {
brfss_lab <- read_rds(lab_rds_path)
} else {
brfss_lab <- read_rds(raw_rds_path) %>%
clean_names()
set.seed(553)
brfss_lab <- brfss_lab %>%
select(bmi, age, sex, phys_days, sleep_hrs) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
saveRDS(brfss_lab, lab_rds_path)
}# (a) Create a summary table of phys_days and bmi
brfss_lab %>%
select(phys_days, bmi) %>%
summary() %>%
kable(caption = "Descriptive Statistics: phys_days and bmi") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| phys_days | bmi | |
|---|---|---|
| Min. : 1.00 | Min. :14.63 | |
| 1st Qu.: 2.00 | 1st Qu.:24.32 | |
| Median : 6.00 | Median :27.89 | |
| Mean :11.66 | Mean :29.18 | |
| 3rd Qu.:20.00 | 3rd Qu.:32.89 | |
| Max. :30.00 | Max. :59.60 |
# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_lab, aes(x = phys_days)) +
geom_histogram(bins = 31, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Poor Physical Health Days (Past 30 Days)",
subtitle = "BRFSS 2020, n = 3,000",
x = "Number of Poor Physical Health Days",
y = "Count"
) +
theme_minimal(base_size = 13)# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "darkgreen", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "purple", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Poor Physical Health Days vs. BMI (BRFSS 2020)",
subtitle = "Red = Linear fit | Blue dashed = LOESS smoother",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days (past 30)"
) +
theme_minimal(base_size = 13)Questions:
What is the mean and standard deviation of
phys_days? Of bmi? What do you notice about
the distribution of phys_days? #The distribution of poor
physical health days is strongly right-skewed, with a large spike at 0
days and another concentration at 30 days, indicating a ceiling
effect.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? #The scatterplot shows a weak positive linear trend between BMI and poor physical health days. —
# (a) Fit the SLR model: phys_days ~ bmi
model_lab <- lm(phys_days ~ bmi, data = brfss_lab)
summary(model_lab)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_lab, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
b0_lab <- round(coef(model_lab)[1], 3)
b1_lab <- round(coef(model_lab)[2], 4)
slope_info <- tidy(model_lab, conf.int = TRUE) %>%
filter(term == "bmi")
tibble(
Quantity = c("Intercept (b₀)", "Slope (b₁)", "t-statistic", "p-value",
"95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_info$estimate[1], 4),
round(b1_lab, 4),
round(slope_info$statistic, 3),
format.pval(slope_info$p.value, digits = 3),
round(slope_info$conf.low, 4),
round(slope_info$conf.high, 4)
)
) %>%
kable(caption = "Key Model Statistics: phys_days ~ bmi") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Intercept (b₀) | 0.1451 |
| Slope (b₁) | 0.1451 |
| t-statistic | 5.013 |
| p-value | 5.66e-07 |
| 95% CI Lower | 0.0884 |
| 95% CI Upper | 0.2019 |
Questions:
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? #For each 1-unit increase in BMI, the mean number of poor physical health days increases by approximately 0.145 days, on average.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? #The intercept represents the predicted number of poor physical health days when BMI equals 0. Because a BMI of 0 is not realistic, the intercept is not practically interpretable.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.
# (a) Display the ANOVA table
anova_lab <- anova(model_lab)
anova_lab %>%
kable(
caption = "ANOVA Table: phys_days ~ bmi",
digits = 3,
col.names = c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
augmented_lab <- augment(model_lab)
ss_total <- sum((brfss_lab$phys_days - mean(brfss_lab$phys_days))^2)
ss_resid <- sum(augmented_lab$.resid^2)
ss_reg <- ss_total - ss_resid
n_lab <- nrow(brfss_lab)
tibble(
Component = c("SS Total (SST)", "SS Regression (SSReg)", "SS Residual (SSE)"),
Formula = c("Σ(Yᵢ − Ȳ)²", "Σ(Ŷᵢ − Ȳ)²", "Σ(Yᵢ − Ŷᵢ)²"),
Value = c(round(ss_total, 2), round(ss_reg, 2), round(ss_resid, 2)),
df = c(n_lab - 1, 1, n_lab - 2)
) %>%
kable(caption = "ANOVA Decomposition: Manual Calculation") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Component | Formula | Value | df |
|---|---|---|---|
| SS Total (SST) | Σ(Yᵢ − Ȳ)² | 373517.11 | 2999 |
| SS Regression (SSReg) | Σ(Ŷᵢ − Ȳ)² | 3105.36 | 1 |
| SS Residual (SSE) | Σ(Yᵢ − Ŷᵢ)² | 370411.74 | 2998 |
# (c) Compute R² two ways: from the model object and from the SS decomposition
r2_model <- summary(model_lab)$r.squared
r2_manual <- ss_reg / ss_total
tibble(
Method = c("From model object (summary()$r.squared)", "From SS decomposition (SSReg / SSTotal)"),
R2 = c(round(r2_model, 6), round(r2_manual, 6)),
Match = c("—", as.character(round(r2_model, 6) == round(r2_manual, 6)))
) %>%
kable(caption = "R² Computed Two Ways") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Method | R2 | Match |
|---|---|---|
| From model object (summary()$r.squared) | 0.008314 | — |
| From SS decomposition (SSReg / SSTotal) | 0.008314 | TRUE |
Questions:
Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. #The F-test indicates that the regression model is statistically significant.
What is the \(R^2\) value? Interpret it in plain English. # the BMI explains very little of the differences in poor physical health days among individuals over 99% of the variation is due to other factors not included in this simple model.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? # the r^2 value of 0.0083 indicates that BMI alone explains less than 1% of the variation in poor physical health days. —
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi25 <- data.frame(bmi = 25)
ci_bmi25 <- predict(model_lab, newdata = new_bmi25, interval = "confidence") %>%
as.data.frame()
ci_bmi25 %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "Fitted Value and 95% CI for Mean phys_days at BMI = 25",
col.names = c("Fitted (Mean Estimate)", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Fitted (Mean Estimate) | CI Lower | CI Upper |
|---|---|---|
| 11.051 | 10.588 | 11.514 |
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_bmi25 <- predict(model_lab, newdata = new_bmi25, interval = "prediction") %>%
as.data.frame()
pi_bmi25 %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "95% Prediction Interval for an Individual with BMI = 25",
col.names = c("Fitted (Point Estimate)", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Fitted (Point Estimate) | PI Lower | PI Upper |
|---|---|---|
| 11.051 | -10.749 | 32.851 |
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(15, 60, length.out = 300))
ci_band_lab <- predict(model_lab, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band_lab <- predict(model_lab, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
ggplot() +
geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band_lab, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.30) +
geom_ribbon(data = ci_band_lab, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.45) +
geom_line(data = ci_band_lab, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: phys_days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days (past 30)",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)Questions:
#This interval represents uncertainty around the estimated mean for individuals with BMI = 25. b) If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? #This interval is much wider because it reflects variability in individual outcomes, not just uncertainty in the mean.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_lab, which = 1:4,
col = adjustcolor("blue3", 0.4),
pch = 19, cex = 0.6)par(mfrow = c(1, 1))
# (b) Create a residuals vs. fitted plot using ggplot
augmented_lab <- augmented_lab %>%
mutate(obs_num = row_number())
ggplot(augmented_lab, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "green", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Fitted Values",
subtitle = "Red line = zero | Orange = LOESS smoother",
x = "Fitted Values (Predicted phys_days)",
y = "Residuals"
) +
theme_minimal(base_size = 13)# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(augmented_lab, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)# (d) Create a Cook's distance plot
augmented_lab <- augmented_lab %>%
mutate(
cooks_d = cooks.distance(model_lab),
influential = ifelse(cooks_d > 4 / n_lab, "Potentially influential", "Not influential")
)
n_influential_lab <- sum(augmented_lab$cooks_d > 4 / n_lab)
ggplot(augmented_lab, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n_lab, linetype = "dashed",
color = "black", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential_lab, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")Questions:
#The LOESS smoother shows mild curvature, suggesting some nonlinearity in the relationship. There is also evidence of unequal spread across fitted values, indicating potential heteroscedasticity.
Examine the Q-Q plot. Are the residuals
approximately normal? What do departures from normality in this context
suggest about the distribution of phys_days? #The Q-Q plot
shows substantial deviation from the reference line, particularly in the
tails, indicating that residuals are not normally distributed.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? #There are 136 observations exceeding the Cook’s Distance threshold of 4/n. While these observations may influence the model, they should not be removed without further investigation or sensitivity analysis.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) #The independence assumption likely holds. However, the most problematic assumption is normality, given that poor physical health days is a bounded count variable, which is not well approximated by a normal distribution
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_lab)
summary(model_age)##
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.872 -8.803 -4.733 9.460 23.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.37023 0.66608 6.561 6.27e-11 ***
## age 0.13127 0.01145 11.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared: 0.04202, Adjusted R-squared: 0.0417
## F-statistic: 131.5 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
tidy(model_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "SLR: phys_days ~ Age",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
# (c) Which predictor has the stronger association? Compare R² values.
r2_bmi <- summary(model_lab)$r.squared
r2_age <- summary(model_age)$r.squared
tibble(
Model = c("phys_days ~ BMI", "phys_days ~ Age"),
R_squared = c(round(r2_bmi, 4), round(r2_age, 4)),
Adj_R2 = c(round(summary(model_lab)$adj.r.squared, 4),
round(summary(model_age)$adj.r.squared, 4)),
AIC = c(round(AIC(model_lab), 1), round(AIC(model_age), 1))
) %>%
kable(
caption = "Model Comparison: BMI vs. Age as Predictor of phys_days",
col.names = c("Model", "R²", "Adj. R²", "AIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | R² | Adj. R² | AIC |
|---|---|---|---|
| phys_days ~ BMI | 0.0083 | 0.0080 | 22967.6 |
| phys_days ~ Age | 0.0420 | 0.0417 | 22863.9 |
Questions:
#Both BMI and age show positive associations with poor physical health days. However, age has a stronger association (R² = 0.0420) compared to BMI (R² = 0.0083), and a much larger F-statistic, indicating stronger statistical evidence.
Compare the \(R^2\) values of
the two models. Which predictor explains more variability in
phys_days? #Age explains approximately 4.2% of the
variability in poor physical health days, while BMI explains only
0.83%.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? #Both age and BMI are statistically significant predictors of poor physical health days, but neither explains much variability when modeled alone. The outcome variable is bounded and non-normally distributed, violating key linear regression assumptions.
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
Due: Before the next class session.
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.92 lmtest_0.9-40 zoo_1.8-12 here_1.0.1
## [5] GGally_2.2.1 plotly_4.10.4 janitor_2.2.0 haven_2.5.4
## [9] ggeffects_1.5.0 gtsummary_1.7.2 kableExtra_1.4.0 knitr_1.45
## [13] broom_1.0.5 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [17] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [21] tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 viridisLite_0.4.2 farver_2.1.1
## [4] fastmap_1.1.1 lazyeval_0.2.2 broom.helpers_1.14.0
## [7] digest_0.6.34 timechange_0.3.0 lifecycle_1.0.4
## [10] ellipsis_0.3.2 magrittr_2.0.3 compiler_4.3.2
## [13] rlang_1.1.3 sass_0.4.8 tools_4.3.2
## [16] utf8_1.2.4 yaml_2.3.8 gt_0.10.1
## [19] data.table_1.15.99 labeling_0.4.3 htmlwidgets_1.6.4
## [22] plyr_1.8.9 xml2_1.3.6 RColorBrewer_1.1-3
## [25] withr_3.0.0 grid_4.3.2 fansi_1.0.6
## [28] colorspace_2.1-0 scales_1.3.0 insight_0.19.8
## [31] cli_3.6.2 rmarkdown_2.25 generics_0.1.3
## [34] rstudioapi_0.15.0 httr_1.4.7 tzdb_0.4.0
## [37] cachem_1.0.8 splines_4.3.2 vctrs_0.6.5
## [40] Matrix_1.6-1.1 jsonlite_1.8.8 hms_1.1.3
## [43] systemfonts_1.0.5 jquerylib_0.1.4 glue_1.7.0
## [46] ggstats_0.5.1 stringi_1.8.3 gtable_0.3.4
## [49] munsell_0.5.0 pillar_1.9.0 htmltools_0.5.7
## [52] R6_2.5.1 rprojroot_2.0.4 evaluate_0.23
## [55] lattice_0.21-9 highr_0.10 backports_1.4.1
## [58] snakecase_0.11.1 bslib_0.6.1 Rcpp_1.0.12
## [61] svglite_2.1.3 nlme_3.1-163 mgcv_1.9-0
## [64] xfun_0.42 pkgconfig_2.0.3