In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
library(plotly)
library (gtsummary)
# Load BRFSS 2020 analytic dataset
brfss_slr <- read_rds("/Users/nataliasmall/Downloads/EPI 553/brfss_slr_2020.rds") %>%
janitor::clean_names()
# Display dataset dimensions
names(brfss_slr)## [1] "bmi" "age" "sex" "education"
## [5] "gen_health_num" "sleep_hrs" "phys_days"
# (a) Create a summary table of phys_days and bmi
summ_table <- brfss_slr %>%
select(bmi, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: phys_days and bmi") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
summ_table| bmi | phys_days | |
|---|---|---|
| Min. :14.63 | Min. : 1.00 | |
| 1st Qu.:24.32 | 1st Qu.: 2.00 | |
| Median :27.89 | Median : 6.00 | |
| Mean :29.18 | Mean :11.66 | |
| 3rd Qu.:32.89 | 3rd Qu.:20.00 | |
| Max. :59.60 | Max. :30.00 |
# Descriptive statistics with SD and Mean
brfss_slr %>%
select(bmi, phys_days) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m²)",
phys_days ~ "Poor Physical Health Days"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")| Characteristic | N | N = 3,0001 |
|---|---|---|
| BMI (kg/m²) | 3,000 | 29.2 (7.0) |
| Poor Physical Health Days | 3,000 | 11.7 (11.2) |
| 1 Mean (SD) | ||
# (b) Create a histogram of phys_days — describe the distribution
p_hist <- ggplot(brfss_slr, aes(x = phys_days)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
labs(
title = "Distribution of Poor Physical Health Days",
subtitle = "Number of Poor Physical Health Days",
x = "Poor Physical Health Days",
y = "Frequency"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_slr, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Poor Physical Health Days vs. BMI (BRFSS 2020)",
subtitle = "Red = Linear fit | Blue Dashed = LOESS smoother",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr)
summary(model_slr)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "bmi")
n <- nrow(brfss_slr)
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0): From SLR model, phys_days ~ bmi") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Slope (b₁) | 0.1451 |
| SE(b₁) | 0.0289 |
| t-statistic | 5.013 |
| Degrees of freedom | 2998 |
| p-value | 5.66e-07 |
| 95% CI Lower | 0.0884 |
| 95% CI Upper | 0.2019 |
Questions:
Plain English: For every unit increase in BMI, it is assumed that people will report about a 0.15 increase in number of poor physical health days.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? Intercept (𝑏0 = 7.4228): The estimated mean phys_days when bmi = 0. This is a mathematical artifact — no living human being can have a bmi = 0, so no amount of physical health days would be reasonable. The intercept is not directly interpretable in this context, but is necessary to anchor the line.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.
# (a) Display the ANOVA table
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: phys_days ~ bmi",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
# Augment dataset
augmented <- augment(model_slr)
n <- nrow(brfss_slr)
ss_total <- sum((brfss_slr$phys_days - mean(brfss_slr$phys_days))^2)
ss_reg <- sum((augmented$.fitted - mean(brfss_slr$phys_days))^2)
ss_resid <- sum(augmented$.resid^2)
tibble(
Quantity = c("SSTotal", "SSRegression", "SSResidual"),
Value = c(round(ss_total, 2), round(ss_reg, 3), round(ss_resid, 3)),
) %>%
kable(caption = "ANOVA Decomposition") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| SSTotal | 373517.110 |
| SSRegression | 3105.365 |
| SSResidual | 370411.743 |
# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
ss_r_sq <- ss_reg / ss_total
tibble(
Metric = c("R²", "Adjusted R²", "SS decomposition R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
round(ss_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "Model R², Adjusted Model R², and SS decomposition R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.0083 |
| Adjusted R² | 0.008 |
| SS decomposition R² | 0.0083 |
| Variance Explained | 0.83% |
Questions:
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(25))
ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals for BMI = 25",
col.names = c("BMI", "Fitted phys_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| BMI | Fitted phys_days | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 11.05 | 10.59 | 11.51 | -10.75 | 32.85 |
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(min(brfss_slr$bmi), max(brfss_slr$bmi), length.out = 200))
ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_slr, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_piQuestions:
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)# (b) Create a residuals vs. fitted plot using ggplot
augmented <- augment(model_slr)
p_resid_fit <- ggplot(augmented, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Fitted phys_days",
subtitle = "Should show no pattern — random scatter around zero",
x = "Fitted Poor Physical Health Days",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_fit# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq# (d) Create a Cook's distance plot
n <- nrow(brfss_slr)
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksQuestions:
Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. There is evidence of non linearity. The LOESS smooth line clearly shows a curve indicating a non-linear relationship between BMI and poor physical health days.
Examine the Q-Q plot. Are the residuals
approximately normal? What do departures from normality in this context
suggest about the distribution of phys_days? Points deviate
from the line, especially in the upper and lower tails, creating a
S-shape curve significantly non-normality. Departures from normality in
this context suggest that the distribution of number of poor physical
health days is highly skewed.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Yes, there are 136 potentially influential observations. You could transform them using Log or square root. You could also assess the 136 data points to determine if they are valid or errors made during entry.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) No, LINE assumptions do not appear to be met. Normality and linearity assumptions are violated. Normality and Linearity assumptions may be the most problematic for this model since the outcome variable is bounded and highly skewed.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_slr)
summary(model_age)##
## Call:
## lm(formula = phys_days ~ age, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.872 -8.803 -4.733 9.460 23.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.37023 0.66608 6.561 6.27e-11 ***
## age 0.13127 0.01145 11.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared: 0.04202, Adjusted R-squared: 0.0417
## F-statistic: 131.5 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
# b.1 display results
tidy(model_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
# b.2 compare to the BMI model
r2_bmi <- summary(model_slr)$r.squared
r2_age <- summary(model_age)$r.squared
comp_tbl <- tibble(
Model = c("BMI model", "Age model"),
Predictor = c("BMI", "Age"),
R_squared = round(c(r2_bmi, r2_age), 4),
Variance_Explained = paste0(round(c(r2_bmi, r2_age) * 100, 2), "%")
) %>%
kable(caption = "Comparison: Age Model vs BMI Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
comp_tbl| Model | Predictor | R_squared | Variance_Explained |
|---|---|---|---|
| BMI model | BMI | 0.0083 | 0.83% |
| Age model | Age | 0.0420 | 4.2% |
# (c) Which predictor has the stronger association? Compare R² values.
r2_age <- summary(model_age)$r.squared
r2_age## [1] 0.04202056
## [1] 0.008313849
Questions:
How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Both age and BMI have positive and statistical significant relationships with poor physical health days (p < 0.001). But, Age has a stronger magnitude and statistical significance. Age also has a slope of 0.1313. Age had a much higher F-Statistic, 131.5, than BMI, 25.134. Also, age has a higher R² value, 0.0420, than BMI, 0.0083.
Compare the \(R^2\) values of
the two models. Which predictor explains more variability in
phys_days?
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 plotly_4.12.0 broom_1.0.8 kableExtra_1.4.0
## [5] knitr_1.50 here_1.0.2 haven_2.5.4 lubridate_1.9.4
## [9] forcats_1.0.0 stringr_1.5.1 dplyr_1.2.0 purrr_1.0.4
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.2
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.9.0 htmlwidgets_1.6.4
## [5] lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2 vctrs_0.7.1
## [9] tools_4.5.2 generics_0.1.4 pkgconfig_2.0.3 Matrix_1.7-4
## [13] data.table_1.17.4 RColorBrewer_1.1-3 S7_0.2.1 gt_1.3.0
## [17] lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2 textshaping_1.0.1
## [21] janitor_2.2.1 snakecase_0.11.1 litedown_0.9 htmltools_0.5.8.1
## [25] sass_0.4.10 yaml_2.3.10 lazyeval_0.2.2 pillar_1.10.2
## [29] jquerylib_0.1.4 cachem_1.1.0 nlme_3.1-168 commonmark_2.0.0
## [33] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 splines_4.5.2
## [37] labeling_0.4.3 rprojroot_2.1.1 fastmap_1.2.0 grid_4.5.2
## [41] cli_3.6.5 magrittr_2.0.3 cards_0.7.1 withr_3.0.2
## [45] scales_1.4.0 backports_1.5.0 timechange_0.3.0 rmarkdown_2.29
## [49] httr_1.4.7 hms_1.1.3 evaluate_1.0.3 viridisLite_0.4.2
## [53] mgcv_1.9-3 markdown_2.0 rlang_1.1.7 glue_1.8.0
## [57] xml2_1.3.8 svglite_2.2.2 rstudioapi_0.17.1 jsonlite_2.0.0
## [61] R6_2.6.1 systemfonts_1.3.1 fs_1.6.6