In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
# Load packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.4.3
library(here)
## Warning: package 'here' was built under R version 4.4.3
## here() starts at C:/Users/sarah/OneDrive/Documents/EPI 553
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
"/Users/sarah/OneDrive/Documents/EPI 553/data/LLCP2020.XPT"
) %>%
janitor::clean_names()
# Select variables of interest
brfss_lab <- brfss_full %>%
select(bmi5, age80, sex, physhlth, sleptim1)
# Recode variables
brfss_lab <- brfss_lab %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
select(bmi, age, sex, phys_days, sleep_hrs) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
# Save analytic dataset
saveRDS(brfss_lab, here::here(
"data", "brfss_lab_2020.rds"
))
# (a) Create a summary table of phys_days and bmi
summary_table <- brfss_lab %>%
summarise(
n_phys = sum(!is.na(phys_days)),
mean_phys = mean(phys_days),
sd_phys = sd(phys_days),
median_phys = median(phys_days),
min_phys = min(phys_days),
max_phys = max(phys_days),
n_bmi = sum(!is.na(bmi)),
mean_bmi = mean(bmi),
sd_bmi = sd(bmi),
median_bmi = median(bmi),
min_bmi = min(bmi),
max_bmi= max(bmi)
)
print(summary_table)
## # A tibble: 1 × 12
## n_phys mean_phys sd_phys median_phys min_phys max_phys n_bmi mean_bmi sd_bmi
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 3000 12.0 11.2 7 1 30 3000 29.5 6.87
## # ℹ 3 more variables: median_bmi <dbl>, min_bmi <dbl>, max_bmi <dbl>
# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_lab, aes(x=phys_days)) +
geom_histogram(bins = 30, color = "white", fill = "steelblue") +
labs(
title = "Distribution of Poor Physical Health Days",
x = "Days of Poor Physical Health (past 30 days)",
y = "Count"
)
# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Scatter Plot of BMI vs Poor Physical Health Days",
x = "BMI",
y = "Poor Physical Health Days"
)+
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?The mean of phys_days is 12 and the standard deviation
is 11.2. The mean bmi is 29.5 with a standard deviation of 6.87. The
distribution of poor physical health days shows two peaks one with a low
number of poor physical health days and a spike at 30 days. The count
does taper off as the days increase other than the spike at 30 days.
# (a) Fit the SLR model: phys_days ~ bmi
slr_model <- lm(phys_days ~ bmi, data= brfss_lab)
summary(slr_model)
##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.412 -9.294 -5.379 9.803 20.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.51927 0.89781 8.375 < 2e-16 ***
## bmi 0.15306 0.02967 5.158 2.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared: 0.008798, Adjusted R-squared: 0.008467
## F-statistic: 26.61 on 1 and 2998 DF, p-value: 2.652e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(slr_model, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.5193 | 0.8978 | 8.3751 | 0 | 5.7589 | 9.2796 |
| bmi | 0.1531 | 0.0297 | 5.1584 | 0 | 0.0949 | 0.2112 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
b0 <- round(coef(slr_model) [1], 3)
b1 <- round(coef(slr_model) [2], 4)
Questions:
\(\hat{Y} = 7.519 + 0.153 * BMI\)
The slope, 0.1531, indicates that for each unit increase in BMI, there will be an increase of about .15 poor physical health days.
The intercept is not interpretable. It represents the predicted number of poor physical health days when BMI = 0. A BMI of 0 is not possible, therefore the intercept doesn’t have a meaningful real-world interpretation.
H0: b1 = 0 Ha: b1 != 0 Test statistic: t= 5.1584 p-value: <.0001
# (a) Display the ANOVA table
slr_anova <- anova(slr_model)
slr_anova %>%
kable (
caption = "ANOVA Table: Days of Poor Physical Health ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3314.969 | 3314.969 | 26.609 | 0 |
| Residuals | 2998 | 373487.391 | 124.579 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
ss_regression <- slr_anova["bmi", "Sum Sq"]
ss_residual <- slr_anova["Residuals", "Sum Sq"]
ss_total <- ss_regression + ss_residual
# (c) Compute R² two ways: from the model object and from the SS decomposition
summary(slr_model)$r.squared
## [1] 0.008797634
r_squared <- ss_regression / ss_total
Questions:
\(SS_{Total}\) : 376802.36 \(SS_{Regression}\) : 3314.969 \(SS_{Residual}\) : 373487.391 \(df\) : 1 \(F\) : 26.609 b) What is the \(R^2\) value? Interpret it in plain English.
The R^2 value is 0.0088. This indicates that BMI explains less than 1% of variance in poor physical health days, meaning The association is statistically significant, but the effect size is small.
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(25))
predict(slr_model, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
## Fitted CI_Lower CI_Upper
## 1 11.34566 10.86895 11.82236
# (b) Calculate the 95% prediction interval for a person with BMI = 25
predict(slr_model, newdata = new_bmi, interval = "prediction")
## fit lwr upr
## 1 11.34566 -10.54449 33.2358
# (c) Plot the regression line with both the CI band and PI band
bmi_seq <- data.frame(bmi = seq(min(brfss_lab$bmi), max(brfss_lab$bmi), length.out = 200))
ci_band <- predict(slr_model, newdata = bmi_seq, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_seq)
pi_band <- predict(slr_model, newdata = bmi_seq, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_seq)
p_ci_pi <- ggplot() +
geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: phys_days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_pi
Questions:
For someone with a BMI of 25, the estimated mean number of poor physical health days is 11.35 days. The 95% confidence interval for this mean is [10.87, 11.82].
For a specific new person with a BMI of 25, the 95% confidence interval for their number of poor physical health days is [-10.54, 33.24].
CI is used when estimating the average outcome at a given X, like in part (a) when estimating the average number of poor physical health days for someone with a BMI of 25. PI would be used to predict the individual outcome at X, like predicting the actual number of poor physical health days for a new person with a BMI of 25.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2,2))
plot(slr_model, which = 1:4)
# (b) Create a residuals vs. fitted plot using ggplot
augmented <- augment(slr_model)
ggplot(augmented, aes( x= .fitted, y= .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals"
)
## `geom_smooth()` using formula = 'y ~ x'
# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(augmented, aes(sample = .resid)) +
geom_qq(color = "steelblue", size = 2)+
geom_qq_line(color = "red", linetype = "dashed") +
labs(
title = "Q-Q Plot of Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
)
# (d) Create a Cook's distance plot
ggplot(augmented, aes(x = seq_along(.cooksd), y= .cooksd)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance",
x = "Observation",
y = "Cook's Distance"
)
Questions:
Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The plot shows evidence of nonlinearity as the smooth curve bends meaning the linear model doesn’t fully capture the relationship between BMI and poor physical health days. There is also evidence of heteroscedasticity where the residuals are tightly clustered at lower fitted values, spreading out as the fitted values increase.
Examine the Q-Q plot. Are the residuals
approximately normal? What do departures from normality in this context
suggest about the distribution of phys_days?
The middle points seem to follow the reference line, but the tails deviate away from the line. The deviations indicate heavier tails and some skewness. The distribution of phys_days is right-skewed with a spike at 30 days, suggesting a non normal distribution.
There were no influential observations. All values were extremely small, no data points need to be adjusted.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
slr_model_age <- lm(phys_days ~ age, data = brfss_lab)
summary(slr_model_age)
##
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.354 -8.911 -4.432 10.992 22.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.33587 0.68259 6.352 2.45e-10 ***
## age 0.13773 0.01168 11.789 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared: 0.04431, Adjusted R-squared: 0.04399
## F-statistic: 139 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
#Results for age model
tidy(slr_model_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ Age",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3359 | 0.6826 | 6.3521 | 0 | 2.9975 | 5.6743 |
| age | 0.1377 | 0.0117 | 11.7893 | 0 | 0.1148 | 0.1606 |
#Results for BMI model
tidy(slr_model, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.5193 | 0.8978 | 8.3751 | 0 | 5.7589 | 9.2796 |
| bmi | 0.1531 | 0.0297 | 5.1584 | 0 | 0.0949 | 0.2112 |
# (c) Which predictor has the stronger association? Compare R² values.
r2_age <- summary(slr_model_age)$r.squared
r2_bmi <- summary(slr_model)$r.squared
r2_age
## [1] 0.04430638
r2_bmi
## [1] 0.008797634
Questions:
Both predictors have a positive association with poor physical health
days. Age and BMI move in the same direction based on the slope. The
slope sizes are similar. For age the slope was 0.138 and bmi slope was
0.153. Both are statistically significant but the t-value for age was
11.79, much higher than 5.16 for bmi. b) Compare the \(R^2\) values of the two models. Which
predictor explains more variability in phys_days?
The r^2 value for age was 0.044, whereas the r^2 for BMI was 0.0088. Age explains more variability as it accounts for 4.4% of variance. BMI only accounts for 0.88%.
Based on the two models, age is the stronger predictor because it explains more variation in poor physical health days than BMI. Both predictors explain little variation indicating other factors have a larger effect on poor physical health days. Limitations of the simple linear regression was phys_day was right-skewed causing non-normal residuals and nonlinearity.
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
Due: Before the next class session.