In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
# Load full BRFSS 2020 data
brfss_full <- read_xpt("data/LLCP2020.XPT") %>%
janitor::clean_names()
# Select variables of interest
brfss_lab <- brfss_full %>%
select(bmi5, age80, sex, physhlth, sleptim1)
# Recode variables
brfss_lab <- brfss_lab %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
select(bmi, age, sex, phys_days, sleep_hrs) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
# Save analytic dataset
saveRDS(brfss_lab, here::here("data/brfss_lab_2020.rds"))data_1 <- readRDS("data/brfss_lab_2020.rds")
data_1b <- readRDS("data/brfss_slr_2020.rds")
# (a) Create a summary table of phys_days and bmi
summary_table <- data_1 %>%
summarise(
N = n(),
Mean_BMI = round(mean(bmi, na.rm = TRUE), 2),
SD_BMI = sd(bmi, na.rm = TRUE),
Lower_BMI = (Mean_BMI - SD_BMI),
Upper_BMI = (Mean_BMI + SD_BMI),
Mean_phys_days = round(mean(phys_days, na.rm = TRUE), 2),
SD_phys_days = sd(phys_days, na.rm = TRUE),
Lower_phys_days = (Mean_phys_days - SD_phys_days),
Upper_phys_days = (Mean_phys_days + SD_phys_days))
summary_table <- summary_table %>%
select(
N,
Mean_BMI,
SD_BMI,
Lower_BMI,
Upper_BMI,
Mean_phys_days,
SD_phys_days,
Lower_phys_days,
Upper_phys_days)
kable(summary_table,
caption = "Bone Marrow Density by Ethnicity",
format = "html") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| N | Mean_BMI | SD_BMI | Lower_BMI | Upper_BMI | Mean_phys_days | SD_phys_days | Lower_phys_days | Upper_phys_days |
|---|---|---|---|---|---|---|---|---|
| 3000 | 29.47 | 6.869134 | 22.60087 | 36.33913 | 12.03 | 11.20904 | 0.820956 | 23.23904 |
# (b) Create a histogram of phys_days — describe the distribution
ggplot(data=data_1, aes(phys_days)) +
geom_histogram(bins=30)# The number of days of poor physical health in the past 30 days is heavily distributed
# around the maximum (30) and minimum (1). This means that the study sample usually has
# very few days with poor health (<5) or nearly every day with poor health.
# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(data_1, aes(y = phys_days, x = bmi)) +
geom_point(alpha = 0.65, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
theme_minimal()+
labs(title = "Scatter plot of phys_days and bmi",
y = "Days with poor physical health",
x = "BMI")Questions:
What is the mean and standard deviation of
phys_days? Of bmi? What do you notice about
the distribution of phys_days? The mean and standard
deviation of phys_days is 12.03 ± 11.21 and BMI is 29.47 ± 6.87. The
number of days of poor physical health in the past 30 days is heavily
distributed around the maximum (30) and minimum (1). This means that the
study sample usually has very few days with poor health (<5) or
nearly every day with poor health.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? There appears to be a slight trend toward more days with poor physical health and higher BMI. There are some outliers for those with extremely high BMI (>50) and few days with poor health (<10), as well as those with extremely low BMI (<10) and many days with poor health (>25) —
# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = data_1)
summary(model_slr)##
## Call:
## lm(formula = phys_days ~ bmi, data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.412 -9.294 -5.379 9.803 20.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.51927 0.89781 8.375 < 2e-16 ***
## bmi 0.15306 0.02967 5.158 2.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared: 0.008798, Adjusted R-squared: 0.008467
## F-statistic: 26.61 on 1 and 2998 DF, p-value: 2.652e-07
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.52 0.898 8.38 8.36e-17 5.76 9.28
## 2 bmi 0.153 0.0297 5.16 2.65e- 7 0.0949 0.211
# (c) Extract and report: slope, intercept, t-statistic, p-value
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.5193 | 0.8978 | 8.3751 | 0 | 5.7589 | 9.2796 |
| bmi | 0.1531 | 0.0297 | 5.1584 | 0 | 0.0949 | 0.2112 |
Questions:
Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). Phys_days = 7.52 + 0.153*BMI
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? For each 1 point increase in BMI, the number of days with poor health is estimated to increase by 0.153 days, on average.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? This is the estimated mean number of days with poor health when BMI = 0. The intercept is not interpretable in this context, but is necessary to anchor the line.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. Yes. The null hypothesis is that there is no linear association between the number of days with poor health and BMI, which can be rejected (p< 0.05) suggesting there is a linear relationship. The test-statistic is 8.38.
# (a) Display the ANOVA table
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: Phys_days ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3314.969 | 3314.969 | 26.609 | 0 |
| Residuals | 2998 | 373487.391 | 124.579 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
augmented <- augment(model_slr)
augmented %>%
select(phys_days, bmi, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Phys_days (Y)", "BMI (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Phys_days (Y) | BMI (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 2 | 23.06 | 11.049 | -9.049 |
| 3 | 19.77 | 10.545 | -7.545 |
| 15 | 23.34 | 11.092 | 3.908 |
| 2 | 29.05 | 11.966 | -9.966 |
| 2 | 23.30 | 11.085 | -9.085 |
| 7 | 24.27 | 11.234 | -4.234 |
| 30 | 31.64 | 12.362 | 17.638 |
| 30 | 32.45 | 12.486 | 17.514 |
| 28 | 28.70 | 11.912 | 16.088 |
| 1 | 26.63 | 11.595 | -10.595 |
n <- nrow(data_1)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3))
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| SS Residual | 373487.390 |
| MSE (σ̂²) |
124.57
|
#SSTotal - 376802.36
#SSRegression - 3314.97
#SSResidual - 373487.39
# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.0088 |
| Adjusted R² | 0.0085 |
| Variance Explained | 0.88% |
Questions:
Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. SSTotal - 376802.36 SSRegression - 3314.97 SSResidual - 373487.39 df (numerator) - 1 df (denominator) - 2998 F-statistic - 26.61
What is the \(R^2\) value? Interpret it in plain English. The r2 value is 0.0088. This tells us the proportion of total variability in days with poor health explained by the linear regression on BMI.Essentially, the r2 value shows how well the regression model explains the variation in days with poor health.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? BMI alone only explains 0.88% of the variation seen in days with poor health. Other factors such as age and other health metrics, such as cardiovascular disease, could help explain more of the variation. —
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(15, 20, 25, 30, 35, 40, 45, 55))
ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
col.names = c("BMI", "Fitted phsy_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| BMI | Fitted phsy_days | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 15 | 9.82 | 8.88 | 10.75 | -12.09 | 31.72 |
| 20 | 10.58 | 9.90 | 11.26 | -11.32 | 32.48 |
| 25 | 11.35 | 10.87 | 11.82 | -10.54 | 33.24 |
| 30 | 12.11 | 11.71 | 12.51 | -9.78 | 34.00 |
| 35 | 12.88 | 12.36 | 13.39 | -9.01 | 34.77 |
| 40 | 13.64 | 12.91 | 14.37 | -8.26 | 35.54 |
| 45 | 14.41 | 13.42 | 15.39 | -7.50 | 36.31 |
| 55 | 15.94 | 14.40 | 17.48 | -6.00 | 37.88 |
#Assuming this question is asking for the fitted days with poor health for someone with a BMI of 25,
#the answer would be 11.35.
# (b) Calculate the 95% prediction interval for a person with BMI = 25
#11.35±0.47 (10.87, 11.82)
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(14, 60, length.out = 200))
ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
p_ci_pi <- ggplot() +
geom_point(data = data_1, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: BMI ~ Age",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "BMI (kg/m²)",
y = "Days with poor health",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_piQuestions:
For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? 11.35±0.47 (10.87, 11.82)
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? (-10.54, 33.24)
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? Prediction intervals are wider because it accounts for the uncertainty of predicting a future individual instead of a sample mean. Confidence intervals should be used to explain sample data, whereas prediction intervals should be used when estimating new data.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)# (b) Create a residuals vs. fitted plot using ggplot
p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. BMI",
subtitle = "Should show no pattern — random scatter around zero",
x = "BMI",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_x# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq# (d) Create a Cook's distance plot
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksQuestions:
Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The residuals vs. fitted line is relatively straight so does not show evidence of heteroscedasticity.
Examine the Q-Q plot. Are the residuals
approximately normal? What do departures from normality in this context
suggest about the distribution of phys_days? The residuals
are not normally distributed. The s-shaped curvature suggests that the
number of days with poor health has outliers on both extremes.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Yes, there are 133 influential observations. I would only remove points with a cooks distance above 1.0, of which there are none.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) The LINE assumptions do not appear to be met. The errors are not normally distributed, as evidenced by the qq plot.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_slr_2 <- lm(phys_days ~ age, data = data_1)
summary(model_slr_2)##
## Call:
## lm(formula = phys_days ~ age, data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.354 -8.911 -4.432 10.992 22.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.33587 0.68259 6.352 2.45e-10 ***
## age 0.13773 0.01168 11.789 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared: 0.04431, Adjusted R-squared: 0.04399
## F-statistic: 139 on 1 and 2998 DF, p-value: < 2.2e-16
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.34 0.683 6.35 2.45e-10 3.00 5.67
## 2 age 0.138 0.0117 11.8 2.16e-31 0.115 0.161
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.52 0.898 8.38 8.36e-17 5.76 9.28
## 2 bmi 0.153 0.0297 5.16 2.65e- 7 0.0949 0.211
tidy(model_slr_2, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3359 | 0.6826 | 6.3521 | 0 | 2.9975 | 5.6743 |
| age | 0.1377 | 0.0117 | 11.7893 | 0 | 0.1148 | 0.1606 |
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.5193 | 0.8978 | 8.3751 | 0 | 5.7589 | 9.2796 |
| bmi | 0.1531 | 0.0297 | 5.1584 | 0 | 0.0949 | 0.2112 |
# (c) Which predictor has the stronger association? Compare R² values.
r_sq_2 <- summary(model_slr_2)$r.squared
r_sq <- summary(model_slr)$r.squared
#Age and days with poor health has a stronger association (r2=0.0443) than bmi and days with poor health (r2=0.0088).Questions:
How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Age and days with poor health has a stronger association (r2=0.0443) than bmi and days with poor health (r2=0.0088). Both are significant direct associations, with number of days with poor health increasing with age and bmi.
Compare the \(R^2\) values of
the two models. Which predictor explains more variability in
phys_days? Age and days with poor health has a stronger
association (r2=0.0443) than bmi and days with poor health (r2=0.0088).
Age explains 4.4% of the variability of days with poor health.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? There are many factors involved in physical health. A multiple linear regression model would allow for a more robust analysis by accounting for multiple variables.