Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)

# Load full BRFSS 2020 data
brfss_full <- read_xpt("data/LLCP2020.XPT") %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here("data/brfss_lab_2020.rds"))

Task 1: Explore the Variables (15 points)

data_1 <- readRDS("data/brfss_lab_2020.rds")
data_1b <- readRDS("data/brfss_slr_2020.rds")

# (a) Create a summary table of phys_days and bmi
summary_table <- data_1 %>%
  summarise(
      N = n(),
      Mean_BMI = round(mean(bmi, na.rm = TRUE), 2),
      SD_BMI = sd(bmi, na.rm = TRUE),
      Lower_BMI = (Mean_BMI - SD_BMI),
      Upper_BMI = (Mean_BMI + SD_BMI),
      Mean_phys_days = round(mean(phys_days, na.rm = TRUE), 2),
      SD_phys_days = sd(phys_days, na.rm = TRUE),
      Lower_phys_days = (Mean_phys_days - SD_phys_days),
      Upper_phys_days = (Mean_phys_days + SD_phys_days))

summary_table <- summary_table %>%
  select(
    N,
    Mean_BMI,
    SD_BMI,
    Lower_BMI,
    Upper_BMI,
    Mean_phys_days,
    SD_phys_days,
    Lower_phys_days,
    Upper_phys_days)

kable(summary_table, 
      caption = "Bone Marrow Density by Ethnicity",
      format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE)

Bone Marrow Density by Ethnicity
N	Mean_BMI	SD_BMI	Lower_BMI	Upper_BMI	Mean_phys_days	SD_phys_days	Lower_phys_days	Upper_phys_days
3000	29.47	6.869134	22.60087	36.33913	12.03	11.20904	0.820956	23.23904

# (b) Create a histogram of phys_days — describe the distribution
ggplot(data=data_1, aes(phys_days)) +
  geom_histogram(bins=30)

# The number of days of poor physical health in the past 30 days is heavily distributed 
# around the maximum (30) and minimum (1). This means that the study sample usually has 
# very few days with poor health (<5) or nearly every day with poor health.


# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(data_1, aes(y = phys_days, x = bmi)) +
  geom_point(alpha = 0.65, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  theme_minimal()+
  labs(title = "Scatter plot of phys_days and bmi",
       y = "Days with poor physical health",
       x = "BMI")

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? The mean and standard deviation of phys_days is 12.03 ± 11.21 and BMI is 29.47 ± 6.87. The number of days of poor physical health in the past 30 days is heavily distributed around the maximum (30) and minimum (1). This means that the study sample usually has very few days with poor health (<5) or nearly every day with poor health.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? There appears to be a slight trend toward more days with poor physical health and higher BMI. There are some outliers for those with extremely high BMI (>50) and few days with poor health (<10), as well as those with extremely low BMI (<10) and many days with poor health (>25) —

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = data_1)

summary(model_slr)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    7.52     0.898       8.38 8.36e-17   5.76       9.28 
## 2 bmi            0.153    0.0297      5.16 2.65e- 7   0.0949     0.211

# (c) Extract and report: slope, intercept, t-statistic, p-value
tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). Phys_days = 7.52 + 0.153*BMI
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? For each 1 point increase in BMI, the number of days with poor health is estimated to increase by 0.153 days, on average.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? This is the estimated mean number of days with poor health when BMI = 0. The intercept is not interpretable in this context, but is necessary to anchor the line.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. Yes. The null hypothesis is that there is no linear association between the number of days with poor health and BMI, which can be rejected (p< 0.05) suggesting there is a linear relationship. The test-statistic is 8.38.

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: Phys_days ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Phys_days ~ BMI
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3314.969	3314.969	26.609	0
Residuals	2998	373487.391	124.579	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
augmented <- augment(model_slr)
augmented %>%
  select(phys_days, bmi, .fitted, .resid) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "First 10 Observations: Observed, Fitted, and Residual Values",
    col.names = c("Phys_days (Y)", "BMI (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

First 10 Observations: Observed, Fitted, and Residual Values
Phys_days (Y)	BMI (X)	Fitted (Ŷ)	Residual (e = Y − Ŷ)
2	23.06	11.049	-9.049
3	19.77	10.545	-7.545
15	23.34	11.092	3.908
2	29.05	11.966	-9.966
2	23.30	11.085	-9.085
7	24.27	11.234	-4.234
30	31.64	12.362	17.638
30	32.45	12.486	17.514
28	28.70	11.912	16.088
1	26.63	11.595	-10.595

n <- nrow(data_1)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3))
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value
SS Residual	373487.390
MSE (σ̂²)	124.57

#SSTotal - 376802.36
#SSRegression - 3314.97
#SSResidual - 373487.39

# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0088
Adjusted R²	0.0085
Variance Explained	0.88%

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. SSTotal - 376802.36 SSRegression - 3314.97 SSResidual - 373487.39 df (numerator) - 1 df (denominator) - 2998 F-statistic - 26.61
What is the \(R^2\) value? Interpret it in plain English. The r2 value is 0.0088. This tells us the proportion of total variability in days with poor health explained by the linear regression on BMI.Essentially, the r2 value shows how well the regression model explains the variation in days with poor health.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? BMI alone only explains 0.88% of the variation seen in days with poor health. Other factors such as age and other health metrics, such as cardiovascular disease, could help explain more of the variation. —

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(15, 20, 25, 30, 35, 40, 45, 55))

ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
    col.names = c("BMI", "Fitted phsy_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age
		95% CI for Mean		95% PI for Individual
BMI	Fitted phsy_days	CI Lower	CI Upper	PI Lower	PI Upper
15	9.82	8.88	10.75	-12.09	31.72
20	10.58	9.90	11.26	-11.32	32.48
25	11.35	10.87	11.82	-10.54	33.24
30	12.11	11.71	12.51	-9.78	34.00
35	12.88	12.36	13.39	-9.01	34.77
40	13.64	12.91	14.37	-8.26	35.54
45	14.41	13.42	15.39	-7.50	36.31
55	15.94	14.40	17.48	-6.00	37.88

#Assuming this question is asking for the fitted days with poor health for someone with a BMI of 25,
#the answer would be 11.35.

# (b) Calculate the 95% prediction interval for a person with BMI = 25
#11.35±0.47 (10.87, 11.82)

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(14, 60, length.out = 200))

ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = data_1, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: BMI ~ Age",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Days with poor health",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)
p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? 11.35±0.47 (10.87, 11.82)
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? (-10.54, 33.24)
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? Prediction intervals are wider because it accounts for the uncertainty of predicting a future individual instead of a sample mean. Confidence intervals should be used to explain sample data, whereas prediction intervals should be used when estimating new data.

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot
p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "BMI",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x

# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The residuals vs. fitted line is relatively straight so does not show evidence of heteroscedasticity.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? The residuals are not normally distributed. The s-shaped curvature suggests that the number of days with poor health has outliers on both extremes.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Yes, there are 133 influential observations. I would only remove points with a cooks distance above 1.0, of which there are none.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) The LINE assumptions do not appear to be met. The errors are not normally distributed, as evidenced by the qq plot.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_slr_2 <- lm(phys_days ~ age, data = data_1)
summary(model_slr_2)

## 
## Call:
## lm(formula = phys_days ~ age, data = data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16

# (b) Display results and compare to the BMI model
tidy(model_slr_2, conf.int = TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    4.34     0.683       6.35 2.45e-10    3.00      5.67 
## 2 age            0.138    0.0117     11.8  2.16e-31    0.115     0.161

tidy(model_slr, conf.int = TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    7.52     0.898       8.38 8.36e-17   5.76       9.28 
## 2 bmi            0.153    0.0297      5.16 2.65e- 7   0.0949     0.211

tidy(model_slr_2, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3359	0.6826	6.3521	0	2.9975	5.6743
age	0.1377	0.0117	11.7893	0	0.1148	0.1606

tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

# (c) Which predictor has the stronger association? Compare R² values.
r_sq_2 <- summary(model_slr_2)$r.squared
r_sq <- summary(model_slr)$r.squared

#Age and days with poor health has a stronger association (r2=0.0443) than bmi and days with poor health (r2=0.0088).

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Age and days with poor health has a stronger association (r2=0.0443) than bmi and days with poor health (r2=0.0088). Both are significant direct associations, with number of days with poor health increasing with age and bmi.
Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days? Age and days with poor health has a stronger association (r2=0.0443) than bmi and days with poor health (r2=0.0088). Age explains 4.4% of the variability of days with poor health.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? There are many factors involved in physical health. A multiple linear regression model would allow for a more robust analysis by accounting for multiple variables.

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Matthew Goldman

February 26, 2026

Overview

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)