Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.4.3
library(here)
## Warning: package 'here' was built under R version 4.4.3
## here() starts at C:/Users/sarah/OneDrive/Documents/EPI 553
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(broom)
## Warning: package 'broom' was built under R version 4.4.3
# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "/Users/sarah/OneDrive/Documents/EPI 553/data/LLCP2020.XPT"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here(
  "data", "brfss_lab_2020.rds"
))

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
summary_table <- brfss_lab %>%
  summarise(
    n_phys = sum(!is.na(phys_days)),
    mean_phys = mean(phys_days),
    sd_phys = sd(phys_days),
    median_phys = median(phys_days),
    min_phys = min(phys_days),
    max_phys = max(phys_days),
  
    n_bmi = sum(!is.na(bmi)),
    mean_bmi = mean(bmi),
    sd_bmi = sd(bmi),
    median_bmi = median(bmi),
    min_bmi = min(bmi),
    max_bmi= max(bmi)
  )
    
print(summary_table)
## # A tibble: 1 × 12
##   n_phys mean_phys sd_phys median_phys min_phys max_phys n_bmi mean_bmi sd_bmi
##    <int>     <dbl>   <dbl>       <dbl>    <dbl>    <dbl> <int>    <dbl>  <dbl>
## 1   3000      12.0    11.2           7        1       30  3000     29.5   6.87
## # ℹ 3 more variables: median_bmi <dbl>, min_bmi <dbl>, max_bmi <dbl>
# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_lab, aes(x=phys_days)) +
  geom_histogram(bins = 30, color = "white", fill = "steelblue") +
  labs(
    title = "Distribution of Poor Physical Health Days",
    x = "Days of Poor Physical Health (past 30 days)",
    y = "Count"
  )

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Scatter Plot of BMI vs Poor Physical Health Days",
    x = "BMI",
    y = "Poor Physical Health Days"
  )+
  theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Questions:

  1. What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

The mean of phys_days is 12 and the standard deviation is 11.2. The mean bmi is 29.5 with a standard deviation of 6.87. The distribution of poor physical health days shows two peaks one with a low number of poor physical health days and a spike at 30 days. The count does taper off as the days increase other than the spike at 30 days.

  1. Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers?

Based on the scatterplot, the relationship between BMI and poor physical health days is not linear. The curve line bends at both tails suggesting skewness. There are no obvious outliers.

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
slr_model <- lm(phys_days ~ bmi, data= brfss_lab)

summary(slr_model)
## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(slr_model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: Poor Physical Health Days ~ BMI
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 7.5193 0.8978 8.3751 0 5.7589 9.2796
bmi 0.1531 0.0297 5.1584 0 0.0949 0.2112
# (c) Extract and report: slope, intercept, t-statistic, p-value

b0 <- round(coef(slr_model) [1], 3)
b1 <- round(coef(slr_model) [2], 4)

Questions:

  1. Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

\(\hat{Y} = 7.519 + 0.153 * BMI\)

  1. Interpret the slope (\(b_1\)) in context — what does it mean in plain English?

The slope, 0.1531, indicates that for each unit increase in BMI, there will be an increase of about .15 poor physical health days.

  1. Is the intercept (\(b_0\)) interpretable in this context? Why or why not?

The intercept is not interpretable. It represents the predicted number of poor physical health days when BMI = 0. A BMI of 0 is not possible, therefore the intercept doesn’t have a meaningful real-world interpretation.

  1. Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. There is a statistically significant association between BMI and the number of poor physical health days.

H0: b1 = 0 Ha: b1 != 0 Test statistic: t= 5.1584 p-value: <.0001


Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
slr_anova <-  anova(slr_model)

slr_anova %>%
  kable (
    caption = "ANOVA Table: Days of Poor Physical Health ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: Days of Poor Physical Health ~ BMI
Source Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 3314.969 3314.969 26.609 0
Residuals 2998 373487.391 124.579 NA NA
# (b) Compute and report SSTotal, SSRegression, and SSResidual

ss_regression <- slr_anova["bmi", "Sum Sq"]
ss_residual   <- slr_anova["Residuals", "Sum Sq"]
ss_total      <- ss_regression + ss_residual

# (c) Compute R² two ways: from the model object and from the SS decomposition

summary(slr_model)$r.squared
## [1] 0.008797634
r_squared <- ss_regression / ss_total

Questions:

  1. Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

\(SS_{Total}\) : 376802.36 \(SS_{Regression}\) : 3314.969 \(SS_{Residual}\) : 373487.391 \(df\) : 1 \(F\) : 26.609 b) What is the \(R^2\) value? Interpret it in plain English.

The R^2 value is 0.0088. This indicates that BMI explains less than 1% of variance in poor physical health days, meaning The association is statistically significant, but the effect size is small.

  1. What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?

The R^2 value indicates that BMI alone is not a strong predictor of poor physical health days. The remaining variation can be explained by a multitude of factors such as mental health status, disability, socioeconomic status, access to healthcare,age, physical activity, stress, etc.

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(25))

predict(slr_model, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
##     Fitted CI_Lower CI_Upper
## 1 11.34566 10.86895 11.82236
# (b) Calculate the 95% prediction interval for a person with BMI = 25

predict(slr_model, newdata = new_bmi, interval = "prediction")
##        fit       lwr     upr
## 1 11.34566 -10.54449 33.2358
# (c) Plot the regression line with both the CI band and PI band
bmi_seq <- data.frame(bmi = seq(min(brfss_lab$bmi),    max(brfss_lab$bmi),             length.out = 200))

ci_band <- predict(slr_model, newdata = bmi_seq, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_seq)

pi_band <- predict(slr_model, newdata = bmi_seq, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_seq)

p_ci_pi <- ggplot() +
    geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: phys_days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

  1. For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

For someone with a BMI of 25, the estimated mean number of poor physical health days is 11.35 days. The 95% confidence interval for this mean is [10.87, 11.82].

  1. If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

For a specific new person with a BMI of 25, the 95% confidence interval for their number of poor physical health days is [-10.54, 33.24].

  1. Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? The prediction interval accounts for uncertainty in mean response and individual variability around the mean, whereas the confidence interval provides a range of values for a population mean.

CI is used when estimating the average outcome at a given X, like in part (a) when estimating the average number of poor physical health days for someone with a BMI of 25. PI would be used to predict the individual outcome at X, like predicting the actual number of poor physical health days for a new person with a BMI of 25.


Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())

par(mfrow = c(2,2))
plot(slr_model, which = 1:4)

# (b) Create a residuals vs. fitted plot using ggplot
augmented <- augment(slr_model)
ggplot(augmented, aes( x= .fitted, y= .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs Fitted Values",
    x = "Fitted Values",
    y = "Residuals"
  )
## `geom_smooth()` using formula = 'y ~ x'

# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(augmented, aes(sample = .resid)) +
  geom_qq(color = "steelblue", size = 2)+
  geom_qq_line(color = "red", linetype = "dashed") +
  labs(
    title = "Q-Q Plot of Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

# (d) Create a Cook's distance plot

ggplot(augmented, aes(x = seq_along(.cooksd), y= .cooksd)) +
    geom_point(alpha = 0.6, color = "steelblue") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance",
    x = "Observation",
    y = "Cook's Distance"
  )

Questions:

  1. Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The plot shows evidence of nonlinearity as the smooth curve bends meaning the linear model doesn’t fully capture the relationship between BMI and poor physical health days. There is also evidence of heteroscedasticity where the residuals are tightly clustered at lower fitted values, spreading out as the fitted values increase.

  2. Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days?

The middle points seem to follow the reference line, but the tails deviate away from the line. The deviations indicate heavier tails and some skewness. The distribution of phys_days is right-skewed with a spike at 30 days, suggesting a non normal distribution.

  1. Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them?

There were no influential observations. All values were extremely small, no data points need to be adjusted.

  1. Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.)

The line assumptions are partially met. linearity shows mild violations, and the normality assumption is most problematic. The plot showed clearly deviations for the reference line at the tails reflecting the right-skewed pattern we noted in the distribution.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
slr_model_age <- lm(phys_days ~ age, data = brfss_lab)

summary(slr_model_age)
## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
#Results for age model
tidy(slr_model_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Poor Physical Health Days ~ Age",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: Poor Physical Health Days ~ Age
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 4.3359 0.6826 6.3521 0 2.9975 5.6743
age 0.1377 0.0117 11.7893 0 0.1148 0.1606
#Results for BMI model
tidy(slr_model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: Poor Physical Health Days ~ BMI
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 7.5193 0.8978 8.3751 0 5.7589 9.2796
bmi 0.1531 0.0297 5.1584 0 0.0949 0.2112
# (c) Which predictor has the stronger association? Compare R² values.
r2_age <- summary(slr_model_age)$r.squared
r2_bmi <- summary(slr_model)$r.squared

r2_age
## [1] 0.04430638
r2_bmi
## [1] 0.008797634

Questions:

  1. How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

Both predictors have a positive association with poor physical health days. Age and BMI move in the same direction based on the slope. The slope sizes are similar. For age the slope was 0.138 and bmi slope was 0.153. Both are statistically significant but the t-value for age was 11.79, much higher than 5.16 for bmi. b) Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days?

The r^2 value for age was 0.044, whereas the r^2 for BMI was 0.0088. Age explains more variability as it accounts for 4.4% of variance. BMI only accounts for 0.88%.

  1. Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

Based on the two models, age is the stronger predictor because it explains more variation in poor physical health days than BMI. Both predictors explain little variation indicating other factors have a larger effect on poor physical health days. Limitations of the simple linear regression was phys_day was right-skewed causing non-normal residuals and nonlinearity.


Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.