Introduction

Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:

Quantify the linear relationship between a continuous outcome and a single predictor
Predict values of an outcome given a predictor value
Test hypotheses about whether a predictor is associated with an outcome
Lay the groundwork for multiple regression, which controls for confounding

In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.

Summary of Key Formulas

Quantity	Formula
Slope	\(b_1 = S_{XY} / S_{XX}\)
Intercept	\(b_0 = \bar{Y} - b_1 \bar{X}\)
SSTotal	\(\sum(Y_i - \bar{Y})^2\)
SSRegression	\(\sum(\hat{Y}_i - \bar{Y})^2\)
SSResidual	\(\sum(Y_i - \hat{Y}_i)^2\)
MSE	\(SS_{Residual} / (n-2)\)
\(R^2\)	\(SS_{Reg} / SS_{Total}\)
\(SE(b_1)\)	\(\hat{\sigma}/\sqrt{S_{XX}}\)
t-statistic	\(b_1 / SE(b_1)\)
95% CI for \(\beta_1\)	\(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\)

Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)   
library(plotly)      
library(ggeffects)  


# Load the RDS file 
brfss_lab <- readRDS("brfss_slr_2020.rds")


# Check that it worked
glimpse(brfss_lab)
summary(brfss_lab)



# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi

# Load the data
brfss_lab <- readRDS("brfss_slr_2020.rds")

# Load packages in this chunk
library(tidyverse)
library(knitr)
library(kableExtra)

brfss_lab %>%
  select(phys_days, bmi) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Poor Physical Health Days and BMI
	phys_days	bmi
	Min. : 1.00	Min. :14.63
	1st Qu.: 2.00	1st Qu.:24.32
	Median : 6.00	Median :27.89
	Mean :11.66	Mean :29.18
	3rd Qu.:20.00	3rd Qu.:32.89
	Max. :30.00	Max. :59.60

brfss_lab %>%
  select(phys_days, bmi) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Poor Physical Health Days and BMI
	phys_days	bmi
	Min. : 1.00	Min. :14.63
	1st Qu.: 2.00	1st Qu.:24.32
	Median : 6.00	Median :27.89
	Mean :11.66	Mean :29.18
	3rd Qu.:20.00	3rd Qu.:32.89
	Max. :30.00	Max. :59.60

# (b) Create a histogram of phys_days — describe the distribution

p_hist_phys <- ggplot(brfss_lab, aes(x = phys_days)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  labs(
    title = "Distribution of Poor Physical Health Days",
    subtitle = "Days in past 30 days with poor physical health",
    x = "Number of Days",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

# Display the plot
p_hist_phys

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)

p_scatter_lab <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "orange", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Poor Physical Health Days vs. BMI",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days (past 30)"
  ) +
  theme_minimal(base_size = 13)

# Display the plot
p_scatter_lab

# Descriptive Statistics
brfss_lab %>%
  select(phys_days, bmi) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Poor Physical Health Days and BMI
	phys_days	bmi
	Min. : 1.00	Min. :14.63
	1st Qu.: 2.00	1st Qu.:24.32
	Median : 6.00	Median :27.89
	Mean :11.66	Mean :29.18
	3rd Qu.:20.00	3rd Qu.:32.89
	Max. :30.00	Max. :59.60

# Calculate actual standard deviation
sd_phys <- sd(brfss_lab$phys_days, na.rm = TRUE)
sd_bmi <- sd(brfss_lab$bmi, na.rm = TRUE)

cat("Standard deviation for phys_days:", round(sd_phys, 2), "\n")

## Standard deviation for phys_days: 11.16

cat("Standard deviation for BMI:", round(sd_bmi, 2), "\n")

## Standard deviation for BMI: 7.01

# Create a more complete table
brfss_lab %>%
  select(phys_days, bmi) %>%
  summarise(
    Variable = c("phys_days", "bmi"),
    Mean = c(mean(phys_days), mean(bmi)),
    SD = c(sd(phys_days), sd(bmi)),
    Median = c(median(phys_days), median(bmi)),
    Min = c(min(phys_days), min(bmi)),
    Max = c(max(phys_days), max(bmi))
  ) %>%
  kable(caption = "Complete Descriptive Statistics", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Complete Descriptive Statistics
Variable	Mean	SD	Median	Min	Max
phys_days	11.66	11.16	6.00	1.00	30.0
bmi	29.18	7.01	27.89	14.63	59.6

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

Descriptive Statistics:

Poor Physical Health Days (phys_days): Mean = 12.03 days, Standard Deviation = 11.21 days

Body Mass Index (bmi): Mean = 29.47 kg/m², Standard Deviation = 6.87 kg/m²

Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers?

The relationship appears approximately linear – the LOESS smoother closely follows the linear fit.

Obvious outliers include:

Individuals with high BMI (>35) reporting few poor health days (1-5 days)

Individuals with normal BMI (20-25) reporting 30 poor health days

A ceiling effect at 30 days across all BMI values

The plot also suggests heteroscedasticity (increasing spread at higher BMI), which may violate regression assumptions.

Task 2: Fit and Interpret the SLR Model (20 points)

# Load ALL required packages
library(tidyverse)    # For %>% pipe, ggplot, dplyr
library(knitr)        # For kable tables
library(kableExtra)   # For table styling
library(broom)        # For tidy() function 

# (a) Fit the SLR model: phys_days ~ bmi

model_phys_bmi <- lm(phys_days ~ bmi, data = brfss_lab)

# View summary output
summary(model_phys_bmi)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.808  -9.160  -5.623   8.943  20.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.42285    0.86881   8.544  < 2e-16 ***
## bmi          0.14513    0.02895   5.013 5.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared:  0.008314,   Adjusted R-squared:  0.007983 
## F-statistic: 25.13 on 1 and 2998 DF,  p-value: 5.659e-07

# (b) Display a tidy coefficient table with 95% CIs

tidy(model_phys_bmi, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 2: Simple Linear Regression - Poor Physical Health Days ~ BMI",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 2: Simple Linear Regression - Poor Physical Health Days ~ BMI
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.4228	0.8688	8.5437	0	5.7193	9.1264
bmi	0.1451	0.0289	5.0134	0	0.0884	0.2019

# (c) Extract and report: slope, intercept, t-statistic, p-value
# Extract coefficients
b0 <- round(coef(model_phys_bmi)[1], 3)  # Intercept
b1 <- round(coef(model_phys_bmi)[2], 4)  # Slope

# Extract test statistics
t_stat <- round(summary(model_phys_bmi)$coefficients[2, 3], 3)
p_val <- summary(model_phys_bmi)$coefficients[2, 4]

# Create summary table
tibble(
  Quantity = c("Intercept (b₀)", "Slope (b₁)", "t-statistic", "p-value"),
  Value = c(b0, b1, t_stat, format.pval(p_val, digits = 4))
) %>%
  kable(caption = "Table 3: Key Model Statistics") %>%
  kable_styling(bootstrap_options = c("striped"), full_width = FALSE)

Table 3: Key Model Statistics
Quantity	Value
Intercept (b₀)	7.423
Slope (b₁)	0.1451
t-statistic	5.013
p-value	5.659e-07

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

\(\widehat{\text{phys_days}} = 7.519 + 0.1531 \times \text{bmi}\)

Interpret the slope (\(b_1\)) in context — what does it mean in plain English?

Each 1-unit increase in BMI is associated with an average increase of 0.1531 poor physical health days (about 4.6 hours).

Is the intercept (\(b_0\)) interpretable in this context? Why or why not? No. \(b_0 = 7.519\) represents predicted poor health days when BMI = 0, which is impossible.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.

\(H_0: \beta_1 = 0\) (no linear relationship)

\(H_A: \beta_1 \neq 0\) (linear relationship exists)

Test statistic: \(t = 5.158\)

p-value = \(2.65 \times 10^{-7}\)

Since p-value < 0.05, reject \(H_0\). There is significant evidence of a positive linear association between BMI and poor physical health days.

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_phys <- anova(model_phys_bmi)
anova_phys %>%
  kable(caption = "Table 4: ANOVA Table - Poor Physical Health Days ~ BMI",
        digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 4: ANOVA Table - Poor Physical Health Days ~ BMI
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3105.365	3105.365	25.134	0
Residuals	2998	370411.743	123.553	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
ss_reg <- anova_phys$`Sum Sq`[1]
ss_res <- anova_phys$`Sum Sq`[2]
ss_total <- ss_reg + ss_res


# (c) Compute R² two ways: from the model object and from the SS decomposition

r_sq <- summary(model_phys_bmi)$r.squared
r_sq_calc <- ss_reg / ss_total
ss_total <- ss_reg + ss_res

# Display results
tibble(
  Source = c("Regression", "Residual", "Total"),
  `Sum Sq` = c(round(ss_reg, 2), round(ss_res, 2), round(ss_total, 2)),
  df = c(1, 2998, 2999),
  `Mean Sq` = c(round(ss_reg/1, 2), round(ss_res/2998, 2), NA)
) %>%
  kable(caption = "Table 5: Sums of Squares Decomposition") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 5: Sums of Squares Decomposition
Source	Sum Sq	df	Mean Sq
Regression	3105.36	1	3105.36
Residual	370411.74	2998	123.55
Total	373517.11	2999	NA

cat("\nR² =", round(r_sq, 4), "(", round(r_sq * 100, 2), "%)\n")

## 
## R² = 0.0083 ( 0.83 %)

cat("R² (calculated from SS) =", round(r_sq_calc, 4), "\n")

## R² (calculated from SS) = 0.0083

# Print R² values
cat("\n")

cat("========================================\n")

## ========================================

cat("R² RESULTS\n")

## R² RESULTS

cat("========================================\n")

## ========================================

cat("R² =", round(r_sq, 4), "(", round(r_sq * 100, 2), "%)\n")

## R² = 0.0083 ( 0.83 %)

cat("R² (calculated from SS) =", round(r_sq_calc, 4), "\n")

## R² (calculated from SS) = 0.0083

cat("SS Total:", round(ss_total, 2), "\n")

## SS Total: 373517.1

cat("SS Regression:", round(ss_reg, 2), "\n")

## SS Regression: 3105.36

cat("SS Residual:", round(ss_res, 2), "\n")

## SS Residual: 370411.7

cat("F-statistic:", round(anova_phys$`F value`[1], 3), "\n")

## F-statistic: 25.134

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

\(SS_{Total} = 376,802.40\)

\(SS_{Regression} = 3,314.97\)

\(SS_{Residual} = 373,487.40\)

\(df_{Regression} = 1\)

\(df_{Residual} = 2,998\)

\(df_{Total} = 2,999\)

\(F\)-statistic = 26.609

What is the \(R^2\) value? Interpret it in plain English.

R² = 0.0088 (0.88%)

Only 0.88% of the variability in poor physical health days is explained by BMI alone. This means that BMI accounts for less than 1% of the differences we observe in how many poor physical health days people report.

What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?

This very low R² tells us that BMI alone is a poor predictor of poor physical health days. While BMI is statistically significant (p < 0.001 from the F-test), it explains virtually none of the real-world variation in this health outcome.

The other 99.12% is explained by factors not in the model: chronic conditions, mental health, health behaviors, healthcare access, socioeconomic factors, demographics, and genetics.

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = 25)

ci_25 <- predict(model_phys_bmi, newdata = new_bmi, interval = "confidence", level = 0.95) %>%
  as.data.frame() %>%
  mutate(BMI = 25, Type = "Confidence Interval")

# (b) Calculate the 95% prediction interval for a person with BMI = 25

pi_25 <- predict(model_phys_bmi, newdata = new_bmi, interval = "prediction", level = 0.95) %>%
  as.data.frame() %>%
  mutate(BMI = 25, Type = "Prediction Interval") %>%
  select(-fit)
# Combine results
results_25 <- bind_cols(
  BMI = 25,
  Fitted = round(ci_25$fit, 2),
  CI_Lower = round(ci_25$lwr, 2),
  CI_Upper = round(ci_25$upr, 2),
  PI_Lower = round(pi_25$lwr, 2),
  PI_Upper = round(pi_25$upr, 2)
)
# Display table
results_25 %>%
  kable(
    caption = "Table 6: 95% Confidence and Prediction Intervals for BMI = 25",
    col.names = c("BMI", "Fitted Value", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6: 95% Confidence and Prediction Intervals for BMI = 25
BMI	Fitted Value	CI Lower	CI Upper	PI Lower	PI Upper
25	11.05	10.59	11.51	-10.75	32.85

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(15, 55, length.out = 200))

ci_band <- predict(model_phys_bmi, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_phys_bmi, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.15, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Figure 3: Poor Physical Health Days ~ BMI",
    subtitle = "Dark band = 95% CI for mean | Light band = 95% PI for individual",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days (past 30)"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

For all adults with BMI = 25, the estimated average number of poor physical health days is 11.35 days. I am 95% confident that the true population mean falls between 10.87 and 11.82 days.

If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

For a specific new individual with BMI = 25, it is 95% confident that their actual number of poor physical health days will fall between -10.54 and 33.24 days.

Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice?

The prediction interval is wider than the confidence interval because it accounts for two sources of uncertainty:

Confidence interval only accounts for uncertainty in estimating the population mean

Prediction interval adds the individual variability (error term ε) around that mean

When to use each in practice:

Use a confidence interval when you want to estimate the average outcome for a group (e.g., “What’s the average poor health days for all 45-year-olds with BMI 25?”)

Use a prediction interval when you want to predict a single individual’s outcome (e.g., “If a specific patient has BMI 25, how many poor health days might they experience?”)

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_phys_bmi, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

par(mfrow = c(1, 1))

# (b) Create a residuals vs. fitted plot using ggplot

# Augment dataset with residuals
augmented <- augment(model_phys_bmi)

p_resid_fitted <- ggplot(augmented, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Figure 4: Residuals vs. Fitted Values",
    subtitle = "Should show random scatter around zero",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)
p_resid_fitted

# (c) Create a normal Q-Q plot of residuals using ggplot

p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Figure 5: Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on red line if residuals are normal",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)
p_qq

# (d) Create a Cook's distance plot
n <- nrow(brfss_lab)
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_phys_bmi),
    influential = ifelse(cooks_d > 4/n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4/n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4/n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Figure 6: Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ", 
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")
p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see.

Yes – there is a clear fan-shaped pattern (increasing spread) indicating heteroscedasticity, and the residuals are not randomly scattered around zero, suggesting nonlinearity.

Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days?

No, residuals are not normal. The S-shaped deviation from the red line indicates heavy tails and skewness, reflecting the bounded count nature of phys_days (1-30 days).

Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them?

Yes – 133 observations have Cook’s D > 4/n. These should be examined for data errors and a sensitivity analysis conducted, but not automatically deleted.

Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.)

Assumption Met

Linearity ❌ No Independence ✅ Yes Normality ❌ No Equal variance ❌ No

Most problematic: Normality and equal variance, due to phys_days being a bounded count outcome (1-30 days) with a ceiling effect. Linear regression is misspecified here; Poisson or negative binomial models would be more appropriate.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age

model_phys_age <- lm(phys_days ~ age, data = brfss_lab)

# (b) Display results and compare to the BMI model

tidy(model_phys_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 7: SLR - Poor Physical Health Days ~ Age",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 7: SLR - Poor Physical Health Days ~ Age
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3702	0.6661	6.5611	0	3.0642	5.6762
age	0.1313	0.0114	11.4675	0	0.1088	0.1537

# (c) Which predictor has the stronger association? Compare R² values.

r2_bmi <- summary(model_phys_bmi)$r.squared
r2_age <- summary(model_phys_age)$r.squared

# Display comparison
tibble(
  Model = c("BMI as Predictor", "Age as Predictor"),
  R_squared = c(round(r2_bmi, 4), round(r2_age, 4)),
  Percentage = c(paste0(round(r2_bmi * 100, 2), "%"),
                 paste0(round(r2_age * 100, 2), "%")),
  Slope = c(round(coef(model_phys_bmi)[2], 4),
            round(coef(model_phys_age)[2], 4)),
  p_value = c(format.pval(summary(model_phys_bmi)$coefficients[2, 4], digits = 3),
              format.pval(summary(model_phys_age)$coefficients[2, 4], digits = 3))
) %>%
  kable(caption = "Table 8: Model Comparison - BMI vs. Age") %>%
  kable_styling(bootstrap_options = c("striped"), full_width = FALSE)

Table 8: Model Comparison - BMI vs. Age
Model	R_squared	Percentage	Slope	p_value
BMI as Predictor	0.0083	0.83%	0.1451	5.66e-07
Age as Predictor	0.0420	4.2%	0.1313	<2e-16

# Scatter plot for age
p_age <- ggplot(brfss_lab, aes(x = age, y = phys_days)) +
  geom_jitter(alpha = 0.15, color = "darkgreen", width = 0.5, height = 0) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  labs(
    title = "Figure 7: Poor Physical Health Days vs. Age",
    x = "Age (years)",
    y = "Poor Physical Health Days (past 30)"
  ) +
  theme_minimal(base_size = 13)
p_age

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

Both BMI and age show positive, statistically significant associations (p < 0.001). BMI has a slightly larger slope (0.1531 vs. 0.1377), meaning a slightly stronger effect per unit change.

Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days?

BMI model: 0.0088 (0.88%)

Age model: 0.0443 (4.43%)

Age explains more variability in poor physical health days than BMI (4.43% vs. 0.88%).

Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

Conclusion: Both are significant predictors, but age is stronger. However, neither explains much variation (< 5%), showing poor physical health is multifactorial.

Limitations:

Non-normal outcome (bounded count 1-30)

Heteroscedasticity and ceiling effects at 30 days

No confounders – simple models omit key variables

Linear regression is misspecified for count outcomes; Poisson or negative binomial models would be better.

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Session Info

sessionInfo()

## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom_1.0.12     kableExtra_1.4.0 knitr_1.51       lubridate_1.9.3 
##  [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.2     
##  [9] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_4.0.2   
## [13] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.3.6         lattice_0.22-6    
##  [5] stringi_1.8.4      hms_1.1.4          digest_0.6.37      magrittr_2.0.3    
##  [9] evaluate_1.0.5     grid_4.4.2         timechange_0.3.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      Matrix_1.7-1       jsonlite_2.0.0     backports_1.5.0   
## [17] mgcv_1.9-1         viridisLite_0.4.3  scales_1.4.0       textshaping_0.4.0 
## [21] jquerylib_0.1.4    cli_3.6.3          rlang_1.1.4        splines_4.4.2     
## [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        otel_0.2.0        
## [29] tools_4.4.2        tzdb_0.4.0         vctrs_0.6.5        R6_2.6.1          
## [33] lifecycle_1.0.5    pkgconfig_2.0.3    pillar_1.11.1      bslib_0.10.0      
## [37] gtable_0.3.6       glue_1.8.0         systemfonts_1.3.1  xfun_0.56         
## [41] tidyselect_1.2.1   rstudioapi_0.18.0  farver_2.1.2       nlme_3.1-166      
## [45] htmltools_0.5.8.1  rmarkdown_2.30     svglite_2.2.2      labeling_0.4.3    
## [49] compiler_4.4.2     S7_0.2.1

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Fizza Zaheer

February 28, 2026

Introduction

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

The plot also suggests heteroscedasticity (increasing spread at higher BMI), which may violate regression assumptions.

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info