Introduction

Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:

  • Quantify the linear relationship between a continuous outcome and a single predictor
  • Predict values of an outcome given a predictor value
  • Test hypotheses about whether a predictor is associated with an outcome
  • Lay the groundwork for multiple regression, which controls for confounding

In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.


Setup and Data

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

Loading BRFSS 2020 Data

We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_2020"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_slr <- brfss_full %>%
  select(bmi5, age80, sex, educag, genhlth, physhlth, sleptim1)

# Recode variables
brfss_slr <- brfss_slr %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    education = factor(case_when(
      educag == 1 ~ "< High school",
      educag == 2 ~ "High school graduate",
      educag == 3 ~ "Some college",
      educag == 4 ~ "College graduate"
    ), levels = c("< High school", "High school graduate",
                  "Some college", "College graduate")),
    gen_health_num = ifelse(genhlth  %in% 1:5,  genhlth,  NA_real_),
    sleep_hrs      = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_),
    phys_days      = ifelse(physhlth %in% 0:30, physhlth, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_slr <- brfss_slr %>%
  select(bmi, age, sex, education, gen_health_num, sleep_hrs, phys_days) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_slr, here::here(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_slr.rds"
))

tibble(
  Metric = c("Observations", "Variables"),
  Value  = c(nrow(brfss_slr), ncol(brfss_slr))
) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 3000
Variables 7

Descriptive Statistics

brfss_slr %>%
  select(bmi, age, sleep_hrs, phys_days) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Descriptive Statistics: Key Continuous Variables
bmi age sleep_hrs phys_days
Min. :14.63 Min. :18.00 Min. : 1.000 Min. : 1.00
1st Qu.:24.32 1st Qu.:43.00 1st Qu.: 6.000 1st Qu.: 2.00
Median :27.89 Median :58.00 Median : 7.000 Median : 6.00
Mean :29.18 Mean :55.52 Mean : 6.915 Mean :11.66
3rd Qu.:32.89 3rd Qu.:70.00 3rd Qu.: 8.000 3rd Qu.:20.00
Max. :59.60 Max. :80.00 Max. :20.000 Max. :30.00
brfss_slr %>%
  select(bmi, age, sleep_hrs, sex, education) %>%
  tbl_summary(
    label = list(
      bmi ~ "BMI (kg/m²)",
      age ~ "Age (years)",
      sleep_hrs ~ "Sleep (hours/night)",
      sex ~ "Sex",
      education ~ "Education"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")
Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)
Characteristic N N = 3,0001
BMI (kg/m²) 3,000 29.2 (7.0)
Age (years) 3,000 55.5 (17.4)
Sleep (hours/night) 3,000 6.9 (1.7)
Sex 3,000
    Female
1,701 (57%)
    Male
1,299 (43%)
Education 3,000
    < High school
237 (7.9%)
    High school graduate
796 (27%)
    Some college
937 (31%)
    College graduate
1,030 (34%)
1 Mean (SD); n (%)

Part 1: Guided Practice — Simple Linear Regression

1. The Simple Linear Regression Model

1.1 What Is the Model?

Simple linear regression models the mean of a continuous outcome \(Y\) as a linear function of a single predictor \(X\):

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n\]

Where:

Symbol Name Meaning
\(Y_i\) Response / Outcome Observed value for subject \(i\) (e.g., BMI)
\(X_i\) Predictor / Covariate Observed predictor for subject \(i\) (e.g., age)
\(\beta_0\) Intercept Expected value of \(Y\) when \(X = 0\)
\(\beta_1\) Slope Expected change in \(Y\) for a 1-unit increase in \(X\)
\(\varepsilon_i\) Error term Random deviation of \(Y_i\) from the regression line

The population regression line (also called the true or theoretical regression line) describes the expected (mean) value of \(Y\) at each value of \(X\):

\[E(Y \mid X) = \mu_{Y|X} = \beta_0 + \beta_1 X\]

1.2 Key Distinction: Population vs. Sample

Population Sample
Line \(\beta_0 + \beta_1 X\) \(\hat{y} = b_0 + b_1 X\)
Intercept \(\beta_0\) (parameter) \(b_0\) or \(\hat{\beta}_0\) (estimate)
Slope \(\beta_1\) (parameter) \(b_1\) or \(\hat{\beta}_1\) (estimate)
Error \(\varepsilon_i\) \(e_i = Y_i - \hat{Y}_i\) (residual)

We use our sample to estimate the population parameters. The estimates \(b_0\) and \(b_1\) define the fitted regression line.

1.3 Visualizing the Relationship

Before fitting any model, always visualize the bivariate relationship.

p_scatter <- ggplot(brfss_slr, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "BMI vs. Age (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "Age (years)",
    y = "BMI (kg/m²)"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

BMI vs. Age — BRFSS 2020

Interpretation tip: The LOESS smoother (orange) follows the data without assuming linearity. When it closely tracks the linear fit (red), a linear model is reasonable. Departures suggest nonlinearity.


2. Assumptions of Simple Linear Regression

A useful mnemonic is LINE:

Letter Assumption Description
L Linearity The relationship between \(X\) and \(E(Y)\) is linear
I Independence Observations are independent of one another
N Normality Errors \(\varepsilon_i\) are normally distributed
E Equal variance Errors have constant variance (homoscedasticity): \(\text{Var}(\varepsilon_i) = \sigma^2\)

Formally, we assume:

\[\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]

This means that for any value of \(X\), the distribution of \(Y\) is:

\[Y \mid X \sim N(\beta_0 + \beta_1 X, \; \sigma^2)\]

Note on independence: In cross-sectional survey data like BRFSS, observations from the same household or geographic cluster may not be fully independent. We acknowledge this limitation but proceed with the standard SLR framework for pedagogical purposes.


3. Estimating the Regression Coefficients

3.1 The Method of Least Squares

We estimate \(\beta_0\) and \(\beta_1\) by finding the values \(b_0\) and \(b_1\) that minimize the sum of squared residuals (SSR):

\[SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2\]

This is called the Ordinary Least Squares (OLS) criterion. Minimizing SSR yields the closed-form solutions:

\[b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]

\[b_0 = \bar{Y} - b_1 \bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means of \(X\) and \(Y\).

Gauss-Markov Theorem: Under the LINE assumptions, OLS estimators are the Best Linear Unbiased Estimators (BLUE) — they have the smallest variance among all linear unbiased estimators.

3.2 Fitting the Model in R

# Fit simple linear regression: BMI ~ Age
model_slr <- lm(bmi ~ age, data = brfss_slr)

# Summary output
summary(model_slr)
## 
## Call:
## lm(formula = bmi ~ age, data = brfss_slr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.633  -4.883  -1.325   3.688  30.340 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.528231   0.427507  69.071   <2e-16 ***
## age         -0.006238   0.007347  -0.849    0.396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.012 on 2998 degrees of freedom
## Multiple R-squared:  0.0002404,  Adjusted R-squared:  -9.312e-05 
## F-statistic: 0.7208 on 1 and 2998 DF,  p-value: 0.396
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: BMI ~ Age (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: BMI ~ Age (BRFSS 2020)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 29.5282 0.4275 69.0708 0.000 28.6900 30.3665
age -0.0062 0.0073 -0.8490 0.396 -0.0206 0.0082

3.3 Interpreting the Coefficients

b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)

Fitted regression equation:

\[\widehat{\text{BMI}} = 29.528 + -0.0062 \times \text{Age}\]

Intercept (\(b_0 = 29.528\)): The estimated mean BMI when age = 0. This is a mathematical artifact — a newborn does not have an adult BMI. The intercept is not directly interpretable in this context, but is necessary to anchor the line.

Slope (\(b_1 = -0.0062\)): For each 1-year increase in age, BMI is estimated to decrease by 0.0062 kg/m², on average, holding all else constant (though there are no other variables in this simple model).

Practical significance vs. statistical significance: Even a small slope can be highly statistically significant with a large sample. Always consider whether the magnitude is meaningful in the real world.

3.4 Visualizing Fitted Values and Residuals

# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)

# Show a sample of fitted values and residuals
augmented %>%
  select(bmi, age, .fitted, .resid) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "First 10 Observations: Observed, Fitted, and Residual Values",
    col.names = c("Observed BMI (Y)", "Age (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
First 10 Observations: Observed, Fitted, and Residual Values
Observed BMI (Y) Age (X) Fitted (Ŷ) Residual (e = Y − Ŷ)
26.58 67 29.110 -2.530
33.47 38 29.291 4.179
35.15 78 29.042 6.108
30.42 65 29.123 1.297
22.67 55 29.185 -6.515
30.11 80 29.029 1.081
35.43 34 29.316 6.114
31.58 71 29.085 2.495
28.13 55 29.185 -1.055
34.01 62 29.141 4.869

The fitted value (Y^) and the observed BMI (Y) are different because a person’s age does not perfectly determine their exact BMI.

  • Fitted Value (Y^): This is the predicted average BMI for anyone of a specific age, based on the trend line calculated by the model. For example, the model might calculate that the average 67-year-old has a BMI of 29.110.

  • Observed BMI (Y): This is the actual, real-world BMI of a specific individual in the dataset.

  • The Residual/Error (e=Y−Y^): The difference between the two is the residual or error term. The error term (εi) represents the “random deviation of Yi from the regression line.”

# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)

p_resid <- ggplot(resid_sample, aes(x = age, y = bmi)) +
  geom_segment(aes(xend = age, yend = .fitted),
               color = "tomato", alpha = 0.5, linewidth = 0.5) +
  geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
  geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
  labs(
    title = "Residuals Illustrated on the Regression Line",
    subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
    x = "Age (years)",
    y = "BMI (kg/m²)"
  ) +
  theme_minimal(base_size = 13)

p_resid
Visualizing Residuals on the Regression Line

Visualizing Residuals on the Regression Line


4. Partitioning Variability: ANOVA Decomposition

4.1 Sums of Squares

The total variability in \(Y\) can be decomposed into two parts:

\[\underbrace{SS_{Total}}_{Total\ variability} = \underbrace{SS_{Regression}}_{Explained\ by\ X} + \underbrace{SS_{Residual}}_{Unexplained}\]

Where:

\[SS_{Total} = \sum(Y_i - \bar{Y})^2 \qquad (df = n-1)\] \[SS_{Regression} = \sum(\hat{Y}_i - \bar{Y})^2 \qquad (df = 1)\] \[SS_{Residual} = \sum(Y_i - \hat{Y}_i)^2 \qquad (df = n-2)\]

In statistics, degrees of freedom refer to the number of independent pieces of information that are free to vary when calculating an estimate. You can think of it as the amount of “data budget” you have. Every time you estimate a parameter (like a mean or a slope), you spend one degree of freedom.

Why it’s \(n - 1\): You have \(n\) total observations. However, to calculate this variance, you first had to calculate the sample mean. Because the sum of deviations from the mean must always equal zero, knowing \(n-1\) of the deviations automatically tells you the last one. You “spent” 1 degree of freedom calculating the mean, leaving you with \(n - 1\).

Why it’s \(1\): In Simple Linear Regression, you are using exactly one predictor variable (e.g., Age) to explain the outcome (e.g., BMI). Because you are only estimating one slope parameter (\(\beta_1\)) to capture this relationship, the regression model has 1 degree of freedom. (Note: In multiple regression, this would be equal to \(k\), the number of predictor variables).

Why it’s \(n - 2\): To calculate the predicted values (\(\hat{Y}\)) and find the residuals, your model had to estimate two parameters from the data: the intercept (\(\beta_0\)) and the slope (\(\beta_1\)). Since you “spent” 2 pieces of information to create the regression line, you are left with \(n - 2\) degrees of freedom for the errors.

# ANOVA decomposition
anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: BMI ~ Age",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: BMI ~ Age
Source Df Sum Sq Mean Sq F value Pr(>F)
age 1 35.438 35.438 0.721 0.396
Residuals 2998 147400.214 49.166 NA NA

4.2 Mean Squared Error (MSE) and \(\hat{\sigma}\)

The Mean Squared Error estimates the variance of the error term:

\[MSE = \frac{SS_{Residual}}{n - 2} = \hat{\sigma}^2\]

The Residual Standard Error \(\hat{\sigma} = \sqrt{MSE}\) is in the same units as \(Y\) and tells us the typical prediction error of the model.

n <- nrow(brfss_slr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "kg/m²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Error Estimates
Quantity Value Units
SS Residual 147400.210
MSE (σ̂²)
    49.16

Interpretation: On average, our model’s predictions are off by about 7.01 BMI units.


5. The Coefficient of Determination: R²

5.1 Definition and Interpretation

\(R^2\) measures the proportion of total variability in \(Y\) explained by the linear regression on \(X\):

\[R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\]

\(R^2\) ranges from 0 to 1:

  • \(R^2 = 0\): \(X\) explains none of the variability in \(Y\)
  • \(R^2 = 1\): \(X\) explains all variability in \(Y\) (perfect fit)
# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R²
Metric Value
2e-04
Adjusted R² -1e-04
Variance Explained 0.02%

5.2 Relationship Between R² and Pearson’s r

For simple linear regression:

\[R^2 = r^2\]

where \(r\) is the Pearson correlation coefficient between \(X\) and \(Y\).

r_pearson <- cor(brfss_slr$age, brfss_slr$bmi)
tibble(
  Quantity   = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
  Value      = c(
    round(r_pearson, 4),
    round(r_pearson^2, 4),
    round(r_sq, 4),
    as.character(round(r_pearson^2, 4) == round(r_sq, 4))
  )
) %>%
  kable(caption = "Pearson r vs. R² from Model") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Pearson r vs. R² from Model
Quantity Value
Pearson r -0.0155
r² (from Pearson) 2e-04
R² (from model) 2e-04
r² = R²? TRUE

Important caveat: A low \(R^2\) does not mean the regression is useless. In epidemiology, outcomes are influenced by many unmeasured factors, so \(R^2\) values of 0.05–0.20 can still yield scientifically meaningful and statistically significant estimates.


6. Hypothesis Testing

6.1 Testing the Slope: Is There a Linear Association?

The most important hypothesis test in SLR is:

\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between X and Y)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]

Test statistic:

\[t = \frac{b_1 - 0}{SE(b_1)} \sim t_{n-2} \quad \text{under } H_0\]

Where:

\[SE(b_1) = \frac{\hat{\sigma}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{\hat{\sigma}}{\sqrt{S_{XX}}}\]

# Extract slope test statistics
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "age")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    n - 2,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
t-Test for the Slope (H₀: β₁ = 0)
Quantity Value
Slope (b₁) -0.0062
SE(b₁) 0.0073
t-statistic -0.849
Degrees of freedom 2998
p-value 0.396
95% CI Lower -0.0206
95% CI Upper 0.0082

Decision: With p = 0.396, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is statistically significant evidence of a linear association between age and BMI.

6.2 The F-Test: Overall Model Significance

The F-test evaluates whether the overall model (i.e., all predictors together) explains a statistically significant portion of the variability in \(Y\). For simple linear regression with one predictor, the F-test is equivalent to the t-test for the slope (\(F = t^2\)).

\[F = \frac{MS_{Regression}}{MS_{Residual}} \sim F_{1,\, n-2} \quad \text{under } H_0\]

  • The Numerator: This represents the amount of variability in the data that your model successfully explains.

  • The Denominator: This represents the “leftover” or unexplained variability (the errors).

When your model does a good job of predicting the outcome, the explained variance goes up, and the unexplained variance goes down. This makes the resulting F-statistic larger. What a higher F-statistic means for your results:

  • Lower p-value: A larger F-statistic pushes the p-value closer to zero.

  • Statistical Significance: If the p-value drops below your alpha level (usually 0.05), you can reject the null hypothesis. It gives you confidence that your overall model actually has some predictive power, rather than just capturing random noise.

f_stat <- summary(model_slr)$fstatistic
f_value <- f_stat[1]
df1 <- f_stat[2]
df2 <- f_stat[3]
p_f <- pf(f_value, df1, df2, lower.tail = FALSE)

tibble(
  Quantity = c("F-statistic", "df (numerator)", "df (denominator)",
               "p-value", "Verification: t²", "Verification: F"),
  Value    = c(
    round(f_value, 3),
    df1,
    df2,
    format.pval(p_f, digits = 3),
    round(slope_test$statistic^2, 3),
    round(f_value, 3)
  )
) %>%
  kable(caption = "F-Test for Overall Model Significance") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
F-Test for Overall Model Significance
Quantity Value
F-statistic 0.721
df (numerator) 1
df (denominator) 2998
p-value 0.396
Verification: t² 0.721
Verification: F 0.721

6.3 Confidence Interval for the Slope

A 95% CI for \(\beta_1\) is:

\[b_1 \pm t_{n-2, \, 0.025} \times SE(b_1)\]

t_crit <- qt(0.975, df = n - 2)
ci_lower <- slope_test$estimate - t_crit * slope_test$std.error
ci_upper <- slope_test$estimate + t_crit * slope_test$std.error

tibble(
  Bound = c("95% CI Lower", "95% CI Upper"),
  Value = c(round(ci_lower, 4), round(ci_upper, 4)),
  Units = c("kg/m² per year", "kg/m² per year")
) %>%
  kable(caption = "95% Confidence Interval for β₁ (manually computed)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
95% Confidence Interval for β₁ (manually computed)
Bound Value Units
95% CI Lower -0.0206 kg/m² per year
95% CI Upper 0.0082 kg/m² per year

7. Confidence Intervals and Prediction Intervals

7.1 Estimating the Mean Response (Confidence Interval)

A confidence interval for the mean response \(E(Y \mid X = x^*)\) gives a range of plausible values for the population mean of \(Y\) at a specific value \(x^*\):

\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE(\hat{Y}^*)\]

Where:

\[SE(\hat{Y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]

7.2 Predicting a Single Observation (Prediction Interval)

A prediction interval gives a range for a single new observation \(Y^*_{new}\) at \(X = x^*\). It is always wider than the confidence interval because it accounts for both the uncertainty in \(E(Y)\) and the individual variability around the mean:

\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE_{pred}\]

Where:

\[SE_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]

# Compute CI and PI at specific age values
new_ages <- data.frame(age = c(25, 35, 45, 55, 65, 75))

ci_pred <- predict(model_slr, newdata = new_ages, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

pi_pred <- predict(model_slr, newdata = new_ages, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_ages, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
    col.names = c("Age", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))
Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age
95% CI for Mean
95% PI for Individual
Age Fitted BMI CI Lower CI Upper PI Lower PI Upper
25 29.37 28.87 29.88 15.61 43.13
35 29.31 28.92 29.70 15.56 43.06
45 29.25 28.95 29.54 15.50 43.00
55 29.19 28.93 29.44 15.43 42.94
65 29.12 28.84 29.41 15.37 42.87
75 29.06 28.68 29.44 15.31 42.81
# Generate CI and PI across the full age range
age_grid <- data.frame(age = seq(18, 80, length.out = 200))

ci_band <- predict(model_slr, newdata = age_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(age_grid)

pi_band <- predict(model_slr, newdata = age_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(age_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_slr, aes(x = age, y = bmi),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = age, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = age, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = age, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: BMI ~ Age",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "Age (years)",
    y = "BMI (kg/m²)",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi
Regression Line with 95% Confidence and Prediction Intervals

Regression Line with 95% Confidence and Prediction Intervals

Key distinction: If you want to estimate the average BMI for all 45-year-olds in the population, use the confidence interval. If you want to predict the BMI of a specific new 45-year-old patient, use the prediction interval.


8. Checking Model Assumptions (Diagnostics)

Fitting a regression model is not enough — we must verify that the LINE assumptions are reasonably met. We do this through residual diagnostics.

8.1 The Four Standard Diagnostic Plots

par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)
Standard Regression Diagnostic Plots

Standard Regression Diagnostic Plots

par(mfrow = c(1, 1))

Interpreting each plot:

1. Residuals vs. Fitted: Checks linearity and equal variance. We want a horizontal red line and random scatter with no pattern. A “fan shape” (spread increasing with fitted values) indicates heteroscedasticity.

2. Normal Q-Q Plot: Checks normality of residuals. Points should fall approximately along the 45° reference line. Heavy tails or S-curves suggest non-normality.

3. Scale-Location (Spread-Location): Another check for equal variance (homoscedasticity). The square root of standardized residuals is plotted against fitted values. A flat line indicates constant variance.

4. Residuals vs. Leverage: Identifies influential observations using Cook’s distance. Points in the upper or lower right corner (beyond the dashed lines) have high influence.

8.2 Residuals vs. Predictor

The Residuals vs. Predictor plot is a visual diagnostic tool used to check if your data meets the core assumptions of Simple Linear Regression—specifically Linearity and Equal Variance (Homoscedasticity).

p_resid_x <- ggplot(augmented, aes(x = age, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. Age",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "Age (years)",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x
Residuals vs. Age — Checking Linearity

Residuals vs. Age — Checking Linearity

  • Y-axis (Residuals): The errors from your model. This is how far off each person’s actual BMI was from the model’s predicted BMI.

  • X-axis (Age): Your predictor variable.

    (Note: In Simple Linear Regression with only one predictor, this plot looks identical in shape to the “Residuals vs. Fitted” plot).

What do the lines mean?

  • The Red Line: This is a flat, horizontal line at exactly zero. If the model predicted a person’s BMI perfectly, their point would land exactly on this line.

  • The Orange Line (LOESS Smoother): This is a moving average of the residuals. It helps your eye track the overall trend of the errors across different ages.

Interpretation:

The orange line stays fairly close to the red line, though there is a very slight curve downwards at the extreme ends of the age range. The vertical spread of the points looks relatively consistent across the ages. Overall, it doesn’t show any severe violations of linearity or equal variance, though there is a lot of random scatter (which aligns with the very low \(R^2\) value we saw earlier).

8.3 Histogram and Q-Q Plot of Residuals

p_hist <- ggplot(augmented, aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(augmented$.resid),
                            sd = sd(augmented$.resid)),
                color = "black", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Distribution of Residuals",
    subtitle = "Red = kernel density | Black dashed = normal distribution",
    x = "Residuals",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

p_hist
Distribution of Residuals

Distribution of Residuals

# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq
Normal Q-Q Plot of Residuals

Normal Q-Q Plot of Residuals

8.4 Identifying Influential Observations

# Cook's distance
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks
Cook's Distance: Identifying Influential Observations

Cook’s Distance: Identifying Influential Observations


9. A Second Example: Sleep and BMI

To reinforce the concepts, let’s fit a second SLR model examining the association between hours of sleep and BMI.

p_sleep <- ggplot(brfss_slr, aes(x = sleep_hrs, y = bmi)) +
  geom_jitter(alpha = 0.15, color = "purple", width = 0.15, height = 0) +
  geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
  labs(
    title = "BMI vs. Nightly Sleep Hours (BRFSS 2020)",
    x = "Average Hours of Sleep per Night",
    y = "BMI (kg/m²)"
  ) +
  theme_minimal(base_size = 13)

p_sleep
BMI vs. Sleep Hours

BMI vs. Sleep Hours

model_sleep <- lm(bmi ~ sleep_hrs, data = brfss_slr)

tidy(model_sleep, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "SLR: BMI ~ Hours of Sleep per Night",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
SLR: BMI ~ Hours of Sleep per Night
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 30.7419 0.534 57.5683 0.0000 29.6948 31.7890
sleep_hrs -0.2256 0.075 -3.0087 0.0026 -0.3726 -0.0786
b1_sleep <- coef(model_sleep)["sleep_hrs"]
r2_sleep <- summary(model_sleep)$r.squared

tibble(
  Metric = c("Slope (b₁)", "R²"),
  Value  = c(round(b1_sleep, 4), round(r2_sleep, 4))
) %>%
  kable(caption = "Sleep Model Key Statistics") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Sleep Model Key Statistics
Metric Value
Slope (b₁) -0.2256
0.0030

Interpretation: Each additional hour of sleep per night is associated with a change of -0.2256 kg/m² in BMI, on average. The direction of this association is negative (more sleep → lower BMI). The model explains 0.3% of variability in BMI. While statistically significant, the effect size is modest, underscoring the multifactorial nature of BMI.

par(mfrow = c(2, 2))
plot(model_sleep, which = 1:4,
     col = adjustcolor("purple", 0.4), pch = 19, cex = 0.6)

par(mfrow = c(1, 1))

10. Does BMI Really Decrease with Age? Adding a Quadratic Term

Our linear model estimated a negative slope for age: older adults have, on average, slightly lower BMI. But is that the full story? Cross-sectional data can show a decline at older ages due to survivorship bias — people with very high BMI may die before reaching old age, leaving a healthier-looking older sample. There may also be a genuine nonlinear pattern (BMI rises through middle age, then declines in later life).

We can test this by including an age² term in the model:

\[\widehat{\text{BMI}} = b_0 + b_1 \cdot \text{Age} + b_2 \cdot \text{Age}^2\]

This is still a linear regression model (linear in the coefficients), even though it is nonlinear in the predictor. It allows the slope to change across the range of age.

# Add age-squared term
brfss_slr <- brfss_slr %>%
  mutate(age2 = age^2)

# Fit quadratic model
model_quad <- lm(bmi ~ age + age2, data = brfss_slr)

tidy(model_quad, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 5))) %>%
  kable(
    caption = "Quadratic Model: BMI ~ Age + Age²",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Quadratic Model: BMI ~ Age + Age²
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 18.54178 1.08095 17.15329 0 16.42230 20.66125
age 0.47435 0.04418 10.73772 0 0.38773 0.56096
age2 -0.00464 0.00042 -11.02651 0 -0.00546 -0.00381
# Compare linear vs. quadratic model
tibble(
  Model       = c("Linear: BMI ~ Age", "Quadratic: BMI ~ Age + Age²"),
  R_squared   = c(
    round(summary(model_slr)$r.squared, 4),
    round(summary(model_quad)$r.squared, 4)
  ),
  Adj_R2      = c(
    round(summary(model_slr)$adj.r.squared, 4),
    round(summary(model_quad)$adj.r.squared, 4)
  ),
  AIC         = c(round(AIC(model_slr), 1), round(AIC(model_quad), 1))
) %>%
  kable(
    caption = "Model Comparison: Linear vs. Quadratic",
    col.names = c("Model", "R²", "Adj. R²", "AIC")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which.min(c(AIC(model_slr), AIC(model_quad))),
           bold = TRUE, background = "#d4edda")
Model Comparison: Linear vs. Quadratic
Model Adj. R² AIC
Linear: BMI ~ Age 0.0002 -0.0001 20203.2
Quadratic: BMI ~ Age + Age² 0.0392 0.0386 20085.9
# Generate predicted values from both models
age_seq <- data.frame(age = seq(18, 80, length.out = 300)) %>%
  mutate(age2 = age^2)

pred_linear <- predict(model_slr, newdata = age_seq)
pred_quad   <- predict(model_quad, newdata = age_seq)

pred_df <- age_seq %>%
  mutate(
    linear    = pred_linear,
    quadratic = pred_quad
  ) %>%
  pivot_longer(cols = c(linear, quadratic),
               names_to = "Model", values_to = "Predicted_BMI")

ggplot() +
  geom_point(data = brfss_slr, aes(x = age, y = bmi),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_line(data = pred_df, aes(x = age, y = Predicted_BMI, color = Model),
            linewidth = 1.3) +
  scale_color_manual(
    values = c("linear" = "red", "quadratic" = "darkorange"),
    labels = c("linear" = "Linear fit", "quadratic" = "Quadratic fit (Age + Age²)")
  ) +
  labs(
    title = "BMI vs. Age: Linear vs. Quadratic Model",
    subtitle = "Does BMI rise then fall with age, or decline monotonically?",
    x = "Age (years)",
    y = "BMI (kg/m²)",
    color = "Model"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")
Linear vs. Quadratic Fit: BMI ~ Age

Linear vs. Quadratic Fit: BMI ~ Age

Interpretation: If the coefficient on Age² is negative and statistically significant, the fitted curve is an inverted-U — BMI peaks at some middle age and declines thereafter. Extract the peak using \(\text{Age}^* = -b_1 / (2 b_2)\). A positive Age² coefficient would indicate a U-shape (BMI lowest in middle age).

b1_q <- coef(model_quad)["age"]
b2_q <- coef(model_quad)["age2"]

peak_age <- -b1_q / (2 * b2_q)

tibble(
  Quantity = c("b₁ (Age)", "b₂ (Age²)", "Peak / Trough Age (-b₁ / 2b₂)"),
  Value    = c(round(b1_q, 5), round(b2_q, 6), round(peak_age, 1))
) %>%
  kable(caption = "Quadratic Model Coefficients and Implied Turning Point") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Quadratic Model Coefficients and Implied Turning Point
Quantity Value
b₁ (Age) 0.474350
b₂ (Age²) -0.004635
Peak / Trough Age (-b₁ / 2b₂) 51.200000

Caution on interpretation: Even if the quadratic model fits better statistically, be cautious about causal interpretation. The cross-sectional pattern reflects cohort differences in BMI trajectories, not necessarily the aging process within any individual. Survivorship bias (heavier individuals dying earlier) can make the quadratic term appear significant in cross-sectional data.


11. Summary of Key Formulas

Quantity Formula
Slope \(b_1 = S_{XY} / S_{XX}\)
Intercept \(b_0 = \bar{Y} - b_1 \bar{X}\)
SSTotal \(\sum(Y_i - \bar{Y})^2\)
SSRegression \(\sum(\hat{Y}_i - \bar{Y})^2\)
SSResidual \(\sum(Y_i - \hat{Y}_i)^2\)
MSE \(SS_{Residual} / (n-2)\)
\(R^2\) \(SS_{Reg} / SS_{Total}\)
\(SE(b_1)\) \(\hat{\sigma}/\sqrt{S_{XX}}\)
t-statistic \(b_1 / SE(b_1)\)
95% CI for \(\beta_1\) \(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\)

Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_2020"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here("/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_lab_2020.rds"
))

Task 1: Explore the Variables (15 points)

# Load raw BRFSS 2020 data
brfss_lab <- read_rds(
  "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_lab_2020.rds"
)

# (a) Create a summary table of phys_days and bmi
brfss_lab %>%  
  select (bmi, phys_days) %>% 
  summary() %>%  
  kable(caption = "Descriptive Statistics: BMI and Number of days of poor physical health in the past 30 days") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Descriptive Statistics: BMI and Number of days of poor physical health in the past 30 days
bmi phys_days
Min. :14.91 Min. : 1.00
1st Qu.:24.41 1st Qu.: 2.00
Median :28.33 Median : 7.00
Mean :29.47 Mean :12.03
3rd Qu.:33.47 3rd Qu.:21.00
Max. :58.24 Max. :30.00
brfss_lab %>%
  select (bmi, phys_days) %>%
  summarise(
    n = n(),
    mean_bmi = mean(bmi, na.rm = TRUE),
    sd_bmi = sd(bmi, na.rm = TRUE),
    mean_phys_days = mean(phys_days, na.rm = TRUE),
    sd_phys_days = sd(phys_days, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
    kable(caption = "Mean and Standard Deviation: BMI and Number of days of poor physical health in the past 30 days") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Mean and Standard Deviation: BMI and Number of days of poor physical health in the past 30 days
n mean_bmi sd_bmi mean_phys_days sd_phys_days
3000 29.46902 6.869134 12.02967 11.20904
# (b) Create a histogram of phys_days — describe the distribution
phys_days_histogram <- ggplot(brfss_lab, aes(x = phys_days)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "forestgreen", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(brfss_lab$phys_days),
                            sd = sd(brfss_lab$phys_days)),
                color = "black", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Histogram of Number of days of poor physical health in the past 30 days",
    subtitle = "Red = kernel density | Black dashed = normal distribution",
    x = "Number of days of poor physical health in the past 30 days",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

phys_days_histogram

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
scatterplot_phy <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Number of days of poor physical health in the past 30 days vs BMI (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "Number of days of poor physical health in the past 30 days",
    y = "BMI (kg/m²)"
  ) +
  theme_minimal(base_size = 13)

ggplotly(scatterplot_phy)

Questions:

  1. What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

The mean of phys_days is 12.02967 and the standard deviation is 11.20904. The mean of BMI is 29.46902 and the standard deviation is 6.869134. The distribution of phys_days is not normal or symmetric. The data appears to be right skewed because of a long tail of values to the right.

  1. Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers?

The LOESS smoother (orange) tracks the linear fit (red) suggesting a linear relationship between BMI and poor physical health days. It does appear that there may be possible ouliers such as the dots near 60 on the BMI scale. —

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_slr_lab <- lm(phys_days ~ bmi, data = brfss_lab)
summary(model_slr_lab)
## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07
# (b and c) Display a tidy coefficient table with 95% CIs and slope, intercept, t-statistic, p-value
tidy(model_slr_lab, conf.int = TRUE) %>% 
  mutate(across(where(is.numeric), ~ round(., 4))) %>% 
  kable(
    caption = "Simple Linear Regression: Phys_days ~ BMI (BRFSS 2020)",
    col.names = c ("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% 
  row_spec(0, bold = TRUE)
Simple Linear Regression: Phys_days ~ BMI (BRFSS 2020)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 7.5193 0.8978 8.3751 0 5.7589 9.2796
bmi 0.1531 0.0297 5.1584 0 0.0949 0.2112
# Extract slope test statistics
slope_test_lab <- tidy(model_slr_lab, conf.int = TRUE) %>% filter(term == "bmi")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test_lab$estimate, 4),
    round(slope_test_lab$std.error, 4),
    round(slope_test_lab$statistic, 3),
    n - 2,
    format.pval(slope_test_lab$p.value, digits = 3),
    round(slope_test_lab$conf.low, 4),
    round(slope_test_lab$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
t-Test for the Slope (H₀: β₁ = 0)
Quantity Value
Slope (b₁) 0.1531
SE(b₁) 0.0297
t-statistic 5.158
Degrees of freedom 2998
p-value 2.65e-07
95% CI Lower 0.0949
95% CI Upper 0.2112

Questions:

  1. Write the fitted regression equation in the form \(\widehat{Physdays} = 7.5193 + 0.1531 \times\ BMI\).

  2. Interpret the slope (\(b_1\)) in context — what does it mean in plain English?

For each 1 unit increase in BMI, number of days of poor physical health is estimated to increase by 0.1531, on average, holding all else constant(though there are no other variables in this simple model).

  1. Is the intercept (\(b_0\)) interpretable in this context? Why or why not?

The current intercept would suggest that the estimated number of days of poor physical health when BMI = 0 is 7.5193. However, according to our values and the scale, BMI can not be 0, so I would suggest that the intercept is usable for our regression line but not an interpretation.

  1. Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.

Null Hypothesis (Ho): No association between BMI and Phy_days Alternative Hypothesis (H1): There is an association between BMI and Phys_days The association is statistically significant because the p-value is less than 0.001. The test statistic is 5.158. —

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_slr_lab <- anova(model_slr_lab)

anova_slr_lab %>% 
  kable(
    caption = "ANOVA Table: Phys_days ~ BMI", 
    digits = 3, 
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: Phys_days ~ BMI
Source Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 3314.969 3314.969 26.609 0
Residuals 2998 373487.391 124.579 NA NA
# (b) Compute and report SSTotal, SSRegression, and SSResidual

ss_total_var <- var(brfss_lab$bmi) * (nrow(brfss_lab) - 1)
print(ss_total_var)
## [1] 141507.8
ss_regression <- anova_slr_lab$"Sum Sq"[1]
print(ss_regression)
## [1] 3314.969
ss_residual <- sum(residuals(model_slr_lab)^2)
print(paste("SS Residual:", ss_residual))
## [1] "SS Residual: 373487.390574057"
results_ss <- data.frame(
  Metric = c("SS Regression", "SS Residual", "SS Total"),
  Value = c(ss_regression, ss_residual, ss_total_var)
)

knitr::kable(results_ss, digits = 2, caption = "Summary of Sum of Squares")
Summary of Sum of Squares
Metric Value
SS Regression 3314.97
SS Residual 373487.39
SS Total 141507.80
# Augment dataset with fitted values and residuals 
augment_lab <- augment(model_slr_lab)

n1 <- nrow(brfss_lab)
ss_residu <- sum(augment_lab$.resid^2)
mse_lab <- ss_residu / (n - 2)
sigma_hat_lab <- sqrt(mse_lab)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value = c(round(ss_residu, 2), round(mse_lab, 3), round(sigma_hat_lab, 3)),
  Units = c("", "", "kg/m²")
) %>% 
  kable(caption = "Model Error Estimates") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Error Estimates
Quantity Value Units
SS Residual 373487.390
MSE (σ̂²)
   124.57
# (c) Compute R² two ways: from the model object and from the SS decomposition

# Extract R-squared from model
r_sq <- summary(model_slr_lab)$r.squared
adj_r_sq <- summary(model_slr_lab)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R²
Metric Value
0.0088
Adjusted R² 0.0085
Variance Explained 0.88%
r_pearson <- cor(brfss_lab$bmi, brfss_lab$phys_days)
tibble(
  Qunatity = c("Pearson r", "r² (from Pearson)", "R²(from model)", "r² = R²?"),
  Value = c(
      round(r_pearson, 4),
      round(r_pearson^2, 4),
      round(r_sq,4),
      as.character(round(r_pearson^2, 4) == round(r_sq, 4))
  )
) %>% 
  kable(caption = "Pearson r vs. R² from Model") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Pearson r vs. R² from Model
Qunatity Value
Pearson r 0.0938
r² (from Pearson) 0.0088
R²(from model) 0.0088
r² = R²? TRUE

Questions:

  1. Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

  2. What is the \(R^2\) value? Interpret it in plain English.

The \(R^2\) value is 0.0088 and could be interpreted that 0.88% of the variation in poor physical health days is explained by BMI.

  1. What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?

BMI alone does not explain the variation in poor physical mental health days alone seeing as it can only explain less than 1%. Other factors could be gender, income, education level, employment, insurance, age, race, ethnicity, physical activity, or smoking usage. It is a variable that has countless different components that could affect it. —

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
BMI_25 <- data.frame(bmi = c(20, 25, 30))

BMI_25_ci_pred <- predict(model_slr_lab, newdata = BMI_25, interval = "confidence") %>% 
  as.data.frame() %>% 
  rename(Fitted = fit, CI_Lower = lwr, CI_upper = upr)

# (b) Calculate the 95% prediction interval for a person with BMI = 25

pi_pred_bmi <- predict(model_slr_lab, newdata = BMI_25, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)
  
results_table_lab <- bind_cols(BMI_25, BMI_25_ci_pred, pi_pred_bmi) %>% 
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table_lab %>% 
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Interval for BMI of 20, 25, 30", 
    col.names = c("BMI", "Fitted Phys_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% 
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))
Fitted Values, 95% Confidence Intervals, and Prediction Interval for BMI of 20, 25, 30
95% CI for Mean
95% PI for Individual
BMI Fitted Phys_days CI Lower CI Upper PI Lower PI Upper
20 10.58 9.90 11.26 -11.32 32.48
25 11.35 10.87 11.82 -10.54 33.24
30 12.11 11.71 12.51 -9.78 34.00
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(length.out = 200))

ci_band_lab <- predict(model_slr_lab, newdata = bmi_grid, interval = "confidence") %>% 
  as.data.frame() %>% 
  bind_cols(bmi_grid)

pi_band_lab <- predict(model_slr_lab, newdata = bmi_grid, interval = "prediction") %>% 
  as.data.frame() %>% 
  bind_cols(bmi_grid)

p_ci_pi_lab <- ggplot() +
  geom_point(data = brfss_lab, aes (x = bmi, y = phys_days),
             alpha = 0.10, color = "forestgreen", size = 1) + 
  geom_ribbon(data = pi_band_lab, aes(x = bmi, ymin = lwr, ymax = upr), fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band_lab, aes(x = bmi, ymin = lwr, ymax = upr), fill = "steelblue", alpha = 0.4) + 
  geom_line(data = ci_band_lab, aes(x = bmi, y = fit), color = "red", linewidth = 1.2) +
  geom_bin2d(bins = 5) +
  coord_cartesian(xlim = c(0, 60), ylim = c(0, 35)) +
  labs(
    title = "Simple Linear Regression: Bad Physical Health Days ~ BMI", 
    subtitle = "Dark band = 95% CI for mean response | Light Band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Number of Days or Poor Physical Health", 
    caption = "BRFSS 2020, n = 3,000"
  ) + 
  theme_minimal(base_size = 13)

p_ci_pi_lab

Questions:

  1. For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

The estimated mean number of physical health days is 11.35. The 95% confidence interval is [10.87, 11.82].

  1. If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

The 95% prediction interval is [10.54, 33.24].

  1. Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice?

The prediction interval is wider than the confidence interval because it has many more considerations regarding uncertainty since it is dealing with where an individual observation may lie. Confidence intervals focus on the mean. We would use prediction intervals when we are curious as to what individual values whereas confidence intervals are used to discover averages. —

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())

par(mfrow = c(2, 2))
plot(model_slr_lab, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

par(mfrow = c(1, 1))

# (b) Create a residuals vs. fitted plot using ggplot
p_resid_lab <- ggplot(augment_lab, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "BMI kg/m²",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_lab

# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq_lab <- ggplot(augment_lab, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq_lab

# (d) Create a Cook's distance plot
augment_lab2 <- augment_lab %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr_lab),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augment_lab2$cooks_d > 4 / n)

p_cooks_lab <- ggplot(augment_lab2, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, "potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks_lab

Questions:

  1. Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see.

There does not appear to be non-linearity or heteroscedasticity because the plot has a rather straight line around zero without a fan or parabola shape appearing. The dots appear to be random.

  1. Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days?

The residuals do not appear to be approximately normal because the plot shows an S shape. This suggests that the distribution of phys_days has has heavy tails.

  1. Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them?

It does look like there are influential observations. There appears to be about 50 of them, and I would either replace the outliers with missing values, use imputation, or cap the outliers at 1.5 IQR limits so that they would not have as much of an effect.

  1. Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.)

I think the line assumptions struggle to be met. Linearity is met because the residuals and fitted appears to be clustered around 0. However normality and equal variance will be problematic because of how highly skewed the data suggest to be. Also independence may be questionable because BRFSS surveys may not be fully independent. —

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_slr_phys <- lm(phys_days ~ age, data = brfss_lab)

# (b) Display results and compare to the BMI model
summary(model_slr_phys)
## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16
# Tidy coefficient table
tidy(model_slr_phys, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Phys_days ~ BMI (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: Phys_days ~ BMI (BRFSS 2020)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 4.3359 0.6826 6.3521 0 2.9975 5.6743
age 0.1377 0.0117 11.7893 0 0.1148 0.1606
# (c) Which predictor has the stronger association? Compare R² values.
r_sq_extra <- summary(model_slr_phys)$r.squared
adj_r_sq_extra <- summary(model_slr_phys)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq_extra, 4),
    round(adj_r_sq_extra, 4),
    paste0(round(r_sq_extra * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R² for Phys_days and Age") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R² for Phys_days and Age
Metric Value
0.0443
Adjusted R² 0.044
Variance Explained 4.43%
r_sq_lab <- summary(model_slr_lab)$r.squared
adj_r_sq_lab <- summary(model_slr_lab)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq_lab, 4),
    round(adj_r_sq_lab, 4),
    paste0(round(r_sq_lab * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R² for Phys_days and BMI") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R² for Phys_days and BMI
Metric Value
0.0088
Adjusted R² 0.0085
Variance Explained 0.88%

Questions:

  1. How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

Both of the associations are statistically significant (p < 0.001) and positive in direction. The relationship between age and poor physical health days (estimate = 0.1377) has a smaller magnitude than the BMI association with poor physical health days (estimate = 0.1531).

  1. Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days?

Age appears to explain more variability in phys_days with 4.43% variance explained while BMI only explains 0.88%.

  1. Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

My overall conclusion about predictors of poor physical health days is that each predictor can only explain a small amount of variability because there are so many different integrated factors that influence poor physical health day. From these models, age appears to be able to explain the variation more than BMI can. Limitations of using simple linear regression for poor physical health days is that there appear to be a few outliers that could be influencing the relationship and it may not be able to fully understand the complex nature of poor physical health. —

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.


Session Info

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  ggeffects_2.3.2  broom_1.0.11     plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       here_1.0.2       haven_2.5.5     
##  [9] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.2.0     
## [13] purrr_1.2.1      readr_2.1.6      tidyr_1.3.2      tibble_3.3.1    
## [17] ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] insight_1.4.6      lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2   
##  [9] vctrs_0.7.1        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
## [13] Matrix_1.7-4       data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1          
## [17] gt_1.3.0           lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2      
## [21] textshaping_1.0.4  janitor_2.2.1      snakecase_0.11.1   litedown_0.9      
## [25] htmltools_0.5.9    sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2    
## [29] pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0       nlme_3.1-168      
## [33] commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [37] labeling_0.4.3     splines_4.5.2      rprojroot_2.1.1    fastmap_1.2.0     
## [41] grid_4.5.2         cli_3.6.5          magrittr_2.0.4     cards_0.7.1       
## [45] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [49] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.4         
## [53] evaluate_1.0.5     viridisLite_0.4.2  mgcv_1.9-3         markdown_2.0      
## [57] rlang_1.1.7        glue_1.8.0         xml2_1.5.2         svglite_2.2.2     
## [61] rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1 
## [65] fs_1.6.6