Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:
In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.
brfss_slr_2020 %>%
select(bmi, age, sleep_hrs, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| bmi | age | sleep_hrs | phys_days | |
|---|---|---|---|---|
| Min. :14.63 | Min. :18.00 | Min. : 1.000 | Min. : 1.00 | |
| 1st Qu.:24.32 | 1st Qu.:43.00 | 1st Qu.: 6.000 | 1st Qu.: 2.00 | |
| Median :27.89 | Median :58.00 | Median : 7.000 | Median : 6.00 | |
| Mean :29.18 | Mean :55.52 | Mean : 6.915 | Mean :11.66 | |
| 3rd Qu.:32.89 | 3rd Qu.:70.00 | 3rd Qu.: 8.000 | 3rd Qu.:20.00 | |
| Max. :59.60 | Max. :80.00 | Max. :20.000 | Max. :30.00 |
brfss_slr_2020 %>%
select(bmi, age, sleep_hrs, sex, education) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m²)",
age ~ "Age (years)",
sleep_hrs ~ "Sleep (hours/night)",
sex ~ "Sex",
education ~ "Education"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")| Characteristic | N | N = 3,0001 |
|---|---|---|
| BMI (kg/m²) | 3,000 | 29.2 (7.0) |
| Age (years) | 3,000 | 55.5 (17.4) |
| Sleep (hours/night) | 3,000 | 6.9 (1.7) |
| Sex | 3,000 | |
| Female | 1,701 (57%) | |
| Male | 1,299 (43%) | |
| Education | 3,000 | |
| < High school | 237 (7.9%) | |
| High school graduate | 796 (27%) | |
| Some college | 937 (31%) | |
| College graduate | 1,030 (34%) | |
| 1 Mean (SD); n (%) | ||
Simple linear regression models the mean of a continuous outcome \(Y\) as a linear function of a single predictor \(X\):
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n\]
Where:
| Symbol | Name | Meaning |
|---|---|---|
| \(Y_i\) | Response / Outcome | Observed value for subject \(i\) (e.g., BMI) |
| \(X_i\) | Predictor / Covariate | Observed predictor for subject \(i\) (e.g., age) |
| \(\beta_0\) | Intercept | Expected value of \(Y\) when \(X = 0\) |
| \(\beta_1\) | Slope | Expected change in \(Y\) for a 1-unit increase in \(X\) |
| \(\varepsilon_i\) | Error term | Random deviation of \(Y_i\) from the regression line |
The population regression line (also called the true or theoretical regression line) describes the expected (mean) value of \(Y\) at each value of \(X\):
\[E(Y \mid X) = \mu_{Y|X} = \beta_0 + \beta_1 X\]
| Population | Sample | |
|---|---|---|
| Line | \(\beta_0 + \beta_1 X\) | \(\hat{y} = b_0 + b_1 X\) |
| Intercept | \(\beta_0\) (parameter) | \(b_0\) or \(\hat{\beta}_0\) (estimate) |
| Slope | \(\beta_1\) (parameter) | \(b_1\) or \(\hat{\beta}_1\) (estimate) |
| Error | \(\varepsilon_i\) | \(e_i = Y_i - \hat{Y}_i\) (residual) |
We use our sample to estimate the population parameters. The estimates \(b_0\) and \(b_1\) define the fitted regression line.
Before fitting any model, always visualize the bivariate relationship.
p_scatter <- ggplot(brfss_slr_2020, aes(x = age, y = bmi)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "BMI vs. Age (BRFSS 2020)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)BMI vs. Age — BRFSS 2020
Interpretation tip: The LOESS smoother (orange) follows the data without assuming linearity. When it closely tracks the linear fit (red), a linear model is reasonable. Departures suggest nonlinearity.
A useful mnemonic is LINE:
| Letter | Assumption | Description |
|---|---|---|
| L | Linearity | The relationship between \(X\) and \(E(Y)\) is linear |
| I | Independence | Observations are independent of one another |
| N | Normality | Errors \(\varepsilon_i\) are normally distributed |
| E | Equal variance | Errors have constant variance (homoscedasticity): \(\text{Var}(\varepsilon_i) = \sigma^2\) |
Formally, we assume:
\[\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
This means that for any value of \(X\), the distribution of \(Y\) is:
\[Y \mid X \sim N(\beta_0 + \beta_1 X, \; \sigma^2)\]
Note on independence: In cross-sectional survey data like BRFSS, observations from the same household or geographic cluster may not be fully independent. We acknowledge this limitation but proceed with the standard SLR framework for pedagogical purposes.
We estimate \(\beta_0\) and \(\beta_1\) by finding the values \(b_0\) and \(b_1\) that minimize the sum of squared residuals (SSR):
\[SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2\]
This is called the Ordinary Least Squares (OLS) criterion. Minimizing SSR yields the closed-form solutions:
\[b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]
\[b_0 = \bar{Y} - b_1 \bar{X}\]
where \(\bar{X}\) and \(\bar{Y}\) are the sample means of \(X\) and \(Y\).
Gauss-Markov Theorem: Under the LINE assumptions, OLS estimators are the Best Linear Unbiased Estimators (BLUE) — they have the smallest variance among all linear unbiased estimators.
# Fit simple linear regression: BMI ~ Age
model_slr <- lm(bmi ~ age, data = brfss_slr_2020)
# Summary output
summary(model_slr)##
## Call:
## lm(formula = bmi ~ age, data = brfss_slr_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.633 -4.883 -1.325 3.688 30.340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.528231 0.427507 69.071 <2e-16 ***
## age -0.006238 0.007347 -0.849 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.012 on 2998 degrees of freedom
## Multiple R-squared: 0.0002404, Adjusted R-squared: -9.312e-05
## F-statistic: 0.7208 on 1 and 2998 DF, p-value: 0.396
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: BMI ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 29.5282 | 0.4275 | 69.0708 | 0.000 | 28.6900 | 30.3665 |
| age | -0.0062 | 0.0073 | -0.8490 | 0.396 | -0.0206 | 0.0082 |
Fitted regression equation:
\[\widehat{\text{BMI}} = 29.528 + -0.0062 \times \text{Age}\]
Intercept (\(b_0 = 29.528\)): The estimated mean BMI when age = 0. This is a mathematical artifact — a newborn does not have an adult BMI. The intercept is not directly interpretable in this context, but is necessary to anchor the line.
Slope (\(b_1 = -0.0062\)): For each 1-year increase in age, BMI is estimated to decrease by 0.0062 kg/m², on average, holding all else constant (though there are no other variables in this simple model).
Practical significance vs. statistical significance: Even a small slope can be highly statistically significant with a large sample. Always consider whether the magnitude is meaningful in the real world.
# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)
# Show a sample of fitted values and residuals
augmented %>%
select(bmi, age, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed BMI (Y)", "Age (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Observed BMI (Y) | Age (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 26.58 | 67 | 29.110 | -2.530 |
| 33.47 | 38 | 29.291 | 4.179 |
| 35.15 | 78 | 29.042 | 6.108 |
| 30.42 | 65 | 29.123 | 1.297 |
| 22.67 | 55 | 29.185 | -6.515 |
| 30.11 | 80 | 29.029 | 1.081 |
| 35.43 | 34 | 29.316 | 6.114 |
| 31.58 | 71 | 29.085 | 2.495 |
| 28.13 | 55 | 29.185 | -1.055 |
| 34.01 | 62 | 29.141 | 4.869 |
# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)
p_resid <- ggplot(resid_sample, aes(x = age, y = bmi)) +
geom_segment(aes(xend = age, yend = .fitted),
color = "tomato", alpha = 0.5, linewidth = 0.5) +
geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
labs(
title = "Residuals Illustrated on the Regression Line",
subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_residVisualizing Residuals on the Regression Line
The total variability in \(Y\) can be decomposed into two parts:
\[\underbrace{SS_{Total}}_{Total\ variability} = \underbrace{SS_{Regression}}_{Explained\ by\ X} + \underbrace{SS_{Residual}}_{Unexplained}\]
Where:
\[SS_{Total} = \sum(Y_i - \bar{Y})^2 \qquad (df = n-1)\] \[SS_{Regression} = \sum(\hat{Y}_i - \bar{Y})^2 \qquad (df = 1)\] \[SS_{Residual} = \sum(Y_i - \hat{Y}_i)^2 \qquad (df = n-2)\]
# ANOVA decomposition
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: BMI ~ Age",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| age | 1 | 35.438 | 35.438 | 0.721 | 0.396 |
| Residuals | 2998 | 147400.214 | 49.166 | NA | NA |
The Mean Squared Error estimates the variance of the error term:
\[MSE = \frac{SS_{Residual}}{n - 2} = \hat{\sigma}^2\]
The Residual Standard Error \(\hat{\sigma} = \sqrt{MSE}\) is in the same units as \(Y\) and tells us the typical prediction error of the model.
n <- nrow(brfss_slr_2020)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
Units = c("", "", "kg/m²")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value | Units |
|---|---|---|
| SS Residual | 147400.210 | |
| MSE (σ̂²) |
49.16
|
Interpretation: On average, our model’s predictions are off by about 7.01 BMI units.
\(R^2\) measures the proportion of total variability in \(Y\) explained by the linear regression on \(X\):
\[R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\]
\(R^2\) ranges from 0 to 1:
# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 2e-04 |
| Adjusted R² | -1e-04 |
| Variance Explained | 0.02% |
For simple linear regression:
\[R^2 = r^2\]
where \(r\) is the Pearson correlation coefficient between \(X\) and \(Y\).
r_pearson <- cor(brfss_slr_2020$age, brfss_slr_2020$bmi)
tibble(
Quantity = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
Value = c(
round(r_pearson, 4),
round(r_pearson^2, 4),
round(r_sq, 4),
as.character(round(r_pearson^2, 4) == round(r_sq, 4))
)
) %>%
kable(caption = "Pearson r vs. R² from Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Pearson r | -0.0155 |
| r² (from Pearson) | 2e-04 |
| R² (from model) | 2e-04 |
| r² = R²? | TRUE |
Important caveat: A low \(R^2\) does not mean the regression is useless. In epidemiology, outcomes are influenced by many unmeasured factors, so \(R^2\) values of 0.05–0.20 can still yield scientifically meaningful and statistically significant estimates.
The most important hypothesis test in SLR is:
\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between X and Y)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]
Test statistic:
\[t = \frac{b_1 - 0}{SE(b_1)} \sim t_{n-2} \quad \text{under } H_0\]
Where:
\[SE(b_1) = \frac{\hat{\sigma}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{\hat{\sigma}}{\sqrt{S_{XX}}}\]
# Extract slope test statistics
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "age")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Slope (b₁) | -0.0062 |
| SE(b₁) | 0.0073 |
| t-statistic | -0.849 |
| Degrees of freedom | 2998 |
| p-value | 0.396 |
| 95% CI Lower | -0.0206 |
| 95% CI Upper | 0.0082 |
Decision: With p = 0.396, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is statistically significant evidence of a linear association between age and BMI.
The F-test evaluates whether the overall model (i.e., all predictors together) explains a statistically significant portion of the variability in \(Y\). For simple linear regression with one predictor, the F-test is equivalent to the t-test for the slope (\(F = t^2\)).
\[F = \frac{MS_{Regression}}{MS_{Residual}} \sim F_{1,\, n-2} \quad \text{under } H_0\]
f_stat <- summary(model_slr)$fstatistic
f_value <- f_stat[1]
df1 <- f_stat[2]
df2 <- f_stat[3]
p_f <- pf(f_value, df1, df2, lower.tail = FALSE)
tibble(
Quantity = c("F-statistic", "df (numerator)", "df (denominator)",
"p-value", "Verification: t²", "Verification: F"),
Value = c(
round(f_value, 3),
df1,
df2,
format.pval(p_f, digits = 3),
round(slope_test$statistic^2, 3),
round(f_value, 3)
)
) %>%
kable(caption = "F-Test for Overall Model Significance") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| F-statistic | 0.721 |
| df (numerator) | 1 |
| df (denominator) | 2998 |
| p-value | 0.396 |
| Verification: t² | 0.721 |
| Verification: F | 0.721 |
A 95% CI for \(\beta_1\) is:
\[b_1 \pm t_{n-2, \, 0.025} \times SE(b_1)\]
t_crit <- qt(0.975, df = n - 2)
ci_lower <- slope_test$estimate - t_crit * slope_test$std.error
ci_upper <- slope_test$estimate + t_crit * slope_test$std.error
tibble(
Bound = c("95% CI Lower", "95% CI Upper"),
Value = c(round(ci_lower, 4), round(ci_upper, 4)),
Units = c("kg/m² per year", "kg/m² per year")
) %>%
kable(caption = "95% Confidence Interval for β₁ (manually computed)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Bound | Value | Units |
|---|---|---|
| 95% CI Lower | -0.0206 | kg/m² per year |
| 95% CI Upper | 0.0082 | kg/m² per year |
A confidence interval for the mean response \(E(Y \mid X = x^*)\) gives a range of plausible values for the population mean of \(Y\) at a specific value \(x^*\):
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE(\hat{Y}^*)\]
Where:
\[SE(\hat{Y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
A prediction interval gives a range for a single new observation \(Y^*_{new}\) at \(X = x^*\). It is always wider than the confidence interval because it accounts for both the uncertainty in \(E(Y)\) and the individual variability around the mean:
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE_{pred}\]
Where:
\[SE_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
# Compute CI and PI at specific age values
new_ages <- data.frame(age = c(25, 35, 45, 55, 65, 75))
ci_pred <- predict(model_slr, newdata = new_ages, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
pi_pred <- predict(model_slr, newdata = new_ages, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_ages, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
col.names = c("Age", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| Age | Fitted BMI | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 29.37 | 28.87 | 29.88 | 15.61 | 43.13 |
| 35 | 29.31 | 28.92 | 29.70 | 15.56 | 43.06 |
| 45 | 29.25 | 28.95 | 29.54 | 15.50 | 43.00 |
| 55 | 29.19 | 28.93 | 29.44 | 15.43 | 42.94 |
| 65 | 29.12 | 28.84 | 29.41 | 15.37 | 42.87 |
| 75 | 29.06 | 28.68 | 29.44 | 15.31 | 42.81 |
# Generate CI and PI across the full age range
age_grid <- data.frame(age = seq(18, 80, length.out = 200))
ci_band <- predict(model_slr, newdata = age_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(age_grid)
pi_band <- predict(model_slr, newdata = age_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(age_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_slr_2020, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = age, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: BMI ~ Age",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "Age (years)",
y = "BMI (kg/m²)",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_piRegression Line with 95% Confidence and Prediction Intervals
Key distinction: If you want to estimate the average BMI for all 45-year-olds in the population, use the confidence interval. If you want to predict the BMI of a specific new 45-year-old patient, use the prediction interval.
Fitting a regression model is not enough — we must verify that the LINE assumptions are reasonably met. We do this through residual diagnostics.
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)Standard Regression Diagnostic Plots
Interpreting each plot:
1. Residuals vs. Fitted: Checks linearity and equal variance. We want a horizontal red line and random scatter with no pattern. A “fan shape” (spread increasing with fitted values) indicates heteroscedasticity.
2. Normal Q-Q Plot: Checks normality of residuals. Points should fall approximately along the 45° reference line. Heavy tails or S-curves suggest non-normality.
3. Scale-Location (Spread-Location): Another check for equal variance (homoscedasticity). The square root of standardized residuals is plotted against fitted values. A flat line indicates constant variance.
4. Residuals vs. Leverage: Identifies influential observations using Cook’s distance. Points in the upper or lower right corner (beyond the dashed lines) have high influence.
p_resid_x <- ggplot(augmented, aes(x = age, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Age",
subtitle = "Should show no pattern — random scatter around zero",
x = "Age (years)",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_xResiduals vs. Age — Checking Linearity
p_hist <- ggplot(augmented, aes(x = .resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(augmented$.resid),
sd = sd(augmented$.resid)),
color = "black", linetype = "dashed", linewidth = 1) +
labs(
title = "Distribution of Residuals",
subtitle = "Red = kernel density | Black dashed = normal distribution",
x = "Residuals",
y = "Density"
) +
theme_minimal(base_size = 13)
p_histDistribution of Residuals
# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qqNormal Q-Q Plot of Residuals
# Cook's distance
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksCook’s Distance: Identifying Influential Observations
To reinforce the concepts, let’s fit a second SLR model examining the association between hours of sleep and BMI.
p_sleep <- ggplot(brfss_slr_2020, aes(x = sleep_hrs, y = bmi)) +
geom_jitter(alpha = 0.15, color = "purple", width = 0.15, height = 0) +
geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
labs(
title = "BMI vs. Nightly Sleep Hours (BRFSS 2020)",
x = "Average Hours of Sleep per Night",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_sleepBMI vs. Sleep Hours
model_sleep <- lm(bmi ~ sleep_hrs, data = brfss_slr_2020)
tidy(model_sleep, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "SLR: BMI ~ Hours of Sleep per Night",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 30.7419 | 0.534 | 57.5683 | 0.0000 | 29.6948 | 31.7890 |
| sleep_hrs | -0.2256 | 0.075 | -3.0087 | 0.0026 | -0.3726 | -0.0786 |
b1_sleep <- coef(model_sleep)["sleep_hrs"]
r2_sleep <- summary(model_sleep)$r.squared
tibble(
Metric = c("Slope (b₁)", "R²"),
Value = c(round(b1_sleep, 4), round(r2_sleep, 4))
) %>%
kable(caption = "Sleep Model Key Statistics") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Slope (b₁) | -0.2256 |
| R² | 0.0030 |
Interpretation: Each additional hour of sleep per night is associated with a change of -0.2256 kg/m² in BMI, on average. The direction of this association is negative (more sleep → lower BMI). The model explains 0.3% of variability in BMI. While statistically significant, the effect size is modest, underscoring the multifactorial nature of BMI.
par(mfrow = c(2, 2))
plot(model_sleep, which = 1:4,
col = adjustcolor("purple", 0.4), pch = 19, cex = 0.6)Our linear model estimated a negative slope for age: older adults have, on average, slightly lower BMI. But is that the full story? Cross-sectional data can show a decline at older ages due to survivorship bias — people with very high BMI may die before reaching old age, leaving a healthier-looking older sample. There may also be a genuine nonlinear pattern (BMI rises through middle age, then declines in later life).
We can test this by including an age² term in the model:
\[\widehat{\text{BMI}} = b_0 + b_1 \cdot \text{Age} + b_2 \cdot \text{Age}^2\]
This is still a linear regression model (linear in the coefficients), even though it is nonlinear in the predictor. It allows the slope to change across the range of age.
# Add age-squared term
brfss_slr_2020 <- brfss_slr_2020 %>%
mutate(age2 = age^2)
# Fit quadratic model
model_quad <- lm(bmi ~ age + age2, data = brfss_slr_2020)
tidy(model_quad, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 5))) %>%
kable(
caption = "Quadratic Model: BMI ~ Age + Age²",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 18.54178 | 1.08095 | 17.15329 | 0 | 16.42230 | 20.66125 |
| age | 0.47435 | 0.04418 | 10.73772 | 0 | 0.38773 | 0.56096 |
| age2 | -0.00464 | 0.00042 | -11.02651 | 0 | -0.00546 | -0.00381 |
# Compare linear vs. quadratic model
tibble(
Model = c("Linear: BMI ~ Age", "Quadratic: BMI ~ Age + Age²"),
R_squared = c(
round(summary(model_slr)$r.squared, 4),
round(summary(model_quad)$r.squared, 4)
),
Adj_R2 = c(
round(summary(model_slr)$adj.r.squared, 4),
round(summary(model_quad)$adj.r.squared, 4)
),
AIC = c(round(AIC(model_slr), 1), round(AIC(model_quad), 1))
) %>%
kable(
caption = "Model Comparison: Linear vs. Quadratic",
col.names = c("Model", "R²", "Adj. R²", "AIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.min(c(AIC(model_slr), AIC(model_quad))),
bold = TRUE, background = "#d4edda")| Model | R² | Adj. R² | AIC |
|---|---|---|---|
| Linear: BMI ~ Age | 0.0002 | -0.0001 | 20203.2 |
| Quadratic: BMI ~ Age + Age² | 0.0392 | 0.0386 | 20085.9 |
# Generate predicted values from both models
age_seq <- data.frame(age = seq(18, 80, length.out = 300)) %>%
mutate(age2 = age^2)
pred_linear <- predict(model_slr, newdata = age_seq)
pred_quad <- predict(model_quad, newdata = age_seq)
pred_df <- age_seq %>%
mutate(
linear = pred_linear,
quadratic = pred_quad
) %>%
pivot_longer(cols = c(linear, quadratic),
names_to = "Model", values_to = "Predicted_BMI")
ggplot() +
geom_point(data = brfss_slr_2020, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_line(data = pred_df, aes(x = age, y = Predicted_BMI, color = Model),
linewidth = 1.3) +
scale_color_manual(
values = c("linear" = "red", "quadratic" = "darkorange"),
labels = c("linear" = "Linear fit", "quadratic" = "Quadratic fit (Age + Age²)")
) +
labs(
title = "BMI vs. Age: Linear vs. Quadratic Model",
subtitle = "Does BMI rise then fall with age, or decline monotonically?",
x = "Age (years)",
y = "BMI (kg/m²)",
color = "Model"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")Linear vs. Quadratic Fit: BMI ~ Age
Interpretation: If the coefficient on Age² is negative and statistically significant, the fitted curve is an inverted-U — BMI peaks at some middle age and declines thereafter. Extract the peak using \(\text{Age}^* = -b_1 / (2 b_2)\). A positive Age² coefficient would indicate a U-shape (BMI lowest in middle age).
b1_q <- coef(model_quad)["age"]
b2_q <- coef(model_quad)["age2"]
peak_age <- -b1_q / (2 * b2_q)
tibble(
Quantity = c("b₁ (Age)", "b₂ (Age²)", "Peak / Trough Age (-b₁ / 2b₂)"),
Value = c(round(b1_q, 5), round(b2_q, 6), round(peak_age, 1))
) %>%
kable(caption = "Quadratic Model Coefficients and Implied Turning Point") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| b₁ (Age) | 0.474350 |
| b₂ (Age²) | -0.004635 |
| Peak / Trough Age (-b₁ / 2b₂) | 51.200000 |
Caution on interpretation: Even if the quadratic model fits better statistically, be cautious about causal interpretation. The cross-sectional pattern reflects cohort differences in BMI trajectories, not necessarily the aging process within any individual. Survivorship bias (heavier individuals dying earlier) can make the quadratic term appear significant in cross-sectional data.
| Quantity | Formula |
|---|---|
| Slope | \(b_1 = S_{XY} / S_{XX}\) |
| Intercept | \(b_0 = \bar{Y} - b_1 \bar{X}\) |
| SSTotal | \(\sum(Y_i - \bar{Y})^2\) |
| SSRegression | \(\sum(\hat{Y}_i - \bar{Y})^2\) |
| SSResidual | \(\sum(Y_i - \hat{Y}_i)^2\) |
| MSE | \(SS_{Residual} / (n-2)\) |
| \(R^2\) | \(SS_{Reg} / SS_{Total}\) |
| \(SE(b_1)\) | \(\hat{\sigma}/\sqrt{S_{XX}}\) |
| t-statistic | \(b_1 / SE(b_1)\) |
| 95% CI for \(\beta_1\) | \(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\) |
In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)# (a) Create a summary table of phys_days and bmi
summary_table <- data.frame(
Variable = c("phys_days", "bmi"),
Mean = c(mean(brfss_slr_2020$phys_days),
mean(brfss_slr_2020$bmi)),
SD = c(sd(brfss_slr_2020$phys_days),
sd(brfss_slr_2020$bmi)),
Median = c(median(brfss_slr_2020$phys_days),
median(brfss_slr_2020$bmi)),
Min = c(min(brfss_slr_2020$phys_days),
min(brfss_slr_2020$bmi)),
Max = c(max(brfss_slr_2020$phys_days),
max(brfss_slr_2020$bmi))
)
summary_table## Variable Mean SD Median Min Max
## 1 phys_days 11.65800 11.160073 6.00 1.00 30.0
## 2 bmi 29.18194 7.011534 27.89 14.63 59.6
# (b) Create a histogram of phys_days — describe the distribution
hist(brfss_slr_2020$phys_days,
main = "Histogram: Number of Poor Physical Health Days",
xlab = "Poor Physical Health (Days)",
col = "lightblue")# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_slr_2020, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "BMI vs. Poor Physical Health Days (BRFSS 2020)",
subtitle = "ed = Linear fit | Orange dashed = LOESS smoother",
x = "BMI (kg/m²)",
y = "Poor Physical Health (Days)"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?The mean number of poor physical health days (phys_days) is 11.66 days with a standard deviation of 11.16 days. The mean BMI 29.18, with a standard deviation of 7.01. The distribution of phys_days appears to be mostly right-skewed. The mean being 11.66 is higher than the median of 6, suggesting that a smaller number of individuals report very high numbers of active days, pulling the mean upward.
# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr_2020)
# Summary output
summary(model_slr)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ BMI (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
# Slope: 0.1451
# Intercept: 7.4228
# t-statistic: 5.013
# p-value < 0.001: 0
b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)Questions:
Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). \(\hat{\text{phys_days}} = 7.42 + 0.145 {BMI}\)
Interpret the slope (\(b_1\)) in context — what does it mean in plain English?
For each 1-unit increase in BMI, the expected number of poor physical health days increases by 0.145 days on average, holding all else constant.
The intercept (\(b_0\)) = 7.42 represents the predicted number of poor physical health days when BMI = 0. No, this is not interpretable in this context because it is not realistic or possible for adults in this population to have a BMI of 0.
Yes, the association is statistically significant at = 0.05. Null hypothesis (H0): b_1 = 0, BMI is not associated with poor physical health days. Alternative hypothesis (HA): b_1 , BMI is associated with poor physical health days. The test statistic for the slope is t = 5.0134, and the p-value < 0.001.
Because the p-value is less than 0.05, we reject the null hypothesis. Therefore, there is a statistically significant association between BMI and poor physical health days.
# (a) Display the ANOVA table
model_slr <- lm(phys_days ~ bmi, data = brfss_slr_2020)
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: phys_days ~ bmi",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
SSRegression <- anova_slr$`Sum Sq`[1]
SSResidual <- anova_slr$`Sum Sq`[2]
SSTotal <- SSRegression + SSResidual
tibble(
Quantity = c("SS Total", "SS Regression", "SS Residual"),
Value = round(c(SSTotal, SSRegression, SSResidual), 2)
)## # A tibble: 3 × 2
## Quantity Value
## <chr> <dbl>
## 1 SS Total 373517.
## 2 SS Regression 3105.
## 3 SS Residual 370412.
# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.0083 |
| Adjusted R² | 0.008 |
| Variance Explained | 0.83% |
# R Pearson
r_pearson <- cor(brfss_slr_2020$bmi, brfss_slr_2020$phys_days)
tibble(
Quantity = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
Value = c(
round(r_pearson, 4),
round(r_pearson^2, 4),
round(r_sq, 4),
as.character(round(r_pearson^2, 4) == round(r_sq, 4))
)
) %>%
kable(caption = "Pearson r vs. R² from Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Pearson r | 0.0912 |
| r² (from Pearson) | 0.0083 |
| R² (from model) | 0.0083 |
| r² = R²? | TRUE |
Questions:
\(SS_{373517.11}\), \(SS_{3105.36}\), \(SS_{370411.74}\), \(df\), and \(F\)-statistic.
The R² value = 0.0083. This means that only about 0.83% of the variation in poor physical health days is explained by BMI in this simple linear regression model.
BMI alone is a very weak predictor of the number of poor physical health days in this dataset. Also, this means most of the variation is explained by other factors or random variation.
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = 25)
ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
# (c) Plot the regression line with both the CI band and PI band
results_table %>%
kable(
caption = "Fitted BMI Value, 95% Confidence Interval, and 95% Prediction Interval for BMI = 25",
col.names = c("Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitted BMI | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|
| 11.05 | 10.59 | 11.51 | -10.75 | 32.85 |
Questions:
For someone with a BMI of 25, the esimated mean number of poor physical health days is 11.05. The 95% confidence interval for this mean is 10.59 to 11.51 days. This means we are 95% confident that the average number of poor health days for all people with BMI = 25 falls within this range.
For a specific new person with BMI of 25, the 95% prediction interval is -10.75 to 32.85 days.
The prediction interval is wider because it accounts for both the uncertainty in estimating the mean and the individual variability around that mean. The confidence interval only reflects the uncertainty in estimating the mean. You would use the confidence interval when you want to know about the average outcome for a group of people with a given BMI and you would use the prediction interval when predicting an outcome for a single individual.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)par(mfrow = c(1, 1))
# (b) Create a residuals vs. fitted plot using ggplot
ggplot(augmented, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 13)# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq# (d) Create a Cook's distance plot
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksQuestions:
There is a slightly curved red line, but there is random scatter with no pattern. There is also no “fan shape” (spread increasing with fitted values). Both of these indicate there is no evidence of nonlinearity or heteroscedasticity.
phys_days?Points fall approximately along the 45° reference line. There is slight S-curves suggesting non-normality. In this context, it suggests that the distribution of phys_days is not perfectly normal, maybe due to people reporting very high or very low numbers of poor physical health days.
There are 3 influential observations based on Cook’s distance (>4/n). Since these points have a disproportionate effect on the regression estimates, I would investigate them to ensure they are not data errors. If they are valid, I could perform a sensitivity analysis by refitting the model without these points to see if the results change substantially. Since they represent real variation, I would not automatically remove them from the dataset.
Overall, the LINE assumptions mostly appear to be met. The most problematic assumptions are probably normality and, to a lesser extent, linearity. Since phys_days is a count variable with possible extreme values, standard linear regression may not fully capture its distribution. A generalized linear model (e.g., Poisson or negative binomial regression) could better account for this.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_slr_age <- lm(phys_days ~ age, data = brfss_slr_2020)
summary(model_slr_age)##
## Call:
## lm(formula = phys_days ~ age, data = brfss_slr_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.872 -8.803 -4.733 9.460 23.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.37023 0.66608 6.561 6.27e-11 ***
## age 0.13127 0.01145 11.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared: 0.04202, Adjusted R-squared: 0.0417
## F-statistic: 131.5 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
tidy(model_slr_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
# (c) Which predictor has the stronger association? Compare R² values.
r_sq <- summary(model_slr_age)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.042 |
| Adjusted R² | 0.008 |
| Variance Explained | 4.2% |
# R Pearson
r_pearson <- cor(brfss_slr_2020$age, brfss_slr_2020$phys_days)
tibble(
Quantity = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
Value = c(
round(r_pearson, 4),
round(r_pearson^2, 4),
round(r_sq, 4),
as.character(round(r_pearson^2, 4) == round(r_sq, 4))
)
) %>%
kable(caption = "Pearson r vs. R² from Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Pearson r | 0.205 |
| r² (from Pearson) | 0.042 |
| R² (from model) | 0.042 |
| r² = R²? | TRUE |
Questions:
In terms of direction, both predictors have a positive association with poor physical health days, as either age or BMI increases, the number of poor health days increases. In terms of magnitude, the slope for BMI (0.1451) is slightly higher than that for age (0.1313), meaning each 1-unit increase in BMI is associated with a slightly larger increase in poor health days than each 1-year increase in age. Both predictors are statistically significant (p < 0.001); therfore, both have a real positive effect on poor health days in this dataset.
phys_days?In the age model, R² = 0.0137 and for BMI model, R² = 0.0083. This means that age explains slightly more variability in poor physical health days than BMI does. Neither predictor explains much variability on its own.
Both age and BMI are positively associated with poor physical health days, but the effect sizes are small and each predictor alone explains very little of the variation in the outcome. Age is slightly stronger in this dataset. Limitations of using this simple linear regression include: (1) only one predictor is included at a time and thus ignores potential confounders; (2) it assumes a linear relationship and may miss non-linear effects; (3) R² is very low, so these simple models have poor predictive ability for individual outcomes; and (4) outliers or influential points could distort the slope or R².
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
Due: Before the next class session.
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 ggeffects_2.3.2 broom_1.0.11 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 here_1.0.2 haven_2.5.5
## [9] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [13] purrr_1.2.1 readr_2.1.5 tidyr_1.3.2 tibble_3.3.0
## [17] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 htmlwidgets_1.6.4
## [5] insight_1.4.5 lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2
## [9] vctrs_0.7.1 tools_4.5.1 generics_0.1.4 pkgconfig_2.0.3
## [13] Matrix_1.7-3 data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1
## [17] gt_1.3.0 lifecycle_1.0.5 compiler_4.5.1 farver_2.1.2
## [21] textshaping_1.0.4 litedown_0.9 htmltools_0.5.9 sass_0.4.10
## [25] yaml_2.3.12 lazyeval_0.2.2 pillar_1.11.0 jquerylib_0.1.4
## [29] cachem_1.1.0 nlme_3.1-168 commonmark_2.0.0 tidyselect_1.2.1
## [33] digest_0.6.39 stringi_1.8.7 labeling_0.4.3 splines_4.5.1
## [37] rprojroot_2.1.1 fastmap_1.2.0 grid_4.5.1 cli_3.6.5
## [41] magrittr_2.0.3 cards_0.7.1 utf8_1.2.6 withr_3.0.2
## [45] scales_1.4.0 backports_1.5.0 timechange_0.3.0 rmarkdown_2.30
## [49] httr_1.4.7 otel_0.2.0 hms_1.1.3 evaluate_1.0.5
## [53] viridisLite_0.4.2 mgcv_1.9-3 markdown_2.0 rlang_1.1.7
## [57] glue_1.8.0 xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0
## [61] jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1 fs_1.6.6