After fitting a regression model, we must ask: does the model actually fit the data well, and are the underlying assumptions satisfied? A model that violates key assumptions may produce misleading coefficients, invalid standard errors, or unreliable predictions.
Regression diagnostics address three fundamental questions:
This lecture provides the tools to diagnose each of these issues and, where possible, apply remedial measures. Diagnostics should be a routine part of every regression analysis, not an afterthought.
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue with the BRFSS 2020 dataset. For this lecture, we predict mentally unhealthy days from several health-related variables.
brfss_full <- read_xpt(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
clean_names()brfss_dx <- brfss_full |>
mutate(
# Outcome
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Predictors
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
!is.na(exercise), !is.na(income_cat)
)
set.seed(1220)
brfss_dx <- brfss_dx |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
bmi, exercise, income_cat) |>
slice_sample(n = 5000)
# Save for lab
saveRDS(brfss_dx,
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dx), ncol(brfss_dx))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 8 |
brfss_dx |>
tbl_summary(
type = list(
c(menthlth_days, physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd}), [{min}, {max}]"
)
) |>
bold_labels()| Characteristic | N = 5,0001 |
|---|---|
| menthlth_days | 4 (8), [0, 30] |
| physhlth_days | 3 (8), [0, 30] |
| sleep_hrs | 7.07 (1.36), [1.00, 14.00] |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 54 (17), [18, 80] |
| sex | |
| Male | 2,454 (49%) |
| Female | 2,546 (51%) |
| bmi | 28.3 (6.2), [13.2, 71.1] |
| exercise | 3,965 (79%) |
| income_cat | 6.17 (2.02), [1.00, 8.00] |
| 1 Mean (SD), [Min, Max]; n (%) | |
Before trusting any regression output, we need to verify that the underlying assumptions are met. The acronym LINE helps us remember the four key assumptions:
| Assumption | Description | Diagnostic tool |
|---|---|---|
| Linearity | The relationship between predictors and outcome is linear | Residuals vs. fitted plot |
| Independence | Residual errors are independent of one another | Study design, Durbin-Watson test |
| Normality | Residual errors follow a normal distribution | Q-Q plot, histogram, Shapiro-Wilk test |
| Equal variance | Variance of residuals is constant across fitted values (homoscedasticity) | Residuals vs. fitted plot, Scale-Location plot, Breusch-Pagan test |
Violations of these assumptions do not necessarily invalidate the analysis, but they may affect the reliability of confidence intervals, p-values, and prediction intervals. The central limit theorem provides some protection when sample sizes are large, particularly for the normality assumption.
Let us fit a multiple regression model predicting mentally unhealthy days from physical health days, sleep, age, sex, and BMI.
mod1 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_dx)
tidy(mod1, conf.int = TRUE) |>
kable(digits = 3, caption = "Working Model Coefficients") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 9.912 | 0.775 | 12.793 | 0.000 | 8.393 | 11.431 |
| physhlth_days | 0.316 | 0.013 | 24.325 | 0.000 | 0.291 | 0.342 |
| sleep_hrs | -0.676 | 0.075 | -9.010 | 0.000 | -0.823 | -0.529 |
| age | -0.079 | 0.006 | -13.306 | 0.000 | -0.090 | -0.067 |
| sexFemale | 1.505 | 0.202 | 7.430 | 0.000 | 1.108 | 1.902 |
| bmi | 0.039 | 0.016 | 2.357 | 0.018 | 0.006 | 0.071 |
glance(mod1) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC, sigma) |>
kable(digits = 3, caption = "Model Fit Summary") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | AIC | BIC | sigma |
|---|---|---|---|---|
| 0.16 | 0.159 | 33857.77 | 33903.39 | 7.142 |
Residuals are the difference between observed and fitted values. They are our primary tool for checking assumptions. However, not all residuals are created equal.
The simplest residual is the raw (ordinary) residual:
\[e_i = Y_i - \hat{Y}_i\]
Raw residuals are on the scale of the outcome variable. Their main limitation is that their variance is not constant across observations, because the variance of \(e_i\) depends on the leverage \(h_i\):
\[\text{Var}(e_i) = \sigma^2(1 - h_i)\]
Observations with higher leverage have smaller residual variance, making it harder to detect unusual observations using raw residuals alone.
# Extract raw residuals
brfss_dx <- brfss_dx |>
mutate(
raw_resid = residuals(mod1),
fitted_val = fitted(mod1)
)
# Summary of raw residuals
summary(brfss_dx$raw_resid)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17.6288 -3.4591 -1.7307 0.0000 0.4225 30.8948
Standardized residuals divide each raw residual by the estimated standard deviation of the residuals (\(\hat{\sigma}\)), also known as the root mean squared error (RMSE):
\[r_i^{std} = \frac{e_i}{\hat{\sigma}}\]
where \(\hat{\sigma} = \sqrt{MSE}\). While this puts residuals on a common scale, it does not account for the fact that different observations have different residual variances (depending on their leverage).
brfss_dx <- brfss_dx |>
mutate(std_resid = rstandard(mod1))
# Observations with |standardized residual| > 3
brfss_dx |>
filter(abs(std_resid) > 3) |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, std_resid) |>
head(10) |>
kable(digits = 3, caption = "Observations with Large Standardized Residuals") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| menthlth_days | physhlth_days | sleep_hrs | age | std_resid |
|---|---|---|---|---|
| 30 | 0 | 6 | 64 | 3.958 |
| 30 | 0 | 7 | 69 | 4.057 |
| 30 | 0 | 5 | 25 | 3.169 |
| 30 | 0 | 6 | 52 | 3.812 |
| 30 | 10 | 7 | 43 | 3.140 |
| 30 | 0 | 8 | 80 | 4.328 |
| 30 | 5 | 10 | 21 | 3.332 |
| 30 | 0 | 8 | 65 | 3.921 |
| 30 | 0 | 6 | 25 | 3.526 |
| 30 | 0 | 5 | 29 | 3.216 |
Studentized residuals account for the varying variance of each residual by dividing by the observation-specific standard error:
\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}\]
This correction means that studentized residuals have approximately
unit variance under the model assumptions. In R,
rstandard() computes these internally studentized
residuals.
Jackknife residuals (also called externally studentized or deleted residuals) take the correction one step further. For observation \(i\), the residual is computed using a model fitted without observation \(i\):
\[r_{(-i)} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_i}}\]
where \(\hat{\sigma}_{(-i)}\) is the RMSE estimated from the model that excludes observation \(i\). Under the null hypothesis that observation \(i\) is not an outlier, \(r_{(-i)} \sim t_{n-p-2}\).
The jackknife residual is the most powerful residual for detecting outliers because it prevents the potentially outlying observation from influencing the estimate of \(\hat{\sigma}\).
brfss_dx <- brfss_dx |>
mutate(jack_resid = rstudent(mod1))
# Compare residual types
brfss_dx |>
dplyr::select(raw_resid, std_resid, jack_resid) |>
pivot_longer(everything(), names_to = "type", values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
facet_wrap(~type, scales = "free_x",
labeller = labeller(type = c(
raw_resid = "Raw Residuals",
std_resid = "Standardized Residuals",
jack_resid = "Jackknife Residuals"
))) +
labs(title = "Distribution of Three Residual Types",
x = "Residual Value", y = "Frequency") +
theme_minimal()| Residual | Formula | Distribution under \(H_0\) | Use |
|---|---|---|---|
| Raw (\(e_i\)) | \(Y_i - \hat{Y}_i\) | Unequal variance | Quick check |
| Standardized | \(e_i / \hat{\sigma}\) | Approximately standard normal | Rough screening |
| Studentized | \(e_i / (\hat{\sigma}\sqrt{1-h_i})\) | Approximately standard normal | Better screening |
| Jackknife (\(r_{(-i)}\)) | \(e_i / (\hat{\sigma}_{(-i)}\sqrt{1-h_i})\) | \(t_{n-p-2}\) | Formal outlier testing |
This is the single most important diagnostic plot. It checks linearity and homoscedasticity simultaneously.
ggplot(brfss_dx, aes(x = fitted_val, y = jack_resid)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
labs(title = "Jackknife Residuals vs. Fitted Values",
x = "Fitted Values", y = "Jackknife Residuals") +
theme_minimal()Interpretation guide:
The Q-Q plot compares the distribution of the residuals to the theoretical normal distribution. If the normality assumption holds, points should fall approximately along the diagonal reference line.
ggplot(brfss_dx, aes(sample = jack_resid)) +
stat_qq(alpha = 0.3, color = "steelblue") +
stat_qq_line(color = "red", linewidth = 1) +
labs(title = "Normal Q-Q Plot of Jackknife Residuals",
x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()Common Q-Q patterns and what they suggest:
| Pattern | Interpretation | Potential remedy |
|---|---|---|
| Points follow the line | Normality satisfied | None needed |
| Right skew (upper tail curves upward) | Right-skewed residuals | Log or square root transformation of \(Y\) |
| Left skew (lower tail curves downward) | Left-skewed residuals | Square transformation of \(Y\) |
| Heavy tails (both ends deviate) | Heavy-tailed distribution | Robust regression or transformation |
| S-shape | Both skew and kurtosis issues | Consider a different model family |
A histogram provides a complementary view to the Q-Q plot.
ggplot(brfss_dx, aes(x = jack_resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 50,
fill = "steelblue", color = "white", alpha = 0.8) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
color = "red", linewidth = 1) +
labs(title = "Histogram of Jackknife Residuals with Normal Curve",
x = "Jackknife Residual", y = "Density") +
theme_minimal()The Scale-Location plot shows \(\sqrt{|\text{standardized residuals}|}\) vs. fitted values. It is specifically designed to assess homoscedasticity. A flat trend line indicates constant variance.
brfss_dx <- brfss_dx |>
mutate(sqrt_abs_std = sqrt(abs(std_resid)))
ggplot(brfss_dx, aes(x = fitted_val, y = sqrt_abs_std)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
labs(title = "Scale-Location Plot",
x = "Fitted Values",
y = expression(sqrt("|Standardized Residuals|"))) +
theme_minimal()R provides four built-in diagnostic plots through plot()
on an lm object. These are quick but less customizable.
The four panels are:
While visual diagnostics are essential, formal tests can supplement them, especially for borderline cases.
# Breusch-Pagan test for heteroscedasticity
bp_test <- car::ncvTest(mod1)
tibble(
Test = "Breusch-Pagan",
Statistic = bp_test$ChiSquare,
df = bp_test$Df,
p_value = bp_test$p
) |>
kable(digits = 4, caption = "Breusch-Pagan Test for Heteroscedasticity") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Test | Statistic | df | p_value |
|---|---|---|---|
| Breusch-Pagan | 1157.809 | 1 | 0 |
Note: With very large samples (n = 5000), formal tests are often overpowered and detect trivial departures from assumptions. In large samples, the visual diagnostic is typically more informative than the p-value. If the Q-Q plot looks reasonable and the residual plot has no alarming pattern, you may proceed even if a formal test rejects.
When assumptions are violated, transforming the outcome variable is a common remedy.
| Problem | Transformation | R code |
|---|---|---|
| Right-skewed residuals | \(\log(Y)\) | lm(log(y) ~ x) |
| Right-skewed residuals (mild) | \(\sqrt{Y}\) | lm(sqrt(y) ~ x) |
| Left-skewed residuals | \(Y^2\) | lm(I(y^2) ~ x) |
| Non-constant variance | \(\log(Y)\) or \(\sqrt{Y}\) | Depends on pattern |
If the Q-Q plot suggests right skew, we can try a log transformation.
Since our outcome (menthlth_days) includes zeros, we use
\(\log(Y + 1)\).
mod_log <- lm(log(menthlth_days + 1) ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_dx)
# Compare Q-Q plots side by side
par(mfrow = c(1, 2))
qqnorm(rstudent(mod1), main = "Q-Q: Original Scale")
qqline(rstudent(mod1), col = "red")
qqnorm(rstudent(mod_log), main = "Q-Q: Log(Y+1) Scale")
qqline(rstudent(mod_log), col = "red")Important: Transforming the outcome changes the interpretation of the coefficients. With \(\log(Y+1)\), a one-unit increase in \(X\) is associated with a multiplicative change in \(Y+1\), not an additive change.
An outlier is an observation with a residual that is much larger than the rest. An influential observation is one whose removal substantially changes the fitted model (coefficients, predictions, or both).
Not all outliers are influential, and not all influential points are outliers. Influence depends on two components:
We test whether any jackknife residual is statistically significant using a Bonferroni correction to account for multiple testing.
For \(n\) observations and \(p\) predictors:
n <- nrow(brfss_dx)
p <- length(coef(mod1)) - 1 # number of predictors (excluding intercept)
alpha <- 0.05
# Critical value with Bonferroni correction
crit_val <- qt(1 - alpha / (2 * n), df = n - p - 2)
cat("Sample size:", n, "\n")## Sample size: 5000
## Number of predictors: 5
## Bonferroni critical value: 4.422
# Identify outliers
outlier_idx <- which(abs(brfss_dx$jack_resid) > crit_val)
cat("Number of jackknife residual outliers:", length(outlier_idx), "\n")## Number of jackknife residual outliers: 0
if (length(outlier_idx) > 0) {
brfss_dx |>
slice(outlier_idx) |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, jack_resid) |>
kable(digits = 3, caption = "Outliers by Jackknife Residual (Bonferroni-corrected)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
}The car::outlierTest() function automates this
Bonferroni-corrected test:
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 190 4.33569 1.4816e-05 0.074082
Leverage (\(h_i\)) measures how far an observation’s predictor values are from the mean of all predictors. High-leverage observations have unusual predictor values and exert disproportionate influence on the fitted line.
The leverage for observation \(i\) is:
\[h_i = \frac{1}{n} + \sum_{j=1}^{p} \frac{(X_{ij} - \bar{X}_j)^2}{(n-1)s_j^2}\]
Key properties of leverage:
Rules of thumb for high leverage:
| Rule | Threshold | Reference |
|---|---|---|
| \(h_i > 2\bar{h}\) | \(2(p+1)/n\) | Hoaglin and Welsch (1978) |
| \(h_i > 3\bar{h}\) | \(3(p+1)/n\) | Velleman and Welsch (1981) |
| \(h_i > 0.5\) | 0.5 | Huber (1981) — clearly large |
| \(h_i > 0.2\) | 0.2 | Huber (1981) — worth examining |
brfss_dx <- brfss_dx |>
mutate(leverage = hatvalues(mod1))
h_bar <- (p + 1) / n
cat("Average leverage:", round(h_bar, 5), "\n")## Average leverage: 0.0012
## 2x average leverage threshold: 0.0024
## 3x average leverage threshold: 0.0036
# High leverage points (using 3x threshold)
high_lev <- brfss_dx |>
filter(leverage > 3 * h_bar)
cat("Number of high-leverage points (> 3x average):", nrow(high_lev), "\n")## Number of high-leverage points (> 3x average): 157
ggplot(brfss_dx, aes(x = seq_len(n), y = leverage)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 2 * h_bar, color = "orange", linetype = "dashed",
linewidth = 0.8) +
geom_hline(yintercept = 3 * h_bar, color = "red", linetype = "dashed",
linewidth = 0.8) +
annotate("text", x = n * 0.9, y = 2 * h_bar + 0.0005, label = "2x average",
color = "orange") +
annotate("text", x = n * 0.9, y = 3 * h_bar + 0.0005, label = "3x average",
color = "red") +
labs(title = "Leverage Values by Observation Index",
x = "Observation Index", y = "Leverage (h_i)") +
theme_minimal()Cook’s distance combines residual size and leverage into a single measure of overall influence. It quantifies how much all fitted values change when observation \(i\) is removed:
\[D_i = \frac{r_i^2}{p+1} \cdot \frac{h_i}{1 - h_i}\]
where \(r_i\) is the internally studentized residual. A large Cook’s distance indicates that the observation substantially influences the regression surface.
Rules of thumb for Cook’s distance:
| Threshold | Interpretation |
|---|---|
| \(D_i > 4/n\) | Common rule for flagging potentially influential points |
| \(D_i > 1\) | Clearly influential (some textbooks) |
| \(D_i > F_{0.50, p+1, n-p-1}\) | Compares to the median of the \(F\)-distribution |
brfss_dx <- brfss_dx |>
mutate(cooks_d = cooks.distance(mod1))
# Threshold
cooks_threshold <- 4 / n
cat("Cook's distance threshold (4/n):", round(cooks_threshold, 5), "\n")## Cook's distance threshold (4/n): 8e-04
influential_obs <- brfss_dx |>
filter(cooks_d > cooks_threshold)
cat("Number of influential points (D > 4/n):", nrow(influential_obs), "\n")## Number of influential points (D > 4/n): 504
ggplot(brfss_dx, aes(x = seq_len(n), y = cooks_d)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = cooks_threshold, color = "red",
linetype = "dashed", linewidth = 0.8) +
labs(title = "Cook's Distance by Observation Index",
subtitle = paste0("Red line: 4/n = ", round(cooks_threshold, 5)),
x = "Observation Index", y = "Cook's Distance") +
theme_minimal()Two additional influence measures complement Cook’s distance:
DFFITS measures the influence of observation \(i\) on its own fitted value:
\[\text{DFFITS}_i = r_{(-i)} \sqrt{\frac{h_i}{1-h_i}}\]
Flagging threshold: \(|\text{DFFITS}_i| > 2\sqrt{(p+1)/n}\)
DFBETAS measures the influence of observation \(i\) on each individual regression coefficient \(\hat{\beta}_j\):
\[\text{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{\hat{\sigma}_{(-i)} \sqrt{(X^TX)^{-1}_{jj}}}\]
Flagging threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\)
brfss_dx <- brfss_dx |>
mutate(dffits_val = dffits(mod1))
dffits_threshold <- 2 * sqrt((p + 1) / n)
dfbetas_threshold <- 2 / sqrt(n)
cat("DFFITS threshold:", round(dffits_threshold, 4), "\n")## DFFITS threshold: 0.0693
## DFBETAS threshold: 0.0283
# Number exceeding DFFITS threshold
cat("Observations with |DFFITS| > threshold:",
sum(abs(brfss_dx$dffits_val) > dffits_threshold), "\n")## Observations with |DFFITS| > threshold: 504
# DFBETAS for each coefficient
dfb <- dfbetas(mod1)
cat("\nObservations exceeding DFBETAS threshold per coefficient:\n")##
## Observations exceeding DFBETAS threshold per coefficient:
tibble(
Coefficient = colnames(dfb),
n_flagged = map_int(seq_len(ncol(dfb)), \(j) sum(abs(dfb[, j]) > dfbetas_threshold))
) |>
kable(caption = "DFBETAS Flags by Coefficient") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Coefficient | n_flagged |
|---|---|
| (Intercept) | 333 |
| physhlth_days | 397 |
| sleep_hrs | 349 |
| age | 263 |
| sexFemale | 350 |
| bmi | 294 |
This plot combines residuals and leverage, with Cook’s distance contours overlaid. Points in the upper-right or lower-right corners are both outlying and high-leverage, making them potentially influential.
ggplot(brfss_dx, aes(x = leverage, y = jack_resid)) +
geom_point(aes(size = cooks_d), alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_vline(xintercept = 3 * h_bar, color = "red", linetype = "dashed") +
scale_size_continuous(name = "Cook's D", range = c(0.5, 4)) +
labs(title = "Jackknife Residuals vs. Leverage",
subtitle = "Point size proportional to Cook's distance",
x = "Leverage", y = "Jackknife Residual") +
theme_minimal()When you identify outliers or influential observations:
# Compare model with and without high-influence points
influential_idx <- which(brfss_dx$cooks_d > cooks_threshold)
brfss_no_infl <- brfss_dx |> slice(-influential_idx)
mod_no_infl <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_no_infl)
# Side-by-side comparison
bind_rows(
tidy(mod1) |> mutate(Model = "Full data"),
tidy(mod_no_infl) |> mutate(Model = "Without influential obs")
) |>
dplyr::select(Model, term, estimate, std.error, p.value) |>
pivot_wider(
names_from = Model,
values_from = c(estimate, std.error, p.value),
names_glue = "{Model}_{.value}"
) |>
kable(digits = 4, caption = "Coefficient Comparison: Full Data vs. Influential Points Removed") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE) |>
scroll_box(width = "100%")| term | Full data_estimate | Without influential obs_estimate | Full data_std.error | Without influential obs_std.error | Full data_p.value | Without influential obs_p.value |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9123 | 5.3267 | 0.7748 | 0.4542 | 0.0000 | 0.0000 |
| physhlth_days | 0.3162 | 0.2806 | 0.0130 | 0.0109 | 0.0000 | 0.0000 |
| sleep_hrs | -0.6763 | -0.3095 | 0.0751 | 0.0450 | 0.0000 | 0.0000 |
| age | -0.0789 | -0.0567 | 0.0059 | 0.0033 | 0.0000 | 0.0000 |
| sexFemale | 1.5046 | 1.3468 | 0.2025 | 0.1125 | 0.0000 | 0.0000 |
| bmi | 0.0385 | 0.0285 | 0.0163 | 0.0095 | 0.0185 | 0.0026 |
Multicollinearity occurs when two or more predictors in a regression model are moderately or highly correlated. This creates instability in the coefficient estimates: the standard errors inflate, individual predictors may appear non-significant even when the overall model is significant, and estimated coefficients may change dramatically when variables are added or removed.
| Type | Cause | Example |
|---|---|---|
| Structural | Mathematical artifact from creating new variables | Including both \(X\) and \(X^2\); or age and age-squared |
| Data-based | Inherent correlation in the data | Income and education; systolic and diastolic blood pressure |
How do you know multicollinearity is present?
The VIF for predictor \(j\) is:
\[VIF_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is the \(R^2\) obtained from regressing \(X_j\) on all other predictors. The VIF tells you how much the variance of \(\hat{\beta}_j\) is inflated by collinearity relative to what it would be if \(X_j\) were uncorrelated with the other predictors.
An equivalent measure is the tolerance:
\[\text{tolerance}_j = \frac{1}{VIF_j} = 1 - R_j^2\]
Rules of thumb:
| VIF value | Interpretation |
|---|---|
| \(VIF_j = 1\) | No correlation with other predictors |
| \(VIF_j > 4\) | Needs further investigation |
| \(VIF_j > 10\) | Strong collinearity, action likely needed |
vif_values <- car::vif(mod1)
tibble(
Predictor = names(vif_values),
VIF = round(vif_values, 3),
Tolerance = round(1 / vif_values, 3)
) |>
mutate(Flag = case_when(
VIF > 10 ~ "Strong collinearity",
VIF > 4 ~ "Investigate",
TRUE ~ "OK"
)) |>
kable(caption = "Variance Inflation Factors") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | VIF | Tolerance | Flag |
|---|---|---|---|
| physhlth_days | 1.030 | 0.971 | OK |
| sleep_hrs | 1.028 | 0.973 | OK |
| age | 1.029 | 0.971 | OK |
| sex | 1.004 | 0.996 | OK |
| bmi | 1.020 | 0.980 | OK |
While pairwise correlations do not capture all forms of multicollinearity (a variable can be correlated with a linear combination of other variables without being strongly correlated with any single one), they are a useful starting point.
brfss_dx |>
dplyr::select(physhlth_days, sleep_hrs, age, bmi) |>
cor(use = "complete.obs") |>
round(3) |>
kable(caption = "Pairwise Correlation Matrix (Continuous Predictors)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| physhlth_days | sleep_hrs | age | bmi | |
|---|---|---|---|---|
| physhlth_days | 1.000 | -0.075 | 0.092 | 0.120 |
| sleep_hrs | -0.075 | 1.000 | 0.124 | -0.073 |
| age | 0.092 | 0.124 | 1.000 | 0.011 |
| bmi | 0.120 | -0.073 | 0.011 | 1.000 |
The appropriate remedy depends on the type of multicollinearity:
For structural multicollinearity (e.g., \(X\) and \(X^2\)):
# Example: structural multicollinearity
brfss_dx <- brfss_dx |>
mutate(
bmi_sq = bmi^2,
bmi_c = bmi - mean(bmi),
bmi_c_sq = bmi_c^2
)
# VIF before centering
mod_poly <- lm(menthlth_days ~ bmi + bmi_sq, data = brfss_dx)
cat("VIF before centering:\n")## VIF before centering:
## bmi bmi_sq
## 30.62497 30.62497
# VIF after centering
mod_poly_c <- lm(menthlth_days ~ bmi_c + bmi_c_sq, data = brfss_dx)
cat("\nVIF after centering:\n")##
## VIF after centering:
## bmi_c bmi_c_sq
## 1.447669 1.447669
For data-based multicollinearity:
A systematic approach to regression diagnostics:
| Step | Action | Tools |
|---|---|---|
| 1 | Fit the model and examine coefficients | lm(), tidy(), glance() |
| 2 | Check residuals vs. fitted plot (linearity + homoscedasticity) | Residuals vs. fitted, Breusch-Pagan test |
| 3 | Check Q-Q plot and histogram (normality) | Q-Q plot, histogram, Shapiro-Wilk |
| 4 | If assumptions violated, consider transformations and refit | log(Y), sqrt(Y), Y^2 |
| 5 | Compute VIF for multicollinearity | car::vif() |
| 6 | Compute leverage, Cook’s distance, DFFITS, DFBETAS for influential obs | hatvalues(), cooks.distance(), dffits(), dfbetas() |
| 7 | Compare model with and without influential observations; report both if different | Compare coefficients, CIs |
| Topic | Key point |
|---|---|
| LINE assumptions | Linearity, Independence, Normality, Equal Variance must be checked |
| Residuals | Jackknife residuals are preferred for outlier detection |
| Diagnostic plots | Residuals vs. fitted (linearity + variance); Q-Q plot (normality) |
| Transformations | \(\log(Y)\) for right skew; \(Y^2\) for left skew |
| Outlier testing | Use Bonferroni-corrected jackknife residuals |
| Leverage | Measures extremeness in predictor space; \(h_i > 2(p+1)/n\) or \(3(p+1)/n\) |
| Cook’s distance | Combines residual and leverage; \(D_i > 4/n\) flags potential influence |
| DFFITS/DFBETAS | Measure influence on fitted values and individual coefficients |
| Do not auto-delete | Always investigate; report models with and without flagged observations |
| Multicollinearity | VIF > 10 is strong collinearity; center variables for structural collinearity |
EPI 553 — Regression Diagnostics Lab Due: End of class, March 24, 2026
In this lab, you will conduct a thorough diagnostic assessment of a regression model using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. The
outcome for the lab is menthlth_days and
you will build a model with a different set of
predictors than the guided practice.
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
bmi |
Body mass index | Continuous |
exercise |
Any physical activity (Yes/No) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
brfss_dx <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds"
)1a. (5 pts) Fit the model:
menthlth_days ~ physhlth_days + sleep_hrs + age + sex + exercise + income_cat.
Report the model coefficients, \(R^2\),
and Adjusted \(R^2\).
1b. (5 pts) Create a residuals vs. fitted values plot using jackknife residuals. Include a loess smoothing line. Does the linearity assumption appear to hold? What about homoscedasticity?
1c. (5 pts) Create a normal Q-Q plot of the jackknife residuals. Does the normality assumption hold? Describe the specific pattern you observe (e.g., right skew, heavy tails, etc.).
1d. (5 pts) Run the Breusch-Pagan test for
heteroscedasticity using car::ncvTest(). Report the test
statistic and p-value. Given the large sample size, how should you weigh
the formal test result against the visual evidence from your residual
plot?
2a. (5 pts) Based on your diagnostic plots in Task 1, apply an appropriate transformation to the outcome variable (e.g., \(\log(Y+1)\) or \(\sqrt{Y}\)). Refit the model and create a new Q-Q plot. Did the transformation improve normality?
2b. (5 pts) Create residuals vs. fitted plots for both the original and transformed models side by side. Compare the patterns.
2c. (5 pts) In 2–3 sentences, explain how the transformation changes the interpretation of the regression coefficients. Is the trade-off worth it?
3a. (5 pts) Compute jackknife residuals for the original (untransformed) model. What are the minimum and maximum jackknife residuals? How many observations have \(|r_{(-i)}| > 3\)?
3b. (10 pts) Perform a formal Bonferroni-corrected
outlier test. Calculate the critical value manually using
qt() and then verify using car::outlierTest().
Are any observations statistically significant outliers?
3c. (5 pts) For the top 5 observations with the largest absolute jackknife residuals, examine their predictor values. Is there anything unusual about these observations? Create a table showing their values.
3d. (5 pts) In 2–3 sentences, explain why it is important to use the Bonferroni correction when testing multiple residuals for outlier status.
4a. (5 pts) Compute the leverage (\(h_i\)) for all observations. Report the average leverage. How many observations exceed the \(3\bar{h}\) threshold?
4b. (5 pts) Compute Cook’s distance for all observations. Using the \(4/n\) threshold, how many observations are potentially influential?
4c. (5 pts) Create a scatterplot of jackknife residuals vs. leverage with point size proportional to Cook’s distance. Identify any observations that are both high-leverage and have large residuals.
4d. (5 pts) Refit the model after removing all observations with Cook’s distance \(> 4/n\). Compare the coefficients with the full-data model in a single table. Do any coefficients change substantially?
4e. (5 pts) A colleague argues: “We should just remove all outliers and high-leverage points to get a cleaner model.” Write a 3–4 sentence response explaining why this approach is problematic.
5a. (5 pts) Compute the VIF for all predictors in your model. Present the results in a table with VIF, tolerance, and a flag column. Is there evidence of multicollinearity?
5b. (5 pts) Create a pairwise correlation matrix for the continuous predictors. Which pair has the strongest correlation?
5c. (5 pts) Suppose you wanted to add a quadratic
term for BMI to the model (\(\text{bmi}^2\)). Compute the VIF for
bmi and bmi_sq in this expanded model. Then
center BMI and recompute. Show that centering reduces the VIF. Why does
centering help?
End of Lab Activity