Introduction

After fitting a regression model, we must ask: does the model actually fit the data well, and are the underlying assumptions satisfied? A model that violates key assumptions may produce misleading coefficients, invalid standard errors, or unreliable predictions.

Regression diagnostics address three fundamental questions:

  1. Are the regression assumptions satisfied? (Linearity, Independence, Normality, Equal Variance)
  2. Is there multicollinearity among predictors? (Inflated standard errors, unstable coefficients)
  3. Are there influential observations? (Outliers, high-leverage points that distort the fitted model)

This lecture provides the tools to diagnose each of these issues and, where possible, apply remedial measures. Diagnostics should be a routine part of every regression analysis, not an afterthought.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset. For this lecture, we predict mentally unhealthy days from several health-related variables.

brfss_full <- read_xpt(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
  clean_names()
brfss_dx <- brfss_full |>
  mutate(
    # Outcome
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    # Predictors
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
    !is.na(exercise), !is.na(income_cat)
  )

set.seed(1220)
brfss_dx <- brfss_dx |>
  dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
                bmi, exercise, income_cat) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_dx,
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dx), ncol(brfss_dx))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 8

Quick Descriptive Overview

brfss_dx |>
  tbl_summary(
    type = list(
      c(menthlth_days, physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}), [{min}, {max}]"
    )
  ) |>
  bold_labels()
Characteristic N = 5,0001
menthlth_days 4 (8), [0, 30]
physhlth_days 3 (8), [0, 30]
sleep_hrs 7.07 (1.36), [1.00, 14.00]
IMPUTED AGE VALUE COLLAPSED ABOVE 80 54 (17), [18, 80]
sex
    Male 2,454 (49%)
    Female 2,546 (51%)
bmi 28.3 (6.2), [13.2, 71.1]
exercise 3,965 (79%)
income_cat 6.17 (2.02), [1.00, 8.00]
1 Mean (SD), [Min, Max]; n (%)

Part 1: Guided Practice — Regression Diagnostics


1. Regression Assumptions (LINE)

Before trusting any regression output, we need to verify that the underlying assumptions are met. The acronym LINE helps us remember the four key assumptions:

Assumption Description Diagnostic tool
Linearity The relationship between predictors and outcome is linear Residuals vs. fitted plot
Independence Residual errors are independent of one another Study design, Durbin-Watson test
Normality Residual errors follow a normal distribution Q-Q plot, histogram, Shapiro-Wilk test
Equal variance Variance of residuals is constant across fitted values (homoscedasticity) Residuals vs. fitted plot, Scale-Location plot, Breusch-Pagan test

Violations of these assumptions do not necessarily invalidate the analysis, but they may affect the reliability of confidence intervals, p-values, and prediction intervals. The central limit theorem provides some protection when sample sizes are large, particularly for the normality assumption.

1.1 Fitting the Working Model

Let us fit a multiple regression model predicting mentally unhealthy days from physical health days, sleep, age, sex, and BMI.

mod1 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
           data = brfss_dx)

tidy(mod1, conf.int = TRUE) |>
  kable(digits = 3, caption = "Working Model Coefficients") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Working Model Coefficients
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 9.912 0.775 12.793 0.000 8.393 11.431
physhlth_days 0.316 0.013 24.325 0.000 0.291 0.342
sleep_hrs -0.676 0.075 -9.010 0.000 -0.823 -0.529
age -0.079 0.006 -13.306 0.000 -0.090 -0.067
sexFemale 1.505 0.202 7.430 0.000 1.108 1.902
bmi 0.039 0.016 2.357 0.018 0.006 0.071
glance(mod1) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC, sigma) |>
  kable(digits = 3, caption = "Model Fit Summary") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Fit Summary
r.squared adj.r.squared AIC BIC sigma
0.16 0.159 33857.77 33903.39 7.142

2. Residuals: Types and Interpretation

Residuals are the difference between observed and fitted values. They are our primary tool for checking assumptions. However, not all residuals are created equal.

2.1 Raw Residuals

The simplest residual is the raw (ordinary) residual:

\[e_i = Y_i - \hat{Y}_i\]

Raw residuals are on the scale of the outcome variable. Their main limitation is that their variance is not constant across observations, because the variance of \(e_i\) depends on the leverage \(h_i\):

\[\text{Var}(e_i) = \sigma^2(1 - h_i)\]

Observations with higher leverage have smaller residual variance, making it harder to detect unusual observations using raw residuals alone.

# Extract raw residuals
brfss_dx <- brfss_dx |>
  mutate(
    raw_resid = residuals(mod1),
    fitted_val = fitted(mod1)
  )

# Summary of raw residuals
summary(brfss_dx$raw_resid)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -17.6288  -3.4591  -1.7307   0.0000   0.4225  30.8948

2.2 Standardized Residuals

Standardized residuals divide each raw residual by the estimated standard deviation of the residuals (\(\hat{\sigma}\)), also known as the root mean squared error (RMSE):

\[r_i^{std} = \frac{e_i}{\hat{\sigma}}\]

where \(\hat{\sigma} = \sqrt{MSE}\). While this puts residuals on a common scale, it does not account for the fact that different observations have different residual variances (depending on their leverage).

brfss_dx <- brfss_dx |>
  mutate(std_resid = rstandard(mod1))

# Observations with |standardized residual| > 3
brfss_dx |>
  filter(abs(std_resid) > 3) |>
  dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, std_resid) |>
  head(10) |>
  kable(digits = 3, caption = "Observations with Large Standardized Residuals") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Observations with Large Standardized Residuals
menthlth_days physhlth_days sleep_hrs age std_resid
30 0 6 64 3.958
30 0 7 69 4.057
30 0 5 25 3.169
30 0 6 52 3.812
30 10 7 43 3.140
30 0 8 80 4.328
30 5 10 21 3.332
30 0 8 65 3.921
30 0 6 25 3.526
30 0 5 29 3.216

2.3 Studentized (Internally Studentized) Residuals

Studentized residuals account for the varying variance of each residual by dividing by the observation-specific standard error:

\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}\]

This correction means that studentized residuals have approximately unit variance under the model assumptions. In R, rstandard() computes these internally studentized residuals.

2.4 Jackknife (Externally Studentized) Residuals

Jackknife residuals (also called externally studentized or deleted residuals) take the correction one step further. For observation \(i\), the residual is computed using a model fitted without observation \(i\):

\[r_{(-i)} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_i}}\]

where \(\hat{\sigma}_{(-i)}\) is the RMSE estimated from the model that excludes observation \(i\). Under the null hypothesis that observation \(i\) is not an outlier, \(r_{(-i)} \sim t_{n-p-2}\).

The jackknife residual is the most powerful residual for detecting outliers because it prevents the potentially outlying observation from influencing the estimate of \(\hat{\sigma}\).

brfss_dx <- brfss_dx |>
  mutate(jack_resid = rstudent(mod1))

# Compare residual types
brfss_dx |>
  dplyr::select(raw_resid, std_resid, jack_resid) |>
  pivot_longer(everything(), names_to = "type", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
  facet_wrap(~type, scales = "free_x",
             labeller = labeller(type = c(
               raw_resid = "Raw Residuals",
               std_resid = "Standardized Residuals",
               jack_resid = "Jackknife Residuals"
             ))) +
  labs(title = "Distribution of Three Residual Types",
       x = "Residual Value", y = "Frequency") +
  theme_minimal()

2.5 Summary of Residual Types

Residual Formula Distribution under \(H_0\) Use
Raw (\(e_i\)) \(Y_i - \hat{Y}_i\) Unequal variance Quick check
Standardized \(e_i / \hat{\sigma}\) Approximately standard normal Rough screening
Studentized \(e_i / (\hat{\sigma}\sqrt{1-h_i})\) Approximately standard normal Better screening
Jackknife (\(r_{(-i)}\)) \(e_i / (\hat{\sigma}_{(-i)}\sqrt{1-h_i})\) \(t_{n-p-2}\) Formal outlier testing

3. Diagnostic Plots

3.1 Residuals vs. Fitted Values

This is the single most important diagnostic plot. It checks linearity and homoscedasticity simultaneously.

  • Linearity: If the model is correct, the residuals should scatter randomly around zero with no systematic pattern (no curve, no trend).
  • Equal variance: The vertical spread of the residuals should remain roughly constant as fitted values increase. A “funnel” shape (spread increasing or decreasing) indicates heteroscedasticity.
ggplot(brfss_dx, aes(x = fitted_val, y = jack_resid)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
  labs(title = "Jackknife Residuals vs. Fitted Values",
       x = "Fitted Values", y = "Jackknife Residuals") +
  theme_minimal()

Interpretation guide:

  • A flat loess line near zero suggests linearity holds.
  • A curved loess line suggests non-linearity (consider polynomial terms or transformations).
  • A funnel shape suggests heteroscedasticity (consider a variance-stabilizing transformation such as log or square root).

3.2 Normal Q-Q Plot

The Q-Q plot compares the distribution of the residuals to the theoretical normal distribution. If the normality assumption holds, points should fall approximately along the diagonal reference line.

ggplot(brfss_dx, aes(sample = jack_resid)) +
  stat_qq(alpha = 0.3, color = "steelblue") +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "Normal Q-Q Plot of Jackknife Residuals",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

Common Q-Q patterns and what they suggest:

Pattern Interpretation Potential remedy
Points follow the line Normality satisfied None needed
Right skew (upper tail curves upward) Right-skewed residuals Log or square root transformation of \(Y\)
Left skew (lower tail curves downward) Left-skewed residuals Square transformation of \(Y\)
Heavy tails (both ends deviate) Heavy-tailed distribution Robust regression or transformation
S-shape Both skew and kurtosis issues Consider a different model family

3.3 Histogram of Residuals

A histogram provides a complementary view to the Q-Q plot.

ggplot(brfss_dx, aes(x = jack_resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
                color = "red", linewidth = 1) +
  labs(title = "Histogram of Jackknife Residuals with Normal Curve",
       x = "Jackknife Residual", y = "Density") +
  theme_minimal()

3.4 Scale-Location Plot

The Scale-Location plot shows \(\sqrt{|\text{standardized residuals}|}\) vs. fitted values. It is specifically designed to assess homoscedasticity. A flat trend line indicates constant variance.

brfss_dx <- brfss_dx |>
  mutate(sqrt_abs_std = sqrt(abs(std_resid)))

ggplot(brfss_dx, aes(x = fitted_val, y = sqrt_abs_std)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal()

3.5 Base R Diagnostic Plots

R provides four built-in diagnostic plots through plot() on an lm object. These are quick but less customizable.

par(mfrow = c(2, 2))
plot(mod1)

par(mfrow = c(1, 1))

The four panels are:

  1. Residuals vs. Fitted (linearity + homoscedasticity)
  2. Normal Q-Q (normality)
  3. Scale-Location (homoscedasticity)
  4. Residuals vs. Leverage (influential observations via Cook’s distance contours)

3.6 Formal Tests for Assumptions

While visual diagnostics are essential, formal tests can supplement them, especially for borderline cases.

# Breusch-Pagan test for heteroscedasticity
bp_test <- car::ncvTest(mod1)
tibble(
  Test = "Breusch-Pagan",
  Statistic = bp_test$ChiSquare,
  df = bp_test$Df,
  p_value = bp_test$p
) |>
  kable(digits = 4, caption = "Breusch-Pagan Test for Heteroscedasticity") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Breusch-Pagan Test for Heteroscedasticity
Test Statistic df p_value
Breusch-Pagan 1157.809 1 0

Note: With very large samples (n = 5000), formal tests are often overpowered and detect trivial departures from assumptions. In large samples, the visual diagnostic is typically more informative than the p-value. If the Q-Q plot looks reasonable and the residual plot has no alarming pattern, you may proceed even if a formal test rejects.


4. Remedial Measures: Transformations

When assumptions are violated, transforming the outcome variable is a common remedy.

Problem Transformation R code
Right-skewed residuals \(\log(Y)\) lm(log(y) ~ x)
Right-skewed residuals (mild) \(\sqrt{Y}\) lm(sqrt(y) ~ x)
Left-skewed residuals \(Y^2\) lm(I(y^2) ~ x)
Non-constant variance \(\log(Y)\) or \(\sqrt{Y}\) Depends on pattern

4.1 Trying a Transformation

If the Q-Q plot suggests right skew, we can try a log transformation. Since our outcome (menthlth_days) includes zeros, we use \(\log(Y + 1)\).

mod_log <- lm(log(menthlth_days + 1) ~ physhlth_days + sleep_hrs + age + sex + bmi,
              data = brfss_dx)

# Compare Q-Q plots side by side
par(mfrow = c(1, 2))

qqnorm(rstudent(mod1), main = "Q-Q: Original Scale")
qqline(rstudent(mod1), col = "red")

qqnorm(rstudent(mod_log), main = "Q-Q: Log(Y+1) Scale")
qqline(rstudent(mod_log), col = "red")

par(mfrow = c(1, 1))

Important: Transforming the outcome changes the interpretation of the coefficients. With \(\log(Y+1)\), a one-unit increase in \(X\) is associated with a multiplicative change in \(Y+1\), not an additive change.


5. Outliers and Influential Observations

An outlier is an observation with a residual that is much larger than the rest. An influential observation is one whose removal substantially changes the fitted model (coefficients, predictions, or both).

Not all outliers are influential, and not all influential points are outliers. Influence depends on two components:

  • Residual size (the distance of \(Y_i\) from the regression surface)
  • Leverage (the distance of \(X_i\) from the center of the predictor space)

5.1 Detecting Outliers via Jackknife Residuals

We test whether any jackknife residual is statistically significant using a Bonferroni correction to account for multiple testing.

For \(n\) observations and \(p\) predictors:

  • \(H_0\): observation \(i\) is not an outlier
  • Under \(H_0\): \(r_{(-i)} \sim t_{n-p-2}\)
  • Bonferroni-corrected significance level: \(\alpha / n\)
  • Observation \(i\) is an outlier if \(|r_{(-i)}| > t_{n-p-2,\; 1-\alpha/(2n)}\)
n <- nrow(brfss_dx)
p <- length(coef(mod1)) - 1  # number of predictors (excluding intercept)
alpha <- 0.05

# Critical value with Bonferroni correction
crit_val <- qt(1 - alpha / (2 * n), df = n - p - 2)

cat("Sample size:", n, "\n")
## Sample size: 5000
cat("Number of predictors:", p, "\n")
## Number of predictors: 5
cat("Bonferroni critical value:", round(crit_val, 3), "\n\n")
## Bonferroni critical value: 4.422
# Identify outliers
outlier_idx <- which(abs(brfss_dx$jack_resid) > crit_val)
cat("Number of jackknife residual outliers:", length(outlier_idx), "\n")
## Number of jackknife residual outliers: 0
if (length(outlier_idx) > 0) {
  brfss_dx |>
    slice(outlier_idx) |>
    dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, jack_resid) |>
    kable(digits = 3, caption = "Outliers by Jackknife Residual (Bonferroni-corrected)") |>
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
}

The car::outlierTest() function automates this Bonferroni-corrected test:

car::outlierTest(mod1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferroni p
## 190  4.33569         1.4816e-05     0.074082

5.2 Leverage

Leverage (\(h_i\)) measures how far an observation’s predictor values are from the mean of all predictors. High-leverage observations have unusual predictor values and exert disproportionate influence on the fitted line.

The leverage for observation \(i\) is:

\[h_i = \frac{1}{n} + \sum_{j=1}^{p} \frac{(X_{ij} - \bar{X}_j)^2}{(n-1)s_j^2}\]

Key properties of leverage:

  • \(0 \leq h_i \leq 1\)
  • \(\sum_{i=1}^{n} h_i = p + 1\)
  • Average leverage: \(\bar{h} = (p+1)/n\)

Rules of thumb for high leverage:

Rule Threshold Reference
\(h_i > 2\bar{h}\) \(2(p+1)/n\) Hoaglin and Welsch (1978)
\(h_i > 3\bar{h}\) \(3(p+1)/n\) Velleman and Welsch (1981)
\(h_i > 0.5\) 0.5 Huber (1981) — clearly large
\(h_i > 0.2\) 0.2 Huber (1981) — worth examining
brfss_dx <- brfss_dx |>
  mutate(leverage = hatvalues(mod1))

h_bar <- (p + 1) / n

cat("Average leverage:", round(h_bar, 5), "\n")
## Average leverage: 0.0012
cat("2x average leverage threshold:", round(2 * h_bar, 5), "\n")
## 2x average leverage threshold: 0.0024
cat("3x average leverage threshold:", round(3 * h_bar, 5), "\n\n")
## 3x average leverage threshold: 0.0036
# High leverage points (using 3x threshold)
high_lev <- brfss_dx |>
  filter(leverage > 3 * h_bar)

cat("Number of high-leverage points (> 3x average):", nrow(high_lev), "\n")
## Number of high-leverage points (> 3x average): 157
ggplot(brfss_dx, aes(x = seq_len(n), y = leverage)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 2 * h_bar, color = "orange", linetype = "dashed",
             linewidth = 0.8) +
  geom_hline(yintercept = 3 * h_bar, color = "red", linetype = "dashed",
             linewidth = 0.8) +
  annotate("text", x = n * 0.9, y = 2 * h_bar + 0.0005, label = "2x average",
           color = "orange") +
  annotate("text", x = n * 0.9, y = 3 * h_bar + 0.0005, label = "3x average",
           color = "red") +
  labs(title = "Leverage Values by Observation Index",
       x = "Observation Index", y = "Leverage (h_i)") +
  theme_minimal()

5.3 Cook’s Distance

Cook’s distance combines residual size and leverage into a single measure of overall influence. It quantifies how much all fitted values change when observation \(i\) is removed:

\[D_i = \frac{r_i^2}{p+1} \cdot \frac{h_i}{1 - h_i}\]

where \(r_i\) is the internally studentized residual. A large Cook’s distance indicates that the observation substantially influences the regression surface.

Rules of thumb for Cook’s distance:

Threshold Interpretation
\(D_i > 4/n\) Common rule for flagging potentially influential points
\(D_i > 1\) Clearly influential (some textbooks)
\(D_i > F_{0.50, p+1, n-p-1}\) Compares to the median of the \(F\)-distribution
brfss_dx <- brfss_dx |>
  mutate(cooks_d = cooks.distance(mod1))

# Threshold
cooks_threshold <- 4 / n

cat("Cook's distance threshold (4/n):", round(cooks_threshold, 5), "\n")
## Cook's distance threshold (4/n): 8e-04
influential_obs <- brfss_dx |>
  filter(cooks_d > cooks_threshold)

cat("Number of influential points (D > 4/n):", nrow(influential_obs), "\n")
## Number of influential points (D > 4/n): 504
ggplot(brfss_dx, aes(x = seq_len(n), y = cooks_d)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = cooks_threshold, color = "red",
             linetype = "dashed", linewidth = 0.8) +
  labs(title = "Cook's Distance by Observation Index",
       subtitle = paste0("Red line: 4/n = ", round(cooks_threshold, 5)),
       x = "Observation Index", y = "Cook's Distance") +
  theme_minimal()

5.4 DFFITS and DFBETAS

Two additional influence measures complement Cook’s distance:

DFFITS measures the influence of observation \(i\) on its own fitted value:

\[\text{DFFITS}_i = r_{(-i)} \sqrt{\frac{h_i}{1-h_i}}\]

Flagging threshold: \(|\text{DFFITS}_i| > 2\sqrt{(p+1)/n}\)

DFBETAS measures the influence of observation \(i\) on each individual regression coefficient \(\hat{\beta}_j\):

\[\text{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{\hat{\sigma}_{(-i)} \sqrt{(X^TX)^{-1}_{jj}}}\]

Flagging threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\)

brfss_dx <- brfss_dx |>
  mutate(dffits_val = dffits(mod1))

dffits_threshold <- 2 * sqrt((p + 1) / n)
dfbetas_threshold <- 2 / sqrt(n)

cat("DFFITS threshold:", round(dffits_threshold, 4), "\n")
## DFFITS threshold: 0.0693
cat("DFBETAS threshold:", round(dfbetas_threshold, 4), "\n\n")
## DFBETAS threshold: 0.0283
# Number exceeding DFFITS threshold
cat("Observations with |DFFITS| > threshold:",
    sum(abs(brfss_dx$dffits_val) > dffits_threshold), "\n")
## Observations with |DFFITS| > threshold: 504
# DFBETAS for each coefficient
dfb <- dfbetas(mod1)
cat("\nObservations exceeding DFBETAS threshold per coefficient:\n")
## 
## Observations exceeding DFBETAS threshold per coefficient:
tibble(
  Coefficient = colnames(dfb),
  n_flagged = map_int(seq_len(ncol(dfb)), \(j) sum(abs(dfb[, j]) > dfbetas_threshold))
) |>
  kable(caption = "DFBETAS Flags by Coefficient") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
DFBETAS Flags by Coefficient
Coefficient n_flagged
(Intercept) 333
physhlth_days 397
sleep_hrs 349
age 263
sexFemale 350
bmi 294

5.5 Residuals vs. Leverage Plot

This plot combines residuals and leverage, with Cook’s distance contours overlaid. Points in the upper-right or lower-right corners are both outlying and high-leverage, making them potentially influential.

ggplot(brfss_dx, aes(x = leverage, y = jack_resid)) +
  geom_point(aes(size = cooks_d), alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 3 * h_bar, color = "red", linetype = "dashed") +
  scale_size_continuous(name = "Cook's D", range = c(0.5, 4)) +
  labs(title = "Jackknife Residuals vs. Leverage",
       subtitle = "Point size proportional to Cook's distance",
       x = "Leverage", y = "Jackknife Residual") +
  theme_minimal()

5.6 What To Do with Outliers and Influential Points

When you identify outliers or influential observations:

  1. Investigate the observations. Are they data entry errors? Impossible values? Special subpopulations?
  2. Fit the model with and without the flagged observations.
  3. Compare the results. If the coefficients change substantially, report both sets of results.
  4. Do NOT automatically delete observations just because they have large residuals or leverage. Only remove observations if there is a scientific or data-quality justification.
# Compare model with and without high-influence points
influential_idx <- which(brfss_dx$cooks_d > cooks_threshold)
brfss_no_infl <- brfss_dx |> slice(-influential_idx)

mod_no_infl <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
                  data = brfss_no_infl)

# Side-by-side comparison
bind_rows(
  tidy(mod1) |> mutate(Model = "Full data"),
  tidy(mod_no_infl) |> mutate(Model = "Without influential obs")
) |>
  dplyr::select(Model, term, estimate, std.error, p.value) |>
  pivot_wider(
    names_from = Model,
    values_from = c(estimate, std.error, p.value),
    names_glue = "{Model}_{.value}"
  ) |>
  kable(digits = 4, caption = "Coefficient Comparison: Full Data vs. Influential Points Removed") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE) |>
  scroll_box(width = "100%")
Coefficient Comparison: Full Data vs. Influential Points Removed
term Full data_estimate Without influential obs_estimate Full data_std.error Without influential obs_std.error Full data_p.value Without influential obs_p.value
(Intercept) 9.9123 5.3267 0.7748 0.4542 0.0000 0.0000
physhlth_days 0.3162 0.2806 0.0130 0.0109 0.0000 0.0000
sleep_hrs -0.6763 -0.3095 0.0751 0.0450 0.0000 0.0000
age -0.0789 -0.0567 0.0059 0.0033 0.0000 0.0000
sexFemale 1.5046 1.3468 0.2025 0.1125 0.0000 0.0000
bmi 0.0385 0.0285 0.0163 0.0095 0.0185 0.0026

6. Multicollinearity

Multicollinearity occurs when two or more predictors in a regression model are moderately or highly correlated. This creates instability in the coefficient estimates: the standard errors inflate, individual predictors may appear non-significant even when the overall model is significant, and estimated coefficients may change dramatically when variables are added or removed.

6.1 Types of Multicollinearity

Type Cause Example
Structural Mathematical artifact from creating new variables Including both \(X\) and \(X^2\); or age and age-squared
Data-based Inherent correlation in the data Income and education; systolic and diastolic blood pressure

6.2 Symptoms of Multicollinearity

How do you know multicollinearity is present?

  • Estimated \(\beta\) values change wildly when predictors are added or removed
  • Individual predictor t-tests are non-significant, but the overall F-test is significant
  • Pairwise correlations among predictors are large (though this misses multicollinearity involving linear combinations of several variables)
  • Variance Inflation Factors (VIF) are large

6.3 Variance Inflation Factor (VIF)

The VIF for predictor \(j\) is:

\[VIF_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is the \(R^2\) obtained from regressing \(X_j\) on all other predictors. The VIF tells you how much the variance of \(\hat{\beta}_j\) is inflated by collinearity relative to what it would be if \(X_j\) were uncorrelated with the other predictors.

An equivalent measure is the tolerance:

\[\text{tolerance}_j = \frac{1}{VIF_j} = 1 - R_j^2\]

Rules of thumb:

VIF value Interpretation
\(VIF_j = 1\) No correlation with other predictors
\(VIF_j > 4\) Needs further investigation
\(VIF_j > 10\) Strong collinearity, action likely needed
vif_values <- car::vif(mod1)

tibble(
  Predictor = names(vif_values),
  VIF = round(vif_values, 3),
  Tolerance = round(1 / vif_values, 3)
) |>
  mutate(Flag = case_when(
    VIF > 10 ~ "Strong collinearity",
    VIF > 4  ~ "Investigate",
    TRUE     ~ "OK"
  )) |>
  kable(caption = "Variance Inflation Factors") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Variance Inflation Factors
Predictor VIF Tolerance Flag
physhlth_days 1.030 0.971 OK
sleep_hrs 1.028 0.973 OK
age 1.029 0.971 OK
sex 1.004 0.996 OK
bmi 1.020 0.980 OK

6.4 Correlation Matrix for Context

While pairwise correlations do not capture all forms of multicollinearity (a variable can be correlated with a linear combination of other variables without being strongly correlated with any single one), they are a useful starting point.

brfss_dx |>
  dplyr::select(physhlth_days, sleep_hrs, age, bmi) |>
  cor(use = "complete.obs") |>
  round(3) |>
  kable(caption = "Pairwise Correlation Matrix (Continuous Predictors)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Pairwise Correlation Matrix (Continuous Predictors)
physhlth_days sleep_hrs age bmi
physhlth_days 1.000 -0.075 0.092 0.120
sleep_hrs -0.075 1.000 0.124 -0.073
age 0.092 0.124 1.000 0.011
bmi 0.120 -0.073 0.011 1.000

6.5 Remedies for Multicollinearity

The appropriate remedy depends on the type of multicollinearity:

For structural multicollinearity (e.g., \(X\) and \(X^2\)):

  • Center the variable by subtracting its mean before squaring. This often eliminates or greatly reduces the collinearity between \(X\) and \(X^2\).
  • Note: centering changes the interpretation of the intercept but not the slope of \(X\).
# Example: structural multicollinearity
brfss_dx <- brfss_dx |>
  mutate(
    bmi_sq = bmi^2,
    bmi_c = bmi - mean(bmi),
    bmi_c_sq = bmi_c^2
  )

# VIF before centering
mod_poly <- lm(menthlth_days ~ bmi + bmi_sq, data = brfss_dx)
cat("VIF before centering:\n")
## VIF before centering:
car::vif(mod_poly)
##      bmi   bmi_sq 
## 30.62497 30.62497
# VIF after centering
mod_poly_c <- lm(menthlth_days ~ bmi_c + bmi_c_sq, data = brfss_dx)
cat("\nVIF after centering:\n")
## 
## VIF after centering:
car::vif(mod_poly_c)
##    bmi_c bmi_c_sq 
## 1.447669 1.447669

For data-based multicollinearity:

  • Drop one of the correlated predictors (use subject-matter knowledge to decide which is more scientifically relevant)
  • Combine correlated variables into a composite (e.g., weight and height into BMI)
  • Use regularization methods (ridge regression, LASSO) in advanced settings
  • Be cautious about dropping variables solely to reduce standard errors, as this can introduce omitted variable bias (p-hacking)

7. Putting It All Together: A Diagnostic Workflow

A systematic approach to regression diagnostics:

Diagnostic Workflow Summary
Step Action Tools
1 Fit the model and examine coefficients lm(), tidy(), glance()
2 Check residuals vs. fitted plot (linearity + homoscedasticity) Residuals vs. fitted, Breusch-Pagan test
3 Check Q-Q plot and histogram (normality) Q-Q plot, histogram, Shapiro-Wilk
4 If assumptions violated, consider transformations and refit log(Y), sqrt(Y), Y^2
5 Compute VIF for multicollinearity car::vif()
6 Compute leverage, Cook’s distance, DFFITS, DFBETAS for influential obs hatvalues(), cooks.distance(), dffits(), dfbetas()
7 Compare model with and without influential observations; report both if different Compare coefficients, CIs

Key Takeaways

Topic Key point
LINE assumptions Linearity, Independence, Normality, Equal Variance must be checked
Residuals Jackknife residuals are preferred for outlier detection
Diagnostic plots Residuals vs. fitted (linearity + variance); Q-Q plot (normality)
Transformations \(\log(Y)\) for right skew; \(Y^2\) for left skew
Outlier testing Use Bonferroni-corrected jackknife residuals
Leverage Measures extremeness in predictor space; \(h_i > 2(p+1)/n\) or \(3(p+1)/n\)
Cook’s distance Combines residual and leverage; \(D_i > 4/n\) flags potential influence
DFFITS/DFBETAS Measure influence on fitted values and individual coefficients
Do not auto-delete Always investigate; report models with and without flagged observations
Multicollinearity VIF > 10 is strong collinearity; center variables for structural collinearity


Part 2: In-Class Lab Activity

EPI 553 — Regression Diagnostics Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will conduct a thorough diagnostic assessment of a regression model using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. The outcome for the lab is menthlth_days and you will build a model with a different set of predictors than the guided practice.

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
bmi Body mass index Continuous
exercise Any physical activity (Yes/No) Factor
income_cat Household income (1–8 ordinal) Numeric
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)

brfss_dx <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds"
)

Task 1: Fit a Model and Check Assumptions (20 points)

1a. (5 pts) Fit the model: menthlth_days ~ physhlth_days + sleep_hrs + age + sex + exercise + income_cat. Report the model coefficients, \(R^2\), and Adjusted \(R^2\).

1b. (5 pts) Create a residuals vs. fitted values plot using jackknife residuals. Include a loess smoothing line. Does the linearity assumption appear to hold? What about homoscedasticity?

1c. (5 pts) Create a normal Q-Q plot of the jackknife residuals. Does the normality assumption hold? Describe the specific pattern you observe (e.g., right skew, heavy tails, etc.).

1d. (5 pts) Run the Breusch-Pagan test for heteroscedasticity using car::ncvTest(). Report the test statistic and p-value. Given the large sample size, how should you weigh the formal test result against the visual evidence from your residual plot?


Task 2: Transformations (15 points)

2a. (5 pts) Based on your diagnostic plots in Task 1, apply an appropriate transformation to the outcome variable (e.g., \(\log(Y+1)\) or \(\sqrt{Y}\)). Refit the model and create a new Q-Q plot. Did the transformation improve normality?

2b. (5 pts) Create residuals vs. fitted plots for both the original and transformed models side by side. Compare the patterns.

2c. (5 pts) In 2–3 sentences, explain how the transformation changes the interpretation of the regression coefficients. Is the trade-off worth it?


Task 3: Outlier Detection (25 points)

3a. (5 pts) Compute jackknife residuals for the original (untransformed) model. What are the minimum and maximum jackknife residuals? How many observations have \(|r_{(-i)}| > 3\)?

3b. (10 pts) Perform a formal Bonferroni-corrected outlier test. Calculate the critical value manually using qt() and then verify using car::outlierTest(). Are any observations statistically significant outliers?

3c. (5 pts) For the top 5 observations with the largest absolute jackknife residuals, examine their predictor values. Is there anything unusual about these observations? Create a table showing their values.

3d. (5 pts) In 2–3 sentences, explain why it is important to use the Bonferroni correction when testing multiple residuals for outlier status.


Task 4: Leverage and Influence (25 points)

4a. (5 pts) Compute the leverage (\(h_i\)) for all observations. Report the average leverage. How many observations exceed the \(3\bar{h}\) threshold?

4b. (5 pts) Compute Cook’s distance for all observations. Using the \(4/n\) threshold, how many observations are potentially influential?

4c. (5 pts) Create a scatterplot of jackknife residuals vs. leverage with point size proportional to Cook’s distance. Identify any observations that are both high-leverage and have large residuals.

4d. (5 pts) Refit the model after removing all observations with Cook’s distance \(> 4/n\). Compare the coefficients with the full-data model in a single table. Do any coefficients change substantially?

4e. (5 pts) A colleague argues: “We should just remove all outliers and high-leverage points to get a cleaner model.” Write a 3–4 sentence response explaining why this approach is problematic.


Task 5: Multicollinearity Assessment (15 points)

5a. (5 pts) Compute the VIF for all predictors in your model. Present the results in a table with VIF, tolerance, and a flag column. Is there evidence of multicollinearity?

5b. (5 pts) Create a pairwise correlation matrix for the continuous predictors. Which pair has the strongest correlation?

5c. (5 pts) Suppose you wanted to add a quadratic term for BMI to the model (\(\text{bmi}^2\)). Compute the VIF for bmi and bmi_sq in this expanded model. Then center BMI and recompute. Show that centering reduces the VIF. Why does centering help?


End of Lab Activity