Introduction

After fitting a regression model, we must ask: does the model actually fit the data well, and are the underlying assumptions satisfied? A model that violates key assumptions may produce misleading coefficients, invalid standard errors, or unreliable predictions.

Regression diagnostics address three fundamental questions:

Are the regression assumptions satisfied? (Linearity, Independence, Normality, Equal Variance)
Is there multicollinearity among predictors? (Inflated standard errors, unstable coefficients)
Are there influential observations? (Outliers, high-leverage points that distort the fitted model)

This lecture provides the tools to diagnose each of these issues and, where possible, apply remedial measures. Diagnostics should be a routine part of every regression analysis, not an afterthought.

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset. For this lecture, we predict mentally unhealthy days from several health-related variables.

brfss_full <- read_xpt(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
  clean_names()

brfss_dx <- brfss_full |>
  mutate(
    # Outcome
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    # Predictors
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    )
  ) |>
  filter(
    !is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
    !is.na(exercise), !is.na(income_cat)
  )

set.seed(1220)
brfss_dx <- brfss_dx |>
  dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
                bmi, exercise, income_cat) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_dx,
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_dx), ncol(brfss_dx))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	8

Quick Descriptive Overview

brfss_dx |>
  tbl_summary(
    type = list(
      c(menthlth_days, physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}), [{min}, {max}]"
    )
  ) |>
  bold_labels()

Characteristic	N = 5,000¹
menthlth_days	4 (8), [0, 30]
physhlth_days	3 (8), [0, 30]
sleep_hrs	7.07 (1.36), [1.00, 14.00]
IMPUTED AGE VALUE COLLAPSED ABOVE 80	54 (17), [18, 80]
sex
Male	2,454 (49%)
Female	2,546 (51%)
bmi	28.3 (6.2), [13.2, 71.1]
exercise	3,965 (79%)
income_cat	6.17 (2.02), [1.00, 8.00]
¹ Mean (SD), [Min, Max]; n (%)

Part 1: Guided Practice — Regression Diagnostics

1. Regression Assumptions (LINE)

Before trusting any regression output, we need to verify that the underlying assumptions are met. The acronym LINE helps us remember the four key assumptions:

Assumption	Description	Diagnostic tool
Linearity	The relationship between predictors and outcome is linear	Residuals vs. fitted plot
Independence	Residual errors are independent of one another	Study design, Durbin-Watson test
Normality	Residual errors follow a normal distribution	Q-Q plot, histogram, Shapiro-Wilk test
Equal variance	Variance of residuals is constant across fitted values (homoscedasticity)	Residuals vs. fitted plot, Scale-Location plot, Breusch-Pagan test

Violations of these assumptions do not necessarily invalidate the analysis, but they may affect the reliability of confidence intervals, p-values, and prediction intervals. The central limit theorem provides some protection when sample sizes are large, particularly for the normality assumption.

1.1 Fitting the Working Model

Let us fit a multiple regression model predicting mentally unhealthy days from physical health days, sleep, age, sex, and BMI.

mod1 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
           data = brfss_dx)

tidy(mod1, conf.int = TRUE) |>
  kable(digits = 3, caption = "Working Model Coefficients") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Working Model Coefficients
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	9.912	0.775	12.793	0.000	8.393	11.431
physhlth_days	0.316	0.013	24.325	0.000	0.291	0.342
sleep_hrs	-0.676	0.075	-9.010	0.000	-0.823	-0.529
age	-0.079	0.006	-13.306	0.000	-0.090	-0.067
sexFemale	1.505	0.202	7.430	0.000	1.108	1.902
bmi	0.039	0.016	2.357	0.018	0.006	0.071

glance(mod1) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC, sigma) |>
  kable(digits = 3, caption = "Model Fit Summary") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Fit Summary
r.squared	adj.r.squared	AIC	BIC	sigma
0.16	0.159	33857.77	33903.39	7.142

2. Residuals: Types and Interpretation

Residuals are the difference between observed and fitted values. They are our primary tool for checking assumptions. However, not all residuals are created equal.

2.1 Raw Residuals

The simplest residual is the raw (ordinary) residual:

\[e_i = Y_i - \hat{Y}_i\]

Raw residuals are on the scale of the outcome variable. Their main limitation is that their variance is not constant across observations, because the variance of \(e_i\) depends on the leverage \(h_i\):

\[\text{Var}(e_i) = \sigma^2(1 - h_i)\]

Observations with higher leverage have smaller residual variance, making it harder to detect unusual observations using raw residuals alone.

# Extract raw residuals
brfss_dx <- brfss_dx |>
  mutate(
    raw_resid = residuals(mod1),
    fitted_val = fitted(mod1)
  )

# Summary of raw residuals
summary(brfss_dx$raw_resid)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -17.6288  -3.4591  -1.7307   0.0000   0.4225  30.8948

2.2 Standardized Residuals

Standardized residuals divide each raw residual by the estimated standard deviation of the residuals (\(\hat{\sigma}\)), also known as the root mean squared error (RMSE):

\[r_i^{std} = \frac{e_i}{\hat{\sigma}}\]

where \(\hat{\sigma} = \sqrt{MSE}\). While this puts residuals on a common scale, it does not account for the fact that different observations have different residual variances (depending on their leverage).

brfss_dx <- brfss_dx |>
  mutate(std_resid = rstandard(mod1))

# Observations with |standardized residual| > 3
brfss_dx |>
  filter(abs(std_resid) > 3) |>
  dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, std_resid) |>
  head(10) |>
  kable(digits = 3, caption = "Observations with Large Standardized Residuals") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Observations with Large Standardized Residuals
menthlth_days	physhlth_days	sleep_hrs	age	std_resid
30	0	6	64	3.958
30	0	7	69	4.057
30	0	5	25	3.169
30	0	6	52	3.812
30	10	7	43	3.140
30	0	8	80	4.328
30	5	10	21	3.332
30	0	8	65	3.921
30	0	6	25	3.526
30	0	5	29	3.216

2.3 Studentized (Internally Studentized) Residuals

Studentized residuals account for the varying variance of each residual by dividing by the observation-specific standard error:

\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}\]

This correction means that studentized residuals have approximately unit variance under the model assumptions. In R, rstandard() computes these internally studentized residuals.

2.4 Jackknife (Externally Studentized) Residuals

Jackknife residuals (also called externally studentized or deleted residuals) take the correction one step further. For observation \(i\), the residual is computed using a model fitted without observation \(i\):

\[r_{(-i)} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_i}}\]

where \(\hat{\sigma}_{(-i)}\) is the RMSE estimated from the model that excludes observation \(i\). Under the null hypothesis that observation \(i\) is not an outlier, \(r_{(-i)} \sim t_{n-p-2}\).

The jackknife residual is the most powerful residual for detecting outliers because it prevents the potentially outlying observation from influencing the estimate of \(\hat{\sigma}\).

brfss_dx <- brfss_dx |>
  mutate(jack_resid = rstudent(mod1))

# Compare residual types
brfss_dx |>
  dplyr::select(raw_resid, std_resid, jack_resid) |>
  pivot_longer(everything(), names_to = "type", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
  facet_wrap(~type, scales = "free_x",
             labeller = labeller(type = c(
               raw_resid = "Raw Residuals",
               std_resid = "Standardized Residuals",
               jack_resid = "Jackknife Residuals"
             ))) +
  labs(title = "Distribution of Three Residual Types",
       x = "Residual Value", y = "Frequency") +
  theme_minimal()

2.5 Summary of Residual Types

Residual	Formula	Distribution under \(H_0\)	Use
Raw (\(e_i\))	\(Y_i - \hat{Y}_i\)	Unequal variance	Quick check
Standardized	\(e_i / \hat{\sigma}\)	Approximately standard normal	Rough screening
Studentized	\(e_i / (\hat{\sigma}\sqrt{1-h_i})\)	Approximately standard normal	Better screening
Jackknife (\(r_{(-i)}\))	\(e_i / (\hat{\sigma}_{(-i)}\sqrt{1-h_i})\)	\(t_{n-p-2}\)	Formal outlier testing

3. Diagnostic Plots

3.1 Residuals vs. Fitted Values

This is the single most important diagnostic plot. It checks linearity and homoscedasticity simultaneously.

Linearity: If the model is correct, the residuals should scatter randomly around zero with no systematic pattern (no curve, no trend).
Equal variance: The vertical spread of the residuals should remain roughly constant as fitted values increase. A “funnel” shape (spread increasing or decreasing) indicates heteroscedasticity.

ggplot(brfss_dx, aes(x = fitted_val, y = jack_resid)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
  labs(title = "Jackknife Residuals vs. Fitted Values",
       x = "Fitted Values", y = "Jackknife Residuals") +
  theme_minimal()

Interpretation guide:

A flat loess line near zero suggests linearity holds.
A curved loess line suggests non-linearity (consider polynomial terms or transformations).
A funnel shape suggests heteroscedasticity (consider a variance-stabilizing transformation such as log or square root).

3.2 Normal Q-Q Plot

The Q-Q plot compares the distribution of the residuals to the theoretical normal distribution. If the normality assumption holds, points should fall approximately along the diagonal reference line.

ggplot(brfss_dx, aes(sample = jack_resid)) +
  stat_qq(alpha = 0.3, color = "steelblue") +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "Normal Q-Q Plot of Jackknife Residuals",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

Common Q-Q patterns and what they suggest:

Pattern	Interpretation	Potential remedy
Points follow the line	Normality satisfied	None needed
Right skew (upper tail curves upward)	Right-skewed residuals	Log or square root transformation of \(Y\)
Left skew (lower tail curves downward)	Left-skewed residuals	Square transformation of \(Y\)
Heavy tails (both ends deviate)	Heavy-tailed distribution	Robust regression or transformation
S-shape	Both skew and kurtosis issues	Consider a different model family

3.3 Histogram of Residuals

A histogram provides a complementary view to the Q-Q plot.

ggplot(brfss_dx, aes(x = jack_resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
                color = "red", linewidth = 1) +
  labs(title = "Histogram of Jackknife Residuals with Normal Curve",
       x = "Jackknife Residual", y = "Density") +
  theme_minimal()

3.4 Scale-Location Plot

The Scale-Location plot shows \(\sqrt{|\text{standardized residuals}|}\) vs. fitted values. It is specifically designed to assess homoscedasticity. A flat trend line indicates constant variance.

brfss_dx <- brfss_dx |>
  mutate(sqrt_abs_std = sqrt(abs(std_resid)))

ggplot(brfss_dx, aes(x = fitted_val, y = sqrt_abs_std)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal()

3.5 Base R Diagnostic Plots

R provides four built-in diagnostic plots through plot() on an lm object. These are quick but less customizable.

par(mfrow = c(2, 2))
plot(mod1)

par(mfrow = c(1, 1))

The four panels are:

Residuals vs. Fitted (linearity + homoscedasticity)
Normal Q-Q (normality)
Scale-Location (homoscedasticity)
Residuals vs. Leverage (influential observations via Cook’s distance contours)

3.6 Formal Tests for Assumptions

While visual diagnostics are essential, formal tests can supplement them, especially for borderline cases.

# Breusch-Pagan test for heteroscedasticity
bp_test <- car::ncvTest(mod1)
tibble(
  Test = "Breusch-Pagan",
  Statistic = bp_test$ChiSquare,
  df = bp_test$Df,
  p_value = bp_test$p
) |>
  kable(digits = 4, caption = "Breusch-Pagan Test for Heteroscedasticity") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Breusch-Pagan Test for Heteroscedasticity
Test	Statistic	df	p_value
Breusch-Pagan	1157.809	1	0

Note: With very large samples (n = 5000), formal tests are often overpowered and detect trivial departures from assumptions. In large samples, the visual diagnostic is typically more informative than the p-value. If the Q-Q plot looks reasonable and the residual plot has no alarming pattern, you may proceed even if a formal test rejects.

4. Remedial Measures: Transformations

When assumptions are violated, transforming the outcome variable is a common remedy.

Problem	Transformation	R code
Right-skewed residuals	\(\log(Y)\)	`lm(log(y) ~ x)`
Right-skewed residuals (mild)	\(\sqrt{Y}\)	`lm(sqrt(y) ~ x)`
Left-skewed residuals	\(Y^2\)	`lm(I(y^2) ~ x)`
Non-constant variance	\(\log(Y)\) or \(\sqrt{Y}\)	Depends on pattern

4.1 Trying a Transformation

If the Q-Q plot suggests right skew, we can try a log transformation. Since our outcome (menthlth_days) includes zeros, we use \(\log(Y + 1)\).

mod_log <- lm(log(menthlth_days + 1) ~ physhlth_days + sleep_hrs + age + sex + bmi,
              data = brfss_dx)

# Compare Q-Q plots side by side
par(mfrow = c(1, 2))

qqnorm(rstudent(mod1), main = "Q-Q: Original Scale")
qqline(rstudent(mod1), col = "red")

qqnorm(rstudent(mod_log), main = "Q-Q: Log(Y+1) Scale")
qqline(rstudent(mod_log), col = "red")

par(mfrow = c(1, 1))

Important: Transforming the outcome changes the interpretation of the coefficients. With \(\log(Y+1)\), a one-unit increase in \(X\) is associated with a multiplicative change in \(Y+1\), not an additive change.

5. Outliers and Influential Observations

An outlier is an observation with a residual that is much larger than the rest. An influential observation is one whose removal substantially changes the fitted model (coefficients, predictions, or both).

Not all outliers are influential, and not all influential points are outliers. Influence depends on two components:

Residual size (the distance of \(Y_i\) from the regression surface)
Leverage (the distance of \(X_i\) from the center of the predictor space)

5.1 Detecting Outliers via Jackknife Residuals

We test whether any jackknife residual is statistically significant using a Bonferroni correction to account for multiple testing.

For \(n\) observations and \(p\) predictors:

\(H_0\): observation \(i\) is not an outlier
Under \(H_0\): \(r_{(-i)} \sim t_{n-p-2}\)
Bonferroni-corrected significance level: \(\alpha / n\)
Observation \(i\) is an outlier if \(|r_{(-i)}| > t_{n-p-2,\; 1-\alpha/(2n)}\)

n <- nrow(brfss_dx)
p <- length(coef(mod1)) - 1  # number of predictors (excluding intercept)
alpha <- 0.05

# Critical value with Bonferroni correction
crit_val <- qt(1 - alpha / (2 * n), df = n - p - 2)

cat("Sample size:", n, "\n")

## Sample size: 5000

cat("Number of predictors:", p, "\n")

## Number of predictors: 5

cat("Bonferroni critical value:", round(crit_val, 3), "\n\n")

## Bonferroni critical value: 4.422

# Identify outliers
outlier_idx <- which(abs(brfss_dx$jack_resid) > crit_val)
cat("Number of jackknife residual outliers:", length(outlier_idx), "\n")

## Number of jackknife residual outliers: 0

if (length(outlier_idx) > 0) {
  brfss_dx |>
    slice(outlier_idx) |>
    dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, jack_resid) |>
    kable(digits = 3, caption = "Outliers by Jackknife Residual (Bonferroni-corrected)") |>
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
}

The car::outlierTest() function automates this Bonferroni-corrected test:

car::outlierTest(mod1)

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferroni p
## 190  4.33569         1.4816e-05     0.074082

5.2 Leverage

Leverage (\(h_i\)) measures how far an observation’s predictor values are from the mean of all predictors. High-leverage observations have unusual predictor values and exert disproportionate influence on the fitted line.

The leverage for observation \(i\) is:

\[h_i = \frac{1}{n} + \sum_{j=1}^{p} \frac{(X_{ij} - \bar{X}_j)^2}{(n-1)s_j^2}\]

Key properties of leverage:

\(0 \leq h_i \leq 1\)
\(\sum_{i=1}^{n} h_i = p + 1\)
Average leverage: \(\bar{h} = (p+1)/n\)

Rules of thumb for high leverage:

Rule	Threshold	Reference
\(h_i > 2\bar{h}\)	\(2(p+1)/n\)	Hoaglin and Welsch (1978)
\(h_i > 3\bar{h}\)	\(3(p+1)/n\)	Velleman and Welsch (1981)
\(h_i > 0.5\)	0.5	Huber (1981) — clearly large
\(h_i > 0.2\)	0.2	Huber (1981) — worth examining

brfss_dx <- brfss_dx |>
  mutate(leverage = hatvalues(mod1))

h_bar <- (p + 1) / n

cat("Average leverage:", round(h_bar, 5), "\n")

## Average leverage: 0.0012

cat("2x average leverage threshold:", round(2 * h_bar, 5), "\n")

## 2x average leverage threshold: 0.0024

cat("3x average leverage threshold:", round(3 * h_bar, 5), "\n\n")

## 3x average leverage threshold: 0.0036

# High leverage points (using 3x threshold)
high_lev <- brfss_dx |>
  filter(leverage > 3 * h_bar)

cat("Number of high-leverage points (> 3x average):", nrow(high_lev), "\n")

## Number of high-leverage points (> 3x average): 157

ggplot(brfss_dx, aes(x = seq_len(n), y = leverage)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 2 * h_bar, color = "orange", linetype = "dashed",
             linewidth = 0.8) +
  geom_hline(yintercept = 3 * h_bar, color = "red", linetype = "dashed",
             linewidth = 0.8) +
  annotate("text", x = n * 0.9, y = 2 * h_bar + 0.0005, label = "2x average",
           color = "orange") +
  annotate("text", x = n * 0.9, y = 3 * h_bar + 0.0005, label = "3x average",
           color = "red") +
  labs(title = "Leverage Values by Observation Index",
       x = "Observation Index", y = "Leverage (h_i)") +
  theme_minimal()

5.3 Cook’s Distance

Cook’s distance combines residual size and leverage into a single measure of overall influence. It quantifies how much all fitted values change when observation \(i\) is removed:

\[D_i = \frac{r_i^2}{p+1} \cdot \frac{h_i}{1 - h_i}\]

where \(r_i\) is the internally studentized residual. A large Cook’s distance indicates that the observation substantially influences the regression surface.

Rules of thumb for Cook’s distance:

Threshold	Interpretation
\(D_i > 4/n\)	Common rule for flagging potentially influential points
\(D_i > 1\)	Clearly influential (some textbooks)
\(D_i > F_{0.50, p+1, n-p-1}\)	Compares to the median of the \(F\)-distribution

brfss_dx <- brfss_dx |>
  mutate(cooks_d = cooks.distance(mod1))

# Threshold
cooks_threshold <- 4 / n

cat("Cook's distance threshold (4/n):", round(cooks_threshold, 5), "\n")

## Cook's distance threshold (4/n): 8e-04

influential_obs <- brfss_dx |>
  filter(cooks_d > cooks_threshold)

cat("Number of influential points (D > 4/n):", nrow(influential_obs), "\n")

## Number of influential points (D > 4/n): 504

ggplot(brfss_dx, aes(x = seq_len(n), y = cooks_d)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = cooks_threshold, color = "red",
             linetype = "dashed", linewidth = 0.8) +
  labs(title = "Cook's Distance by Observation Index",
       subtitle = paste0("Red line: 4/n = ", round(cooks_threshold, 5)),
       x = "Observation Index", y = "Cook's Distance") +
  theme_minimal()

5.4 DFFITS and DFBETAS

Two additional influence measures complement Cook’s distance:

DFFITS measures the influence of observation \(i\) on its own fitted value:

\[\text{DFFITS}_i = r_{(-i)} \sqrt{\frac{h_i}{1-h_i}}\]

Flagging threshold: \(|\text{DFFITS}_i| > 2\sqrt{(p+1)/n}\)

DFBETAS measures the influence of observation \(i\) on each individual regression coefficient \(\hat{\beta}_j\):

\[\text{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{\hat{\sigma}_{(-i)} \sqrt{(X^TX)^{-1}_{jj}}}\]

Flagging threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\)

brfss_dx <- brfss_dx |>
  mutate(dffits_val = dffits(mod1))

dffits_threshold <- 2 * sqrt((p + 1) / n)
dfbetas_threshold <- 2 / sqrt(n)

cat("DFFITS threshold:", round(dffits_threshold, 4), "\n")

## DFFITS threshold: 0.0693

cat("DFBETAS threshold:", round(dfbetas_threshold, 4), "\n\n")

## DFBETAS threshold: 0.0283

# Number exceeding DFFITS threshold
cat("Observations with |DFFITS| > threshold:",
    sum(abs(brfss_dx$dffits_val) > dffits_threshold), "\n")

## Observations with |DFFITS| > threshold: 504

# DFBETAS for each coefficient
dfb <- dfbetas(mod1)
cat("\nObservations exceeding DFBETAS threshold per coefficient:\n")

## 
## Observations exceeding DFBETAS threshold per coefficient:

tibble(
  Coefficient = colnames(dfb),
  n_flagged = map_int(seq_len(ncol(dfb)), \(j) sum(abs(dfb[, j]) > dfbetas_threshold))
) |>
  kable(caption = "DFBETAS Flags by Coefficient") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

DFBETAS Flags by Coefficient
Coefficient	n_flagged
(Intercept)	333
physhlth_days	397
sleep_hrs	349
age	263
sexFemale	350
bmi	294

5.5 Residuals vs. Leverage Plot

This plot combines residuals and leverage, with Cook’s distance contours overlaid. Points in the upper-right or lower-right corners are both outlying and high-leverage, making them potentially influential.

ggplot(brfss_dx, aes(x = leverage, y = jack_resid)) +
  geom_point(aes(size = cooks_d), alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 3 * h_bar, color = "red", linetype = "dashed") +
  scale_size_continuous(name = "Cook's D", range = c(0.5, 4)) +
  labs(title = "Jackknife Residuals vs. Leverage",
       subtitle = "Point size proportional to Cook's distance",
       x = "Leverage", y = "Jackknife Residual") +
  theme_minimal()

5.6 What To Do with Outliers and Influential Points

When you identify outliers or influential observations:

Investigate the observations. Are they data entry errors? Impossible values? Special subpopulations?
Fit the model with and without the flagged observations.
Compare the results. If the coefficients change substantially, report both sets of results.
Do NOT automatically delete observations just because they have large residuals or leverage. Only remove observations if there is a scientific or data-quality justification.

# Compare model with and without high-influence points
influential_idx <- which(brfss_dx$cooks_d > cooks_threshold)
brfss_no_infl <- brfss_dx |> slice(-influential_idx)

mod_no_infl <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
                  data = brfss_no_infl)

# Side-by-side comparison
bind_rows(
  tidy(mod1) |> mutate(Model = "Full data"),
  tidy(mod_no_infl) |> mutate(Model = "Without influential obs")
) |>
  dplyr::select(Model, term, estimate, std.error, p.value) |>
  pivot_wider(
    names_from = Model,
    values_from = c(estimate, std.error, p.value),
    names_glue = "{Model}_{.value}"
  ) |>
  kable(digits = 4, caption = "Coefficient Comparison: Full Data vs. Influential Points Removed") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE) |>
  scroll_box(width = "100%")

Coefficient Comparison: Full Data vs. Influential Points Removed
term	Full data_estimate	Without influential obs_estimate	Full data_std.error	Without influential obs_std.error	Full data_p.value	Without influential obs_p.value
(Intercept)	9.9123	5.3267	0.7748	0.4542	0.0000	0.0000
physhlth_days	0.3162	0.2806	0.0130	0.0109	0.0000	0.0000
sleep_hrs	-0.6763	-0.3095	0.0751	0.0450	0.0000	0.0000
age	-0.0789	-0.0567	0.0059	0.0033	0.0000	0.0000
sexFemale	1.5046	1.3468	0.2025	0.1125	0.0000	0.0000
bmi	0.0385	0.0285	0.0163	0.0095	0.0185	0.0026

6. Multicollinearity

Multicollinearity occurs when two or more predictors in a regression model are moderately or highly correlated. This creates instability in the coefficient estimates: the standard errors inflate, individual predictors may appear non-significant even when the overall model is significant, and estimated coefficients may change dramatically when variables are added or removed.

6.1 Types of Multicollinearity

Type	Cause	Example
Structural	Mathematical artifact from creating new variables	Including both \(X\) and \(X^2\); or age and age-squared
Data-based	Inherent correlation in the data	Income and education; systolic and diastolic blood pressure

6.2 Symptoms of Multicollinearity

How do you know multicollinearity is present?

Estimated \(\beta\) values change wildly when predictors are added or removed
Individual predictor t-tests are non-significant, but the overall F-test is significant
Pairwise correlations among predictors are large (though this misses multicollinearity involving linear combinations of several variables)
Variance Inflation Factors (VIF) are large

6.3 Variance Inflation Factor (VIF)

The VIF for predictor \(j\) is:

\[VIF_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is the \(R^2\) obtained from regressing \(X_j\) on all other predictors. The VIF tells you how much the variance of \(\hat{\beta}_j\) is inflated by collinearity relative to what it would be if \(X_j\) were uncorrelated with the other predictors.

An equivalent measure is the tolerance:

\[\text{tolerance}_j = \frac{1}{VIF_j} = 1 - R_j^2\]

Rules of thumb:

VIF value	Interpretation
\(VIF_j = 1\)	No correlation with other predictors
\(VIF_j > 4\)	Needs further investigation
\(VIF_j > 10\)	Strong collinearity, action likely needed

vif_values <- car::vif(mod1)

tibble(
  Predictor = names(vif_values),
  VIF = round(vif_values, 3),
  Tolerance = round(1 / vif_values, 3)
) |>
  mutate(Flag = case_when(
    VIF > 10 ~ "Strong collinearity",
    VIF > 4  ~ "Investigate",
    TRUE     ~ "OK"
  )) |>
  kable(caption = "Variance Inflation Factors") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Variance Inflation Factors
Predictor	VIF	Tolerance	Flag
physhlth_days	1.030	0.971	OK
sleep_hrs	1.028	0.973	OK
age	1.029	0.971	OK
sex	1.004	0.996	OK
bmi	1.020	0.980	OK

6.4 Correlation Matrix for Context

While pairwise correlations do not capture all forms of multicollinearity (a variable can be correlated with a linear combination of other variables without being strongly correlated with any single one), they are a useful starting point.

brfss_dx |>
  dplyr::select(physhlth_days, sleep_hrs, age, bmi) |>
  cor(use = "complete.obs") |>
  round(3) |>
  kable(caption = "Pairwise Correlation Matrix (Continuous Predictors)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Pairwise Correlation Matrix (Continuous Predictors)
	physhlth_days	sleep_hrs	age	bmi
physhlth_days	1.000	-0.075	0.092	0.120
sleep_hrs	-0.075	1.000	0.124	-0.073
age	0.092	0.124	1.000	0.011
bmi	0.120	-0.073	0.011	1.000

6.5 Remedies for Multicollinearity

The appropriate remedy depends on the type of multicollinearity:

For structural multicollinearity (e.g., \(X\) and \(X^2\)):

Center the variable by subtracting its mean before squaring. This often eliminates or greatly reduces the collinearity between \(X\) and \(X^2\).
Note: centering changes the interpretation of the intercept but not the slope of \(X\).

# Example: structural multicollinearity
brfss_dx <- brfss_dx |>
  mutate(
    bmi_sq = bmi^2,
    bmi_c = bmi - mean(bmi),
    bmi_c_sq = bmi_c^2
  )

# VIF before centering
mod_poly <- lm(menthlth_days ~ bmi + bmi_sq, data = brfss_dx)
cat("VIF before centering:\n")

## VIF before centering:

car::vif(mod_poly)

##      bmi   bmi_sq 
## 30.62497 30.62497

# VIF after centering
mod_poly_c <- lm(menthlth_days ~ bmi_c + bmi_c_sq, data = brfss_dx)
cat("\nVIF after centering:\n")

## 
## VIF after centering:

car::vif(mod_poly_c)

##    bmi_c bmi_c_sq 
## 1.447669 1.447669

For data-based multicollinearity:

Drop one of the correlated predictors (use subject-matter knowledge to decide which is more scientifically relevant)
Combine correlated variables into a composite (e.g., weight and height into BMI)
Use regularization methods (ridge regression, LASSO) in advanced settings
Be cautious about dropping variables solely to reduce standard errors, as this can introduce omitted variable bias (p-hacking)

7. Putting It All Together: A Diagnostic Workflow

A systematic approach to regression diagnostics:

Diagnostic Workflow Summary
Step	Action	Tools
1	Fit the model and examine coefficients	lm(), tidy(), glance()
2	Check residuals vs. fitted plot (linearity + homoscedasticity)	Residuals vs. fitted, Breusch-Pagan test
3	Check Q-Q plot and histogram (normality)	Q-Q plot, histogram, Shapiro-Wilk
4	If assumptions violated, consider transformations and refit	log(Y), sqrt(Y), Y^2
5	Compute VIF for multicollinearity	car::vif()
6	Compute leverage, Cook’s distance, DFFITS, DFBETAS for influential obs	hatvalues(), cooks.distance(), dffits(), dfbetas()
7	Compare model with and without influential observations; report both if different	Compare coefficients, CIs

Key Takeaways

Topic	Key point
LINE assumptions	Linearity, Independence, Normality, Equal Variance must be checked
Residuals	Jackknife residuals are preferred for outlier detection
Diagnostic plots	Residuals vs. fitted (linearity + variance); Q-Q plot (normality)
Transformations	\(\log(Y)\) for right skew; \(Y^2\) for left skew
Outlier testing	Use Bonferroni-corrected jackknife residuals
Leverage	Measures extremeness in predictor space; \(h_i > 2(p+1)/n\) or \(3(p+1)/n\)
Cook’s distance	Combines residual and leverage; \(D_i > 4/n\) flags potential influence
DFFITS/DFBETAS	Measure influence on fitted values and individual coefficients
Do not auto-delete	Always investigate; report models with and without flagged observations
Multicollinearity	VIF > 10 is strong collinearity; center variables for structural collinearity

Part 2: In-Class Lab Activity

EPI 553 — Regression Diagnostics Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will conduct a thorough diagnostic assessment of a regression model using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. The outcome for the lab is menthlth_days and you will build a model with a different set of predictors than the guided practice.

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`bmi`	Body mass index	Continuous
`exercise`	Any physical activity (Yes/No)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)

brfss_dx <- readRDS(
  "/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds"
)

Task 1: Fit a Model and Check Assumptions (20 points)

1a. (5 pts) Fit the model: menthlth_days ~ physhlth_days + sleep_hrs + age + sex + exercise + income_cat. Report the model coefficients, \(R^2\), and Adjusted \(R^2\).

1b. (5 pts) Create a residuals vs. fitted values plot using jackknife residuals. Include a loess smoothing line. Does the linearity assumption appear to hold? What about homoscedasticity?

1c. (5 pts) Create a normal Q-Q plot of the jackknife residuals. Does the normality assumption hold? Describe the specific pattern you observe (e.g., right skew, heavy tails, etc.).

1d. (5 pts) Run the Breusch-Pagan test for heteroscedasticity using car::ncvTest(). Report the test statistic and p-value. Given the large sample size, how should you weigh the formal test result against the visual evidence from your residual plot?

Task 2: Transformations (15 points)

2a. (5 pts) Based on your diagnostic plots in Task 1, apply an appropriate transformation to the outcome variable (e.g., \(\log(Y+1)\) or \(\sqrt{Y}\)). Refit the model and create a new Q-Q plot. Did the transformation improve normality?

2b. (5 pts) Create residuals vs. fitted plots for both the original and transformed models side by side. Compare the patterns.

2c. (5 pts) In 2–3 sentences, explain how the transformation changes the interpretation of the regression coefficients. Is the trade-off worth it?

Task 3: Outlier Detection (25 points)

3a. (5 pts) Compute jackknife residuals for the original (untransformed) model. What are the minimum and maximum jackknife residuals? How many observations have \(|r_{(-i)}| > 3\)?

3b. (10 pts) Perform a formal Bonferroni-corrected outlier test. Calculate the critical value manually using qt() and then verify using car::outlierTest(). Are any observations statistically significant outliers?

3c. (5 pts) For the top 5 observations with the largest absolute jackknife residuals, examine their predictor values. Is there anything unusual about these observations? Create a table showing their values.

3d. (5 pts) In 2–3 sentences, explain why it is important to use the Bonferroni correction when testing multiple residuals for outlier status.

Task 4: Leverage and Influence (25 points)

4a. (5 pts) Compute the leverage (\(h_i\)) for all observations. Report the average leverage. How many observations exceed the \(3\bar{h}\) threshold?

4b. (5 pts) Compute Cook’s distance for all observations. Using the \(4/n\) threshold, how many observations are potentially influential?

4c. (5 pts) Create a scatterplot of jackknife residuals vs. leverage with point size proportional to Cook’s distance. Identify any observations that are both high-leverage and have large residuals.

4d. (5 pts) Refit the model after removing all observations with Cook’s distance \(> 4/n\). Compare the coefficients with the full-data model in a single table. Do any coefficients change substantially?

4e. (5 pts) A colleague argues: “We should just remove all outliers and high-leverage points to get a cleaner model.” Write a 3–4 sentence response explaining why this approach is problematic.

Task 5: Multicollinearity Assessment (15 points)

5a. (5 pts) Compute the VIF for all predictors in your model. Present the results in a table with VIF, tolerance, and a flag column. Is there evidence of multicollinearity?

5b. (5 pts) Create a pairwise correlation matrix for the continuous predictors. Which pair has the strongest correlation?

5c. (5 pts) Suppose you wanted to add a quadratic term for BMI to the model (\(\text{bmi}^2\)). Compute the VIF for bmi and bmi_sq in this expanded model. Then center BMI and recompute. Show that centering reduces the VIF. Why does centering help?

End of Lab Activity

Regression Diagnostics

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 24, 2026