After fitting a regression model, we must ask: does the model actually fit the data well, and are the underlying assumptions satisfied? A model that violates key assumptions may produce misleading coefficients, invalid standard errors, or unreliable predictions.
Regression diagnostics address three fundamental questions:
This lecture provides the tools to diagnose each of these issues and, where possible, apply remedial measures. Diagnostics should be a routine part of every regression analysis, not an afterthought.
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue with the BRFSS 2020 dataset. For this lecture, we predict mentally unhealthy days from several health-related variables.
brfss_full <- read_xpt(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
clean_names()brfss_dx <- brfss_full |>
mutate(
# Outcome
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Predictors
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
)
) |>
filter(
!is.na(menthlth_days), !is.na(physhlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(sex), !is.na(bmi),
!is.na(exercise), !is.na(income_cat)
)
set.seed(1220)
brfss_dx <- brfss_dx |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, sex,
bmi, exercise, income_cat) |>
slice_sample(n = 5000)
# Save for lab
saveRDS(brfss_dx,
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_dx), ncol(brfss_dx))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 8 |
brfss_dx |>
tbl_summary(
type = list(
c(menthlth_days, physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd}), [{min}, {max}]"
)
) |>
bold_labels()| Characteristic | N = 5,0001 |
|---|---|
| menthlth_days | 4 (8), [0, 30] |
| physhlth_days | 3 (8), [0, 30] |
| sleep_hrs | 7.07 (1.36), [1.00, 14.00] |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 54 (17), [18, 80] |
| sex | |
| Male | 2,454 (49%) |
| Female | 2,546 (51%) |
| bmi | 28.3 (6.2), [13.2, 71.1] |
| exercise | 3,965 (79%) |
| income_cat | 6.17 (2.02), [1.00, 8.00] |
| 1 Mean (SD), [Min, Max]; n (%) | |
Interpretation: The outcome variable, mentally unhealthy days, has a mean of 3.8 (SD = 7.8) with a range from 0 to 30, indicating a heavily right-skewed distribution where most respondents report few or no mentally unhealthy days. This skewness will be important when we assess the normality assumption later. Physical health days shows a similar pattern (mean = 3.4). Sleep averages 7.1 hours, age averages 53.8 years, and BMI averages 28.3 (overweight range). The sample is 54% female, and 79% reported exercising in the past 30 days.
Before trusting any regression output, we need to verify that the underlying assumptions are met. The acronym LINE helps us remember the four key assumptions:
| Assumption | Description | Diagnostic tool |
|---|---|---|
| Linearity | The relationship between predictors and outcome is linear | Residuals vs. fitted plot |
| Independence | Residual errors are independent of one another | Study design, Durbin-Watson test |
| Normality | Residual errors follow a normal distribution | Q-Q plot, histogram, Shapiro-Wilk test |
| Equal variance | Variance of residuals is constant across fitted values (homoscedasticity) | Residuals vs. fitted plot, Scale-Location plot, Breusch-Pagan test |
Violations of these assumptions do not necessarily invalidate the analysis, but they may affect the reliability of confidence intervals, p-values, and prediction intervals. The central limit theorem provides some protection when sample sizes are large, particularly for the normality assumption.
Let us fit a multiple regression model predicting mentally unhealthy days from physical health days, sleep, age, sex, and BMI.
mod1 <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_dx)
tidy(mod1, conf.int = TRUE) |>
kable(digits = 3, caption = "Working Model Coefficients") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 9.912 | 0.775 | 12.793 | 0.000 | 8.393 | 11.431 |
| physhlth_days | 0.316 | 0.013 | 24.325 | 0.000 | 0.291 | 0.342 |
| sleep_hrs | -0.676 | 0.075 | -9.010 | 0.000 | -0.823 | -0.529 |
| age | -0.079 | 0.006 | -13.306 | 0.000 | -0.090 | -0.067 |
| sexFemale | 1.505 | 0.202 | 7.430 | 0.000 | 1.108 | 1.902 |
| bmi | 0.039 | 0.016 | 2.357 | 0.018 | 0.006 | 0.071 |
glance(mod1) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC, sigma) |>
kable(digits = 3, caption = "Model Fit Summary") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | AIC | BIC | sigma |
|---|---|---|---|---|
| 0.16 | 0.159 | 33857.77 | 33903.39 | 7.142 |
Interpretation: The working model explains only about 16% of the variance in mentally unhealthy days (R² = 0.160, Adjusted R² = 0.159), which is modest but not unusual for behavioral health outcomes measured at the individual level. All predictors are statistically significant. Physical health days has the strongest association: each additional physically unhealthy day is associated with 0.32 more mentally unhealthy days. Sleep is protective, with each additional hour associated with 0.68 fewer mentally unhealthy days. Age is negatively associated (older adults report fewer mentally unhealthy days, -0.08 per year), females report 1.5 more mentally unhealthy days than males, and BMI has a small positive association (0.04 per unit). Despite the low R², this model provides a useful working example for diagnostics because the moderate fit and diverse predictor types create opportunities to observe assumption violations.
Residuals are the difference between observed and fitted values. They are our primary tool for checking assumptions. However, not all residuals are created equal.
The simplest residual is the raw (ordinary) residual:
\[e_i = Y_i - \hat{Y}_i\]
Raw residuals are on the scale of the outcome variable. Their main limitation is that their variance is not constant across observations, because the variance of \(e_i\) depends on the leverage \(h_i\):
\[\text{Var}(e_i) = \sigma^2(1 - h_i)\]
Observations with higher leverage have smaller residual variance, making it harder to detect unusual observations using raw residuals alone.
# Extract raw residuals
brfss_dx <- brfss_dx |>
mutate(
raw_resid = residuals(mod1),
fitted_val = fitted(mod1)
)
# Summary of raw residuals
summary(brfss_dx$raw_resid)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17.6288 -3.4591 -1.7307 0.0000 0.4225 30.8948
Interpretation: The raw residuals range from approximately -17.63 to 30.89, reflecting the bounded nature of the outcome (0 to 30 days). The distribution is right-skewed: the median residual is negative (most people report few mentally unhealthy days, so the model overpredicts for them), while the maximum residual is much larger in absolute value than the minimum. This asymmetry foreshadows the normality violations we will see in the diagnostic plots.
Standardized residuals divide each raw residual by the estimated standard deviation of the residuals (\(\hat{\sigma}\)), also known as the root mean squared error (RMSE):
\[r_i^{std} = \frac{e_i}{\hat{\sigma}}\]
where \(\hat{\sigma} = \sqrt{MSE}\). While this puts residuals on a common scale, it does not account for the fact that different observations have different residual variances (depending on their leverage).
brfss_dx <- brfss_dx |>
mutate(std_resid = rstandard(mod1))
# Observations with |standardized residual| > 3
brfss_dx |>
filter(abs(std_resid) > 3) |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, std_resid) |>
head(10) |>
kable(digits = 3, caption = "Observations with Large Standardized Residuals") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| menthlth_days | physhlth_days | sleep_hrs | age | std_resid |
|---|---|---|---|---|
| 30 | 0 | 6 | 64 | 3.958 |
| 30 | 0 | 7 | 69 | 4.057 |
| 30 | 0 | 5 | 25 | 3.169 |
| 30 | 0 | 6 | 52 | 3.812 |
| 30 | 10 | 7 | 43 | 3.140 |
| 30 | 0 | 8 | 80 | 4.328 |
| 30 | 5 | 10 | 21 | 3.332 |
| 30 | 0 | 8 | 65 | 3.921 |
| 30 | 0 | 6 | 25 | 3.526 |
| 30 | 0 | 5 | 29 | 3.216 |
Interpretation: Using the conventional threshold of |standardized residual| > 3, we identify observations with unusually large prediction errors. These individuals typically have high values of mentally unhealthy days (often 30 days) despite having predictor profiles that the model expects to have low values (e.g., young age, adequate sleep). However, because standardized residuals do not account for each observation’s leverage, they may miss some outliers and over-flag others. We will use jackknife residuals for more reliable outlier detection.
Studentized residuals account for the varying variance of each residual by dividing by the observation-specific standard error:
\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}\]
This correction means that studentized residuals have approximately
unit variance under the model assumptions. In R,
rstandard() computes these internally studentized
residuals.
Jackknife residuals (also called externally studentized or deleted residuals) take the correction one step further. For observation \(i\), the residual is computed using a model fitted without observation \(i\):
\[r_{(-i)} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_i}}\]
where \(\hat{\sigma}_{(-i)}\) is the RMSE estimated from the model that excludes observation \(i\). Under the null hypothesis that observation \(i\) is not an outlier, \(r_{(-i)} \sim t_{n-p-2}\).
The jackknife residual is the most powerful residual for detecting outliers because it prevents the potentially outlying observation from influencing the estimate of \(\hat{\sigma}\).
brfss_dx <- brfss_dx |>
mutate(jack_resid = rstudent(mod1))
# Compare residual types
brfss_dx |>
dplyr::select(raw_resid, std_resid, jack_resid) |>
pivot_longer(everything(), names_to = "type", values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
facet_wrap(~type, scales = "free_x",
labeller = labeller(type = c(
raw_resid = "Raw Residuals",
std_resid = "Standardized Residuals",
jack_resid = "Jackknife Residuals"
))) +
labs(title = "Distribution of Three Residual Types",
x = "Residual Value", y = "Frequency") +
theme_minimal()Interpretation: The three histogram panels show that all three residual types have similar shapes: a pronounced right skew with a long positive tail. The standardized and jackknife residual distributions are nearly identical (as expected with n = 5,000, where leverage differences are small). The right skew in the residuals mirrors the right skew in the outcome variable and indicates a violation of the normality assumption. Note that the most extreme residuals exceed +4, while the minimum is around -2 to -3, confirming the asymmetry.
| Residual | Formula | Distribution under \(H_0\) | Use |
|---|---|---|---|
| Raw (\(e_i\)) | \(Y_i - \hat{Y}_i\) | Unequal variance | Quick check |
| Standardized | \(e_i / \hat{\sigma}\) | Approximately standard normal | Rough screening |
| Studentized | \(e_i / (\hat{\sigma}\sqrt{1-h_i})\) | Approximately standard normal | Better screening |
| Jackknife (\(r_{(-i)}\)) | \(e_i / (\hat{\sigma}_{(-i)}\sqrt{1-h_i})\) | \(t_{n-p-2}\) | Formal outlier testing |
This is the single most important diagnostic plot. It checks linearity and homoscedasticity simultaneously.
ggplot(brfss_dx, aes(x = fitted_val, y = jack_resid)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
labs(title = "Jackknife Residuals vs. Fitted Values",
x = "Fitted Values", y = "Jackknife Residuals") +
theme_minimal()What we observe in our data: The residuals vs. fitted plot reveals two clear problems. First, there is a non-linear pattern: the loess line curves upward at higher fitted values rather than remaining flat at zero, suggesting the linear model does not fully capture the relationship. Second, the plot shows strong heteroscedasticity: the vertical spread of residuals increases dramatically as fitted values increase (a classic “funnel” or “megaphone” shape). At low fitted values (left side), residuals cluster tightly, while at higher fitted values (right side), residuals spread widely. This pattern is characteristic of count-like or bounded outcomes where the variance increases with the mean. Both issues suggest we may need to consider a transformation of the outcome variable.
Interpretation guide:
The Q-Q plot compares the distribution of the residuals to the theoretical normal distribution. If the normality assumption holds, points should fall approximately along the diagonal reference line.
ggplot(brfss_dx, aes(sample = jack_resid)) +
stat_qq(alpha = 0.3, color = "steelblue") +
stat_qq_line(color = "red", linewidth = 1) +
labs(title = "Normal Q-Q Plot of Jackknife Residuals",
x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()What we observe in our data: The Q-Q plot shows a severe departure from normality. The points follow the reference line reasonably well in the center but deviate dramatically in the upper tail, curving sharply upward. This indicates strong right skew in the residuals: the model produces more large positive residuals than a normal distribution would predict. This pattern is consistent with the skewed outcome distribution (most people report 0 mentally unhealthy days, creating a floor effect). The lower tail also deviates slightly, with a step-like pattern reflecting the discrete, bounded nature of the outcome.
Common Q-Q patterns and what they suggest:
| Pattern | Interpretation | Potential remedy |
|---|---|---|
| Points follow the line | Normality satisfied | None needed |
| Right skew (upper tail curves upward) | Right-skewed residuals | Log or square root transformation of \(Y\) |
| Left skew (lower tail curves downward) | Left-skewed residuals | Square transformation of \(Y\) |
| Heavy tails (both ends deviate) | Heavy-tailed distribution | Robust regression or transformation |
| S-shape | Both skew and kurtosis issues | Consider a different model family |
A histogram provides a complementary view to the Q-Q plot.
ggplot(brfss_dx, aes(x = jack_resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 50,
fill = "steelblue", color = "white", alpha = 0.8) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
color = "red", linewidth = 1) +
labs(title = "Histogram of Jackknife Residuals with Normal Curve",
x = "Jackknife Residual", y = "Density") +
theme_minimal()Interpretation: The histogram confirms the Q-Q plot findings. The residual distribution has a sharp peak near zero (reflecting the many respondents with few mentally unhealthy days) and a long right tail extending to +4 standard deviations. The overlaid normal curve (red line) shows how much the actual distribution departs from normality: the observed distribution is much more peaked (leptokurtic) and right-skewed than a normal distribution. In a large sample like ours (n = 5,000), the central limit theorem provides some protection for inference on the coefficients, but prediction intervals would be unreliable under this degree of non-normality.
The Scale-Location plot shows \(\sqrt{|\text{standardized residuals}|}\) vs. fitted values. It is specifically designed to assess homoscedasticity. A flat trend line indicates constant variance.
brfss_dx <- brfss_dx |>
mutate(sqrt_abs_std = sqrt(abs(std_resid)))
ggplot(brfss_dx, aes(x = fitted_val, y = sqrt_abs_std)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linewidth = 1) +
labs(title = "Scale-Location Plot",
x = "Fitted Values",
y = expression(sqrt("|Standardized Residuals|"))) +
theme_minimal()Interpretation: The Scale-Location plot provides the clearest evidence of heteroscedasticity. The loess line slopes upward from left to right, meaning the spread of residuals (measured as \(\sqrt{|\text{standardized residuals}|}\)) increases systematically with fitted values. This confirms the funnel pattern seen in the residuals vs. fitted plot. In practical terms, the model’s predictions are more precise for individuals with low predicted mentally unhealthy days and less precise for those with higher predictions. This pattern is common with health outcomes that have a floor at zero.
R provides four built-in diagnostic plots through plot()
on an lm object. These are quick but less customizable.
The four panels are:
Interpretation of our diagnostic panels: The four base R diagnostic plots summarize what we have observed individually. Panel 1 (Residuals vs. Fitted) shows the non-linear trend and funnel shape. Panel 2 (Normal Q-Q) shows the severe right-tail departure. Panel 3 (Scale-Location) confirms increasing variance. Panel 4 (Residuals vs. Leverage) shows that while some observations have notable leverage, no single observation falls beyond Cook’s distance contour lines of 0.5 or 1.0, suggesting no single observation dominates the model fit. R labels the most extreme observations by row number in each panel for further investigation.
While visual diagnostics are essential, formal tests can supplement them, especially for borderline cases.
# Breusch-Pagan test for heteroscedasticity
bp_test <- car::ncvTest(mod1)
tibble(
Test = "Breusch-Pagan",
Statistic = bp_test$ChiSquare,
df = bp_test$Df,
p_value = bp_test$p
) |>
kable(digits = 4, caption = "Breusch-Pagan Test for Heteroscedasticity") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Test | Statistic | df | p_value |
|---|---|---|---|
| Breusch-Pagan | 1157.809 | 1 | 0 |
Interpretation: The Breusch-Pagan test is highly significant (χ² = 1157.8, p ≈ 0), providing strong statistical evidence of heteroscedasticity. However, this must be interpreted cautiously in the context of the note below.
Note: With very large samples (n = 5000), formal tests are often overpowered and detect trivial departures from assumptions. In large samples, the visual diagnostic is typically more informative than the p-value. If the Q-Q plot looks reasonable and the residual plot has no alarming pattern, you may proceed even if a formal test rejects.
When assumptions are violated, transforming the outcome variable is a common remedy.
| Problem | Transformation | R code |
|---|---|---|
| Right-skewed residuals | \(\log(Y)\) | lm(log(y) ~ x) |
| Right-skewed residuals (mild) | \(\sqrt{Y}\) | lm(sqrt(y) ~ x) |
| Left-skewed residuals | \(Y^2\) | lm(I(y^2) ~ x) |
| Non-constant variance | \(\log(Y)\) or \(\sqrt{Y}\) | Depends on pattern |
If the Q-Q plot suggests right skew, we can try a log transformation.
Since our outcome (menthlth_days) includes zeros, we use
\(\log(Y + 1)\).
mod_log <- lm(log(menthlth_days + 1) ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_dx)
# Compare Q-Q plots side by side
par(mfrow = c(1, 2))
qqnorm(rstudent(mod1), main = "Q-Q: Original Scale")
qqline(rstudent(mod1), col = "red")
qqnorm(rstudent(mod_log), main = "Q-Q: Log(Y+1) Scale")
qqline(rstudent(mod_log), col = "red")Interpretation: The log(Y+1) transformation substantially improves the Q-Q plot. The original scale (left panel) shows the severe right-tail departure we identified earlier. The transformed scale (right panel) shows points that track the reference line much more closely, though some departure remains in the tails. The transformation compresses the long right tail of the outcome distribution, making the residuals more symmetric. However, the improvement is not perfect because the outcome includes many exact zeros (respondents with 0 mentally unhealthy days), which the log transformation cannot fully normalize.
Important: Transforming the outcome changes the interpretation of the coefficients. With \(\log(Y+1)\), a one-unit increase in \(X\) is associated with a multiplicative change in \(Y+1\), not an additive change.
An outlier is an observation with a residual that is much larger than the rest. An influential observation is one whose removal substantially changes the fitted model (coefficients, predictions, or both).
Not all outliers are influential, and not all influential points are outliers. Influence depends on two components:
We test whether any jackknife residual is statistically significant using a Bonferroni correction to account for multiple testing.
For \(n\) observations and \(p\) predictors:
n <- nrow(brfss_dx)
p <- length(coef(mod1)) - 1 # number of predictors (excluding intercept)
alpha <- 0.05
# Critical value with Bonferroni correction
crit_val <- qt(1 - alpha / (2 * n), df = n - p - 2)
cat("Sample size:", n, "\n")## Sample size: 5000
## Number of predictors: 5
## Bonferroni critical value: 4.422
# Identify outliers
outlier_idx <- which(abs(brfss_dx$jack_resid) > crit_val)
cat("Number of jackknife residual outliers:", length(outlier_idx), "\n")## Number of jackknife residual outliers: 0
if (length(outlier_idx) > 0) {
brfss_dx |>
slice(outlier_idx) |>
dplyr::select(menthlth_days, physhlth_days, sleep_hrs, age, jack_resid) |>
kable(digits = 3, caption = "Outliers by Jackknife Residual (Bonferroni-corrected)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
}Interpretation: Using the Bonferroni-corrected critical value of 4.42, no observations qualify as statistically significant outliers. This may seem surprising given the clearly non-normal residual distribution, but the Bonferroni correction is conservative with n = 5,000 (the per-observation α is 0.05/5000 = 0.00001). The absence of formal outliers does not mean all observations fit well; it simply means no single observation is extreme enough to be statistically distinguished from the rest after correcting for 5,000 simultaneous tests.
The car::outlierTest() function automates this
Bonferroni-corrected test:
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 190 4.33569 1.4816e-05 0.074082
Leverage (\(h_i\)) measures how far an observation’s predictor values are from the mean of all predictors. High-leverage observations have unusual predictor values and exert disproportionate influence on the fitted line.
The leverage for observation \(i\) is:
\[h_i = \frac{1}{n} + \sum_{j=1}^{p} \frac{(X_{ij} - \bar{X}_j)^2}{(n-1)s_j^2}\]
Key properties of leverage:
Rules of thumb for high leverage:
| Rule | Threshold | Reference |
|---|---|---|
| \(h_i > 2\bar{h}\) | \(2(p+1)/n\) | Hoaglin and Welsch (1978) |
| \(h_i > 3\bar{h}\) | \(3(p+1)/n\) | Velleman and Welsch (1981) |
| \(h_i > 0.5\) | 0.5 | Huber (1981) — clearly large |
| \(h_i > 0.2\) | 0.2 | Huber (1981) — worth examining |
brfss_dx <- brfss_dx |>
mutate(leverage = hatvalues(mod1))
h_bar <- (p + 1) / n
cat("Average leverage:", round(h_bar, 5), "\n")## Average leverage: 0.0012
## 2x average leverage threshold: 0.0024
## 3x average leverage threshold: 0.0036
# High leverage points (using 3x threshold)
high_lev <- brfss_dx |>
filter(leverage > 3 * h_bar)
cat("Number of high-leverage points (> 3x average):", nrow(high_lev), "\n")## Number of high-leverage points (> 3x average): 157
ggplot(brfss_dx, aes(x = seq_len(n), y = leverage)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 2 * h_bar, color = "orange", linetype = "dashed",
linewidth = 0.8) +
geom_hline(yintercept = 3 * h_bar, color = "red", linetype = "dashed",
linewidth = 0.8) +
annotate("text", x = n * 0.9, y = 2 * h_bar + 0.0005, label = "2x average",
color = "orange") +
annotate("text", x = n * 0.9, y = 3 * h_bar + 0.0005, label = "3x average",
color = "red") +
labs(title = "Leverage Values by Observation Index",
x = "Observation Index", y = "Leverage (h_i)") +
theme_minimal()Interpretation: The leverage plot shows that the vast majority of observations have low leverage (clustered near the average of 0.0012), but 157 observations exceed the 3x threshold. These high-leverage points have unusual combinations of predictor values (e.g., very young age with high BMI, or extreme sleep hours). However, high leverage alone does not mean an observation is problematic; it only means the observation has outsized potential to influence the regression if its outcome value is also unusual. We need Cook’s distance to assess actual influence.
Cook’s distance combines residual size and leverage into a single measure of overall influence. It quantifies how much all fitted values change when observation \(i\) is removed:
\[D_i = \frac{r_i^2}{p+1} \cdot \frac{h_i}{1 - h_i}\]
where \(r_i\) is the internally studentized residual. A large Cook’s distance indicates that the observation substantially influences the regression surface.
Rules of thumb for Cook’s distance:
| Threshold | Interpretation |
|---|---|
| \(D_i > 4/n\) | Common rule for flagging potentially influential points |
| \(D_i > 1\) | Clearly influential (some textbooks) |
| \(D_i > F_{0.50, p+1, n-p-1}\) | Compares to the median of the \(F\)-distribution |
brfss_dx <- brfss_dx |>
mutate(cooks_d = cooks.distance(mod1))
# Threshold
cooks_threshold <- 4 / n
cat("Cook's distance threshold (4/n):", round(cooks_threshold, 5), "\n")## Cook's distance threshold (4/n): 8e-04
influential_obs <- brfss_dx |>
filter(cooks_d > cooks_threshold)
cat("Number of influential points (D > 4/n):", nrow(influential_obs), "\n")## Number of influential points (D > 4/n): 504
ggplot(brfss_dx, aes(x = seq_len(n), y = cooks_d)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = cooks_threshold, color = "red",
linetype = "dashed", linewidth = 0.8) +
labs(title = "Cook's Distance by Observation Index",
subtitle = paste0("Red line: 4/n = ", round(cooks_threshold, 5)),
x = "Observation Index", y = "Cook's Distance") +
theme_minimal()Interpretation: Using the 4/n = 0.0008 threshold, 504 observations (about 10% of the sample) are flagged as potentially influential. This is a large number, but the threshold is quite strict with n = 5,000. Importantly, no observation has a Cook’s distance approaching 0.5 or 1.0 (the traditional thresholds for “clearly influential”), and the maximum Cook’s distance appears to be well below 0.05. This suggests that while many observations contribute modestly to the model fit, no single observation dominates or distorts the regression surface. The pattern of many small Cook’s distances is typical for large datasets.
Two additional influence measures complement Cook’s distance:
DFFITS measures the influence of observation \(i\) on its own fitted value:
\[\text{DFFITS}_i = r_{(-i)} \sqrt{\frac{h_i}{1-h_i}}\]
Flagging threshold: \(|\text{DFFITS}_i| > 2\sqrt{(p+1)/n}\)
DFBETAS measures the influence of observation \(i\) on each individual regression coefficient \(\hat{\beta}_j\):
\[\text{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{\hat{\sigma}_{(-i)} \sqrt{(X^TX)^{-1}_{jj}}}\]
Flagging threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\)
brfss_dx <- brfss_dx |>
mutate(dffits_val = dffits(mod1))
dffits_threshold <- 2 * sqrt((p + 1) / n)
dfbetas_threshold <- 2 / sqrt(n)
cat("DFFITS threshold:", round(dffits_threshold, 4), "\n")## DFFITS threshold: 0.0693
## DFBETAS threshold: 0.0283
# Number exceeding DFFITS threshold
cat("Observations with |DFFITS| > threshold:",
sum(abs(brfss_dx$dffits_val) > dffits_threshold), "\n")## Observations with |DFFITS| > threshold: 504
# DFBETAS for each coefficient
dfb <- dfbetas(mod1)
cat("\nObservations exceeding DFBETAS threshold per coefficient:\n")##
## Observations exceeding DFBETAS threshold per coefficient:
tibble(
Coefficient = colnames(dfb),
n_flagged = map_int(seq_len(ncol(dfb)), \(j) sum(abs(dfb[, j]) > dfbetas_threshold))
) |>
kable(caption = "DFBETAS Flags by Coefficient") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Coefficient | n_flagged |
|---|---|
| (Intercept) | 333 |
| physhlth_days | 397 |
| sleep_hrs | 349 |
| age | 263 |
| sexFemale | 350 |
| bmi | 294 |
Interpretation: The DFFITS and DFBETAS results complement the Cook’s distance findings. The DFFITS threshold is relatively small (reflecting the large sample size), so a substantial number of observations are flagged. The DFBETAS table shows how many observations meaningfully shift each individual coefficient. The intercept and physhlth_days tend to have the most flagged observations, consistent with the heteroscedasticity we observed (observations with extreme physical health values exert more leverage on those coefficients). Again, no individual observation is drastically altering the model, but the collective pattern reinforces the importance of checking whether results are robust to the removal of influential subsets.
This plot combines residuals and leverage, with Cook’s distance contours overlaid. Points in the upper-right or lower-right corners are both outlying and high-leverage, making them potentially influential.
ggplot(brfss_dx, aes(x = leverage, y = jack_resid)) +
geom_point(aes(size = cooks_d), alpha = 0.3, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_vline(xintercept = 3 * h_bar, color = "red", linetype = "dashed") +
scale_size_continuous(name = "Cook's D", range = c(0.5, 4)) +
labs(title = "Jackknife Residuals vs. Leverage",
subtitle = "Point size proportional to Cook's distance",
x = "Leverage", y = "Jackknife Residual") +
theme_minimal()Interpretation: This combined plot is the most informative single diagnostic for influence. Most observations cluster in the lower-left (low leverage, small residuals), which is ideal. A few observations appear in the upper-left (large residuals but low leverage, meaning they are outliers in Y but not unusual in X space) and lower-right (high leverage but small residuals, meaning they are unusual in X but well-predicted). The critical region is the upper-right corner (both high leverage and large residuals), where observations would have the greatest potential to distort the model. In our data, no observations occupy this dangerous corner with large Cook’s distances, consistent with the conclusion that no single observation unduly influences the regression.
When you identify outliers or influential observations:
# Compare model with and without high-influence points
influential_idx <- which(brfss_dx$cooks_d > cooks_threshold)
brfss_no_infl <- brfss_dx |> slice(-influential_idx)
mod_no_infl <- lm(menthlth_days ~ physhlth_days + sleep_hrs + age + sex + bmi,
data = brfss_no_infl)
# Side-by-side comparison
bind_rows(
tidy(mod1) |> mutate(Model = "Full data"),
tidy(mod_no_infl) |> mutate(Model = "Without influential obs")
) |>
dplyr::select(Model, term, estimate, std.error, p.value) |>
pivot_wider(
names_from = Model,
values_from = c(estimate, std.error, p.value),
names_glue = "{Model}_{.value}"
) |>
kable(digits = 4, caption = "Coefficient Comparison: Full Data vs. Influential Points Removed") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE) |>
scroll_box(width = "100%")| term | Full data_estimate | Without influential obs_estimate | Full data_std.error | Without influential obs_std.error | Full data_p.value | Without influential obs_p.value |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9123 | 5.3267 | 0.7748 | 0.4542 | 0.0000 | 0.0000 |
| physhlth_days | 0.3162 | 0.2806 | 0.0130 | 0.0109 | 0.0000 | 0.0000 |
| sleep_hrs | -0.6763 | -0.3095 | 0.0751 | 0.0450 | 0.0000 | 0.0000 |
| age | -0.0789 | -0.0567 | 0.0059 | 0.0033 | 0.0000 | 0.0000 |
| sexFemale | 1.5046 | 1.3468 | 0.2025 | 0.1125 | 0.0000 | 0.0000 |
| bmi | 0.0385 | 0.0285 | 0.0163 | 0.0095 | 0.0185 | 0.0026 |
Interpretation: Removing the 504 observations flagged by Cook’s distance (about 10% of the sample) produces modest changes in the coefficients. The direction and significance of all predictors remain the same, which is reassuring. The physical health days coefficient and sleep coefficient may shift slightly, but the substantive conclusions are unchanged. This comparison demonstrates that our model is reasonably robust to the influence of individual observations. In a situation where removing influential points substantially changed coefficients, we would need to investigate those observations more carefully and potentially report both sets of results.
Multicollinearity occurs when two or more predictors in a regression model are moderately or highly correlated. This creates instability in the coefficient estimates: the standard errors inflate, individual predictors may appear non-significant even when the overall model is significant, and estimated coefficients may change dramatically when variables are added or removed.
| Type | Cause | Example |
|---|---|---|
| Structural | Mathematical artifact from creating new variables | Including both \(X\) and \(X^2\); or age and age-squared |
| Data-based | Inherent correlation in the data | Income and education; systolic and diastolic blood pressure |
How do you know multicollinearity is present?
The VIF for predictor \(j\) is:
\[VIF_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is the \(R^2\) obtained from regressing \(X_j\) on all other predictors. The VIF tells you how much the variance of \(\hat{\beta}_j\) is inflated by collinearity relative to what it would be if \(X_j\) were uncorrelated with the other predictors.
An equivalent measure is the tolerance:
\[\text{tolerance}_j = \frac{1}{VIF_j} = 1 - R_j^2\]
Rules of thumb:
| VIF value | Interpretation |
|---|---|
| \(VIF_j = 1\) | No correlation with other predictors |
| \(VIF_j > 4\) | Needs further investigation |
| \(VIF_j > 10\) | Strong collinearity, action likely needed |
vif_values <- car::vif(mod1)
tibble(
Predictor = names(vif_values),
VIF = round(vif_values, 3),
Tolerance = round(1 / vif_values, 3)
) |>
mutate(Flag = case_when(
VIF > 10 ~ "Strong collinearity",
VIF > 4 ~ "Investigate",
TRUE ~ "OK"
)) |>
kable(caption = "Variance Inflation Factors") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | VIF | Tolerance | Flag |
|---|---|---|---|
| physhlth_days | 1.030 | 0.971 | OK |
| sleep_hrs | 1.028 | 0.973 | OK |
| age | 1.029 | 0.971 | OK |
| sex | 1.004 | 0.996 | OK |
| bmi | 1.020 | 0.980 | OK |
Interpretation: All VIF values are very close to 1.0 (ranging from 1.004 to 1.030), with corresponding tolerances near 1.0. This means there is essentially no multicollinearity among the predictors in our model. Each predictor provides largely independent information, and the standard errors are not inflated by collinearity. This is not surprising given that our predictors (physical health days, sleep hours, age, sex, BMI) measure fundamentally different constructs with limited conceptual overlap.
While pairwise correlations do not capture all forms of multicollinearity (a variable can be correlated with a linear combination of other variables without being strongly correlated with any single one), they are a useful starting point.
brfss_dx |>
dplyr::select(physhlth_days, sleep_hrs, age, bmi) |>
cor(use = "complete.obs") |>
round(3) |>
kable(caption = "Pairwise Correlation Matrix (Continuous Predictors)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| physhlth_days | sleep_hrs | age | bmi | |
|---|---|---|---|---|
| physhlth_days | 1.000 | -0.075 | 0.092 | 0.120 |
| sleep_hrs | -0.075 | 1.000 | 0.124 | -0.073 |
| age | 0.092 | 0.124 | 1.000 | 0.011 |
| bmi | 0.120 | -0.073 | 0.011 | 1.000 |
Interpretation: The pairwise correlations are uniformly weak, with the largest being r = 0.124 between sleep hours and age (a modest positive correlation, indicating older adults sleep slightly more in this sample). Physical health days and BMI have a correlation of 0.12, reflecting the well-established link between obesity and physical health problems. None of these correlations approach levels that would cause collinearity concerns (typically r > 0.7 or 0.8). The correlation matrix confirms the VIF findings: multicollinearity is not an issue in this model.
The appropriate remedy depends on the type of multicollinearity:
For structural multicollinearity (e.g., \(X\) and \(X^2\)):
# Example: structural multicollinearity
brfss_dx <- brfss_dx |>
mutate(
bmi_sq = bmi^2,
bmi_c = bmi - mean(bmi),
bmi_c_sq = bmi_c^2
)
# VIF before centering
mod_poly <- lm(menthlth_days ~ bmi + bmi_sq, data = brfss_dx)
cat("VIF before centering:\n")## VIF before centering:
## bmi bmi_sq
## 30.62497 30.62497
# VIF after centering
mod_poly_c <- lm(menthlth_days ~ bmi_c + bmi_c_sq, data = brfss_dx)
cat("\nVIF after centering:\n")##
## VIF after centering:
## bmi_c bmi_c_sq
## 1.447669 1.447669
Interpretation: Before centering, BMI and BMI² have extremely high VIFs (likely > 50) because they are mathematically related: squaring a variable creates a strong linear relationship with the original. After centering BMI by subtracting its mean, the VIF drops dramatically (likely close to 1), because the centered variable and its square are nearly uncorrelated. Centering does not change the model’s predictions or the overall F-test; it only redistributes the variance attribution between the linear and quadratic terms, making each coefficient individually interpretable and testable.
For data-based multicollinearity:
A systematic approach to regression diagnostics:
| Step | Action | Tools |
|---|---|---|
| 1 | Fit the model and examine coefficients | lm(), tidy(), glance() |
| 2 | Check residuals vs. fitted plot (linearity + homoscedasticity) | Residuals vs. fitted, Breusch-Pagan test |
| 3 | Check Q-Q plot and histogram (normality) | Q-Q plot, histogram, Shapiro-Wilk |
| 4 | If assumptions violated, consider transformations and refit | log(Y), sqrt(Y), Y^2 |
| 5 | Compute VIF for multicollinearity | car::vif() |
| 6 | Compute leverage, Cook’s distance, DFFITS, DFBETAS for influential obs | hatvalues(), cooks.distance(), dffits(), dfbetas() |
| 7 | Compare model with and without influential observations; report both if different | Compare coefficients, CIs |
| Topic | Key point |
|---|---|
| LINE assumptions | Linearity, Independence, Normality, Equal Variance must be checked |
| Residuals | Jackknife residuals are preferred for outlier detection |
| Diagnostic plots | Residuals vs. fitted (linearity + variance); Q-Q plot (normality) |
| Transformations | \(\log(Y)\) for right skew; \(Y^2\) for left skew |
| Outlier testing | Use Bonferroni-corrected jackknife residuals |
| Leverage | Measures extremeness in predictor space; \(h_i > 2(p+1)/n\) or \(3(p+1)/n\) |
| Cook’s distance | Combines residual and leverage; \(D_i > 4/n\) flags potential influence |
| DFFITS/DFBETAS | Measure influence on fitted values and individual coefficients |
| Do not auto-delete | Always investigate; report models with and without flagged observations |
| Multicollinearity | VIF > 10 is strong collinearity; center variables for structural collinearity |
In Part 1 we worked through every diagnostic on a single, messy real dataset. The downside of that approach is that everything was wrong at once, which makes it hard to learn what each pattern looks like in isolation. In this section we will look at eight small simulated datasets, each engineered to display one clear diagnostic story. For each example, look at the plot first, decide what you see, then read the explanation.
Purpose of this section: The goal is pattern recognition, not code.
Setup: \(Y = 2 + 1.5 X + \varepsilon\), with \(\varepsilon \sim N(0, 1)\) and \(X \sim N(0,1)\). Everything the textbook asks for is satisfied.
n <- 200
ex1 <- tibble(x = rnorm(n), y = 2 + 1.5 * x + rnorm(n, sd = 1))
fit1 <- lm(y ~ x, data = ex1)
par(mfrow = c(1, 2))
plot(fit1, which = 1) # Residuals vs Fitted
plot(fit1, which = 2) # Normal Q-QClass question: Looking at these two diagnostic plots, which statement is most accurate?
A. The model has serious heteroscedasticity and the
variance grows with the fitted values.
B. The
model shows clear nonlinearity that requires a quadratic term.
C. The diagnostic plots look acceptable: residuals
scatter randomly around zero and Q-Q points hug the diagonal.
D. The residuals are heavily right-skewed and the
outcome should be log-transformed.
Answer: C.
The residuals-vs-fitted plot is a shapeless cloud with a flat loess line, and the Q-Q points fall along the reference line. Linearity holds, variance is constant, and the residuals are approximately normal. This is the picture you want in your head when you ask “does my model look OK?”
Real-world analogue: A well-specified model of adult height regressed on a single, well-measured exposure in a large random sample, e.g., height on years of childhood nutrition supplementation in a cohort with no measurement error.Remember: A shapeless cloud with a flat smoother = nothing to fix on linearity or variance.
Setup: The truth is \(Y = 2 + 0.5 X + 0.8 X^2 + \varepsilon\), but we naively fit a straight line in \(X\).
ex2 <- tibble(x = runif(n, -3, 3),
y = 2 + 0.5 * x + 0.8 * x^2 + rnorm(n, sd = 1))
fit2 <- lm(y ~ x, data = ex2)
plot(fit2, which = 1)Class question: The residuals-vs-fitted plot shows residuals that are positive at both ends and negative in the middle, with a clear curved (U-shaped) loess line. What is the most likely problem with this model?
A. The errors are not normally distributed.
B. The relationship between \(X\) and \(Y\) is not linear; the model is missing a
curvature term.
C. There is a single
influential observation distorting the fit.
D.
The variance of the residuals increases with the fitted value.
Answer: B.
A systematic curve in the residuals-vs-fitted plot means the mean structure is wrong. The straight line is missing the curvature. Linearity is violated, even though variance might still be constant. The fix is to add a quadratic term, a spline, or transform \(X\), not to transform \(Y\).
Real-world analogue: BMI and mortality. The risk is U-shaped (high at very low BMI from frailty/wasting, low in the middle, high at obesity). A linear BMI term will produce exactly this kind of residual smile.Remember: A curved smoother = a missing term in the model.
Setup: \(Y = 1 + 2X + \varepsilon\), but the noise grows with \(X\): \(\varepsilon \sim N(0, X^2)\).
x <- runif(n, 1, 10)
ex3 <- tibble(x = x, y = 1 + 2 * x + rnorm(n, sd = x))
fit3 <- lm(y ~ x, data = ex3)
par(mfrow = c(1, 2))
plot(fit3, which = 1)
plot(fit3, which = 3) # Scale-LocationClass question: The residuals form a “megaphone” opening to the right, but the loess line is roughly flat. The Scale-Location line slopes upward. Which assumption is violated, and what is the consequence for inference?
A. Linearity is violated; the coefficient estimates
are biased.
B. Normality is violated; the model
cannot be used at all.
C. Equal variance
(homoscedasticity) is violated; coefficient estimates are still
unbiased, but the standard errors and p-values are wrong.
D. Independence is violated; the observations must be
clustered.
Answer: C.
A flat loess line with a spreading cloud means the mean model is fine but the variance is not constant. OLS coefficients remain unbiased, but the formula-based standard errors assume constant variance, so p-values and CIs are misleading. Fix with a variance-stabilizing transformation (log, sqrt), weighted least squares, or robust (sandwich) standard errors.
Real-world analogue: Annual healthcare expenditures regressed on income. People with low income spend small, similar amounts; high-income people range from frugal to enormous spenders, so variance balloons at the right.Remember: Flat loess + spreading cloud = heteroscedasticity, not nonlinearity.
Setup: Linear model in the mean, but the errors come from an exponential distribution (heavy right tail).
x <- rnorm(n)
ex4 <- tibble(x = x, y = 2 + 1.5 * x + (rexp(n, rate = 1) - 1))
fit4 <- lm(y ~ x, data = ex4)
par(mfrow = c(1, 2))
plot(fit4, which = 2) # Q-Q
hist(rstudent(fit4), breaks = 30, col = "steelblue", border = "white",
main = "Studentized residuals", xlab = "")Class question: The Q-Q plot tracks the diagonal in the lower half but curves sharply upward in the upper-right tail (a “hockey stick”), and the histogram has a long right tail. What does this pattern indicate, and what is an appropriate first remedy?
A. The residuals are left-skewed; square the
outcome.
B. The residuals are right-skewed; try
a log or square-root transformation of the outcome.
C. The residuals are normal but the variance is
non-constant; use weighted least squares.
D.
There are several high-leverage points pulling the upper tail.
Answer: B.
When the upper tail of the Q-Q plot peels upward off the line, the residuals have a long right tail, i.e., right skew. Coefficient estimates remain unbiased (and the CLT helps with large \(n\)), but prediction intervals will be wrong and small-sample inference is unreliable. A log or square-root transformation of \(Y\) is the standard first move.
Real-world analogue: Hospital length-of-stay regressed on patient characteristics. Most people leave in 2–4 days; a small number stay for weeks. The same shape arises for healthcare costs, viral loads, and income.Remember: Upper tail peels up = right skew → log or sqrt \(Y\). Lower tail peels down = left skew → square \(Y\).
Setup: Clean linear data, but we deliberately add a single observation far out in \(X\) with the “wrong” \(Y\).
ex5 <- tibble(x = c(rnorm(n - 1), 10),
y = c(2 + 1.5 * rnorm(n - 1) + rnorm(n - 1, sd = 1), -8))
fit5 <- lm(y ~ x, data = ex5)
par(mfrow = c(1, 2))
plot(ex5$x, ex5$y, pch = 19, col = "steelblue",
main = "Data with one bad point", xlab = "x", ylab = "y")
abline(fit5, col = "red", lwd = 2)
plot(fit5, which = 5) # Residuals vs Leverage with Cook's contoursClass question: The scatterplot shows that the regression line is visibly tilted by a single extreme point, and that same point sits far to the right of the residuals-vs-leverage plot, outside the dashed Cook’s distance contours. How would you describe this observation?
A. It has high leverage but is not influential, so
we can ignore it.
B. It has a large residual
but low leverage, so it cannot affect the slope.
C. It has both high leverage and a large
residual, making it influential — removing it would meaningfully change
the regression coefficients.
D. It is a normal
observation; Cook’s distance contours always look like this.
Answer: C.
This point has high leverage (extreme in \(X\)) and a large residual. That is the dangerous combination: it is influential, and removing it would meaningfully change the slope. Always investigate such points; do not delete them automatically.
Real-world analogue: A county-level study of opioid prescribing rates and overdose deaths where one tiny rural county has both an extreme prescribing rate and a wildly different death rate (perhaps because of a single pill mill). That one county can rotate the entire regression line.Remember: High leverage alone is fine. High leverage + big residual = influence. The upper-right and lower-right corners of residuals-vs-leverage are the danger zones.
Setup: Clean data, but we corrupt one mid-range observation by setting its \(Y\) very far from the line. Crucially, the bad point has an average \(X\) value.
ex6 <- tibble(x = rnorm(n), y = 2 + 1.5 * x + rnorm(n, sd = 1))
ex6$y[100] <- 25 # extreme Y, average X
fit6 <- lm(y ~ x, data = ex6)
par(mfrow = c(1, 2))
plot(ex6$x, ex6$y, pch = 19, col = "steelblue",
main = "Outlier in Y, average X", xlab = "x", ylab = "y")
abline(fit6, col = "red", lwd = 2)
plot(fit6, which = 5)Class question: One observation has a very large residual (it shoots up far above the regression line), but the regression line barely moves and that point’s leverage is near the average leverage. Should you be worried?
A. Yes — any large residual must be removed before
refitting.
B. Yes — large residuals always
cause biased coefficient estimates.
C. Not very
— without leverage, a Y-outlier cannot rotate the line; you should still
investigate it (data entry error?), but the model is essentially
unchanged.
D. No — large residuals never matter
at all.
Answer: C.
Big residuals are not automatically a crisis. Without leverage, an outlier in \(Y\) cannot rotate the line, only nudge the intercept very slightly. Always investigate the point (data entry? unusual case?), but the coefficients are essentially unchanged.
Real-world analogue: A typing error in REDCap where someone’s reported number of cigarettes per day was entered as 250 instead of 25. If their other characteristics are average, the model coefficients hardly budge — but the residual will scream.Remember: A residual outlier in the middle of the X cloud is mostly harmless. Investigate, do not panic, and never delete automatically.
Setup: Two predictors that are almost the same variable. The truth uses both, but they carry nearly identical information.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1) # x2 is essentially x1
ex7 <- tibble(x1 = x1, x2 = x2, y = 1 + 2 * x1 + 1.5 * x2 + rnorm(n))
fit7 <- lm(y ~ x1 + x2, data = ex7)
tidy(fit7) |> knitr::kable(digits = 3, caption = "Coefficients (note the SEs)")| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.969 | 0.071 | 13.629 | 0.000 |
| x1 | 1.621 | 0.697 | 2.325 | 0.021 |
| x2 | 1.908 | 0.695 | 2.745 | 0.007 |
## x1 x2
## 103.1099 103.1099
## [1] 0.995139
Class question: The model output shows that the overall \(F\)-test is highly significant, but neither individual coefficient (\(x_1\) or \(x_2\)) is significant at \(\alpha = 0.05\). Both VIFs are above 100. What is going on?
A. The model is misspecified — neither predictor
actually matters.
B. There is severe
multicollinearity: \(x_1\) and \(x_2\) carry nearly the same information, so
the data cannot separate their individual effects, even though together
they predict \(y\) well.
C. The sample size is too small to detect either
effect.
D. The residuals are not normally
distributed.
Answer: B.
The classic collinearity signature: large overall \(F\), no significant individual \(t\)-tests, and VIFs \(\gg 10\). The data cannot tell \(x_1\) apart from \(x_2\), so it cannot give either one a stable coefficient. Predictions from the model are fine; the individual coefficients are not interpretable.
Real-world analogue: Putting both systolic and diastolic blood pressure (or both LDL and total cholesterol) into the same model as predictors of cardiovascular events. They move together, and the model cannot separate their independent effects.Remember: Big F, no significant t’s, huge VIFs = collinearity. Drop one, combine them, or center; do not interpret the individual coefficients as if they were independent effects.
Setup: A nearly-perfect linear model with \(n = 5{,}000\) and a tiny amount of variance growth in \(X\). The plot looks fine. The Breusch–Pagan test rejects.
n_big <- 5000
x <- runif(n_big, 1, 10)
ex8 <- tibble(x = x, y = 1 + 2 * x + rnorm(n_big, sd = 1 + 0.05 * x))
fit8 <- lm(y ~ x, data = ex8)
plot(fit8, which = 1)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 89.45845, Df = 1, p = < 2.22e-16
Class question: With \(n = 5{,}000\), the residuals-vs-fitted plot looks essentially fine (only a hint of extra spread on the right), but the Breusch–Pagan test rejects with \(p < 2.2 \times 10^{-16}\). What should you do?
A. Trust the test: refit using log \(Y\) immediately because heteroscedasticity
is “proven.”
B. Trust the plot: in very large
samples, formal assumption tests reject trivial departures that do not
affect inference. The visual is more informative here.
C. Drop observations until the BP test is
non-significant.
D. Conclude that the model is
unusable and abandon the analysis.
Answer: B.
With huge samples, formal tests detect departures so small they would not change a single decision you make. The visual is more informative than the p-value here. The opposite happens with small samples: tests miss real problems that the plot reveals.
Real-world analogue: Any analysis of EHR data with hundreds of thousands of rows. Shapiro–Wilk, Breusch–Pagan, and Ramsey RESET will all reject. That does not mean the model is unusable; it means the diagnostic plots, not the p-values, should guide your judgment.Remember: Trust the plot first, the test second. In big-n epidemiology (BRFSS, NHANES, EHR data), assume formal assumption tests will reject something — go look at the picture.
| Example | Pattern | Memorable cue |
|---|---|---|
| 1. Clean | Shapeless cloud, flat smoother | “Boring is good.” |
| 2. Nonlinearity | U-shape in residuals vs fitted | “Curve = missing term.” |
| 3. Heteroscedasticity | Megaphone, sloped Scale-Location | “Spreading, not curving.” |
| 4. Skewed residuals | Q-Q hockey stick on the right | “Upper tail peels up = log Y.” |
| 5. Influential point | Line tilted; point past Cook’s contours | “High leverage + big residual = danger.” |
| 6. Y-outlier, low leverage | Big residual, line unchanged | “Average X = nearly harmless.” |
| 7. Collinearity | Big F, no t’s, VIF \(\gg 10\) | “Same info twice.” |
| 8. Big-n test trap | Plot OK, BP test tiny p | “Trust the picture.” |
EPI 553 — Regression Diagnostics Lab
Not Due: Spare time muse!!!
In this lab, you will conduct a thorough diagnostic assessment of a regression model using the BRFSS 2020 dataset.
Use the saved analytic dataset from today’s lecture. The
outcome for the lab is menthlth_days and
you will build a model with a different set of
predictors than the guided practice.
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
bmi |
Body mass index | Continuous |
exercise |
Any physical activity (Yes/No) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
brfss_dx <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Diagnostics/brfss_dx_2020.rds"
)1a. (5 pts) Fit the model:
menthlth_days ~ physhlth_days + sleep_hrs + age + sex + exercise + income_cat.
Report the model coefficients, \(R^2\),
and Adjusted \(R^2\).
1b. (5 pts) Create a residuals vs. fitted values plot using jackknife residuals. Include a loess smoothing line. Does the linearity assumption appear to hold? What about homoscedasticity?
1c. (5 pts) Create a normal Q-Q plot of the jackknife residuals. Does the normality assumption hold? Describe the specific pattern you observe (e.g., right skew, heavy tails, etc.).
1d. (5 pts) Run the Breusch-Pagan test for
heteroscedasticity using car::ncvTest(). Report the test
statistic and p-value. Given the large sample size, how should you weigh
the formal test result against the visual evidence from your residual
plot?
2a. (5 pts) Based on your diagnostic plots in Task 1, apply an appropriate transformation to the outcome variable (e.g., \(\log(Y+1)\) or \(\sqrt{Y}\)). Refit the model and create a new Q-Q plot. Did the transformation improve normality?
2b. (5 pts) Create residuals vs. fitted plots for both the original and transformed models side by side. Compare the patterns.
2c. (5 pts) In 2–3 sentences, explain how the transformation changes the interpretation of the regression coefficients. Is the trade-off worth it?
3a. (5 pts) Compute jackknife residuals for the original (untransformed) model. What are the minimum and maximum jackknife residuals? How many observations have \(|r_{(-i)}| > 3\)?
3b. (10 pts) Perform a formal Bonferroni-corrected
outlier test. Calculate the critical value manually using
qt() and then verify using car::outlierTest().
Are any observations statistically significant outliers?
3c. (5 pts) For the top 5 observations with the largest absolute jackknife residuals, examine their predictor values. Is there anything unusual about these observations? Create a table showing their values.
3d. (5 pts) In 2–3 sentences, explain why it is important to use the Bonferroni correction when testing multiple residuals for outlier status.
4a. (5 pts) Compute the leverage (\(h_i\)) for all observations. Report the average leverage. How many observations exceed the \(3\bar{h}\) threshold?
4b. (5 pts) Compute Cook’s distance for all observations. Using the \(4/n\) threshold, how many observations are potentially influential?
4c. (5 pts) Create a scatterplot of jackknife residuals vs. leverage with point size proportional to Cook’s distance. Identify any observations that are both high-leverage and have large residuals.
4d. (5 pts) Refit the model after removing all observations with Cook’s distance \(> 4/n\). Compare the coefficients with the full-data model in a single table. Do any coefficients change substantially?
4e. (5 pts) A colleague argues: “We should just remove all outliers and high-leverage points to get a cleaner model.” Write a 3–4 sentence response explaining why this approach is problematic.
5a. (5 pts) Compute the VIF for all predictors in your model. Present the results in a table with VIF, tolerance, and a flag column. Is there evidence of multicollinearity?
5b. (5 pts) Create a pairwise correlation matrix for the continuous predictors. Which pair has the strongest correlation?
5c. (5 pts) Suppose you wanted to add a quadratic
term for BMI to the model (\(\text{bmi}^2\)). Compute the VIF for
bmi and bmi_sq in this expanded model. Then
center BMI and recompute. Show that centering reduces the VIF. Why does
centering help?
End of Lab Activity