Learning Objectives: By the end of this lecture, students will be able to: (1) distinguish outliers, high-leverage points, and influential observations; (2) compute and interpret standardized residuals, leverage, Cook’s \(D\), and DFFITS; (3) apply appropriate remedial measures including casewise deletion, transformations, and robust regression.


1 Motivating the Problem

The ordinary least squares (OLS) estimator minimizes the sum of squared residuals. Because squaring amplifies large deviations, even a single atypical observation can pull the fitted line away from the bulk of the data and distort coefficient estimates, standard errors, and inference. Before accepting any regression output, we must diagnose whether extreme observations are driving our results.

Three distinct types of extreme observations must be carefully distinguished:

  • Outlier: an observation with an unusually large residual, i.e., far from the fitted regression surface in the \(Y\) direction. Primary impact: inflates MSE; distorts \(R^2\) and hypothesis tests.
  • High-leverage point: an observation far from the center of the predictor space (unusual \(X\) values), regardless of its residual. Primary impact: can exert strong influence even with a small residual.
  • Influential point: an observation whose removal substantially changes one or more estimated coefficients. Primary impact: directly biases parameter estimates and predictions.

Key Insight: A high-leverage point is not necessarily influential — if it falls close to the regression line it is well-fitted. An outlier is not necessarily influential — if it has low leverage its effect on the fit may be small. Influence combines both extremity in \(X\) and extremity in \(Y\).
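
A quick way to internalize these distinctions is to perturb a small simulated dataset. The following R sketch (hypothetical data, constructed purely for illustration) appends one point of each type to a clean sample and reports how much the coefficients move:

# Illustrative sketch with simulated data: append one extreme point of each
# type to a clean sample and compare the refitted coefficients.
set.seed(1)
x <- runif(15, 0, 10)
y <- 2 + 1.5 * x + rnorm(15, sd = 1)
fit_clean <- lm(y ~ x)

# (a) outlier, low leverage; (b) high leverage, not influential;
# (c) high leverage and influential
x_new <- c(5, 20, 20)
y_new <- c(2 + 1.5 * 5 + 10, 2 + 1.5 * 20, 2 + 1.5 * 20 - 15)
for (k in 1:3) {
  fit_k <- lm(c(y, y_new[k]) ~ c(x, x_new[k]))
  print(coef(fit_k) - coef(fit_clean))   # shift in (intercept, slope)
}

In most draws, case (c) moves the slope far more than (a) or (b): the outlier (a) mainly inflates residual variance, and the well-fitted high-leverage point (b) barely changes the fit.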


2 Residual Analysis

Residuals are the primary diagnostic tool. However, raw residuals have different variances and are not directly comparable across observations. Several standardized forms address this.

2.1 Raw and Studentized Residuals

The raw residual for observation \(i\) is:

\[e_i = y_i - \hat{y}_i\]

Raw residuals have variance \(\text{Var}(e_i) = \sigma^2(1 - h_{ii})\), where \(h_{ii}\) is the leverage of observation \(i\) (defined below). To produce residuals with unit variance, we standardize:

Internally studentized (standardized) residual:

\[r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}\]

Externally studentized (studentized deleted) residual:

\[t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}\]

where \(s_{(i)}\) is the RMSE from the model with observation \(i\) deleted. Under the standard regression assumptions, \(t_i \sim t_{n-p-1}\).

Rule of Thumb: Flag observations with \(|r_i| > 2\) for further inspection, and treat \(|r_i| > 3\) as strong evidence of an outlier. With large samples some exceedances are expected by chance — apply a Bonferroni-adjusted threshold \(\alpha^* = \alpha / n\).
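
As a minimal R sketch, both the flagging rule and the Bonferroni-adjusted test can be computed directly from a fitted lm object (here assumed to be called fit); the car package's outlierTest() automates the same check:

n <- nobs(fit)
p <- length(coef(fit))
t_i <- rstudent(fit)                             # externally studentized residuals
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)   # two-sided, alpha* = 0.05/n
which(abs(t_i) > crit)                           # observations flagged as outliers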

2.2 Leverage (Hat Values)

The hat matrix

\[\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\]

maps observed \(\mathbf{y}\) to fitted values: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\). Its diagonal elements \(h_{ii}\) (which satisfy \(1/n \leq h_{ii} \leq 1\) for models with an intercept) measure how far observation \(i\) is from the centroid of the predictor space. For simple linear regression:

\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\]

Properties:

\[\sum_{i=1}^n h_{ii} = p, \qquad \bar{h} = \frac{p}{n}\]

where \(p\) is the number of parameters (including the intercept).

Threshold: Flag observations with \(h_{ii} > 2p/n\) (twice the average leverage). In small samples use \(h_{ii} > 3p/n\).

Conceptual Note: Leverage measures potential influence based on the design matrix \(\mathbf{X}\) alone — it says nothing about \(Y\). A high-leverage point has an unusual combination of predictor values and therefore exerts strong “pull” on the fitted line, but whether it actually distorts estimates depends on whether its \(Y\) value is consistent with the pattern of the other data.
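
A minimal R sketch of the leverage rule, assuming a fitted lm object fit:

h <- hatvalues(fit)
p <- length(coef(fit)); n <- nobs(fit)
sum(h)                    # sanity check: equals p
which(h > 2 * p / n)      # twice-average-leverage rule (use 3p/n in small samples)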

2.3 Cook’s Distance

Cook’s distance measures the aggregate change in all fitted values when observation \(i\) is deleted:

\[D_i = \frac{(\hat{\mathbf{b}} - \hat{\mathbf{b}}_{(i)})'(\mathbf{X}'\mathbf{X})(\hat{\mathbf{b}} - \hat{\mathbf{b}}_{(i)})}{p \cdot \text{MSE}}\]

An equivalent and more convenient computational form is:

\[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}\]

This second form is instructive: \(D_i\) is the product of the outlyingness of the observation (\(r_i^2/p\)) and its leverage (\(h_{ii}/(1-h_{ii})\)). A large \(D_i\) requires both an unusual \(Y\) and an unusual \(X\) position.

  • \(D_i < 0.5\): little influence; generally safe to proceed.
  • \(0.5 \leq D_i < 1.0\): moderate influence; investigate the observation.
  • \(D_i \geq 1.0\): high influence; serious concern, remediation needed.

Graphical approach: Plot \(D_i\) against observation index and compare to the \(F(p,\, n-p)\) distribution: values exceeding its 50th percentile suggest the observation is meaningfully influential.
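
A short R sketch of both the tabled cutoffs and the F-percentile comparison, again assuming a fitted lm object fit:

D <- cooks.distance(fit)
p <- length(coef(fit)); n <- nobs(fit)
cutoff <- qf(0.5, df1 = p, df2 = n - p)      # median of F(p, n - p)
which(D > cutoff)                            # observations worth investigating
plot(D, type = "h", ylab = "Cook's D")       # index plot
abline(h = cutoff, lty = 2)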

2.4 DFFITS and DFBETAS

Cook’s \(D\) aggregates influence across all fitted values. Occasionally it is more informative to measure influence on individual predictions or individual coefficients.

DFFITS — influence on the \(i\)-th predicted value (self-prediction):

\[\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}\]

Thresholds: \(|\text{DFFITS}_i| > 2\sqrt{p/n}\) for large \(n\); \(|\text{DFFITS}_i| > 1\) for small \(n\).

DFBETAS — influence on the \(j\)-th coefficient:

\[\text{DFBETAS}_{j,i} = \frac{\hat{b}_j - \hat{b}_{j(i)}}{s_{(i)}\sqrt{c_{jj}}}\]

where \(c_{jj} = [(\mathbf{X}'\mathbf{X})^{-1}]_{jj}\). Threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\).

DFBETAS is particularly useful because it identifies which specific coefficient an observation is distorting. An observation may be highly influential for one coefficient but negligible for others.
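
The size-adjusted cutoffs translate directly into R; a sketch assuming a fitted lm object fit:

p <- length(coef(fit)); n <- nobs(fit)
which(abs(dffits(fit)) > 2 * sqrt(p / n))        # influential self-predictions
db <- dfbetas(fit)                               # n x p matrix, one column per coefficient
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)     # (observation, coefficient) pairs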

2.5 Summary Table of Diagnostics

  • Raw residual \(e_i\): vertical distance from the fitted line. Threshold: none standard.
  • Studentized residual \(r_i\): outlyingness in \(Y\) (variance-stabilized). Threshold: \(|r_i| > 2\) (flag); \(> 3\) (strong).
  • Externally studentized \(t_i\): outlyingness in \(Y\) (uses deleted MSE). Threshold: Bonferroni-corrected \(t\) critical value.
  • Leverage \(h_{ii}\): extremity in predictor space. Threshold: \(> 2p/n\).
  • Cook's \(D_i\): aggregate influence on all fitted values. Threshold: \(> 0.5\) (moderate); \(> 1.0\) (high).
  • DFFITS: influence on own predicted value. Threshold: \(> 2\sqrt{p/n}\).
  • DFBETAS\(_{j,i}\): influence on the \(j\)-th coefficient. Threshold: \(> 2/\sqrt{n}\).

2.6 Residual Plots

Numerical diagnostics should always be complemented by graphical examination. The following plots form a standard diagnostic battery:

  • Residuals vs. fitted values (\(\hat{y}\)): Non-random patterns suggest model misspecification or heteroscedasticity. Fanning indicates non-constant variance.
  • Normal Q-Q plot of residuals: Systematic departure from a straight line indicates non-normality. Heavy tails show as S-shaped curves.
  • Scale-location plot (\(\sqrt{|r_i|}\) vs. \(\hat{y}\)): A horizontal band indicates constant variance; a rising trend indicates heteroscedasticity.
  • Residuals vs. leverage (influence plot): Observations in the upper- or lower-right corners (high leverage + large residual) are most dangerous. Contours of Cook’s \(D\) are often overlaid.
  • Index plot of Cook’s \(D\) or DFFITS: Identifies which specific observations merit scrutiny.

3 Remedial Measures

After identifying extreme observations, the analyst must decide how to proceed. The appropriate action depends on the source of the anomaly — there is no one-size-fits-all answer.

Before Acting: Always investigate why an observation is extreme. Data entry errors should be corrected or deleted. Legitimate observations representing a genuinely different sub-population may call for a more complex model rather than deletion. Influential observations that are correct and representative should never be silently discarded.

3.1 Casewise Deletion

When to use: The observation results from a data entry error, equipment malfunction, or a known event that makes it unrepresentative of the population of interest.

Procedure: Remove the observation(s), refit the model, and report both the full-sample and reduced-sample results. Sensitivity of conclusions to deletion is itself informative.

Cautions:

  • Deletion reduces sample size and statistical power.
  • Deleting observations solely because they are influential, without substantive justification, constitutes data manipulation and invalidates inference.
  • If multiple influential observations exist, sequential deletion can be misleading due to masking — removing one outlier may reveal another that was hidden.
  • Masking and swamping: In the presence of multiple outliers, some genuine outliers may appear clean (masking) and clean observations may appear outlying (swamping). Robust methods are more resistant to these problems.

3.2 Variable Transformations

Transformations can reduce the influence of outliers by pulling extreme values toward the bulk of the distribution, and may simultaneously address non-linearity and heteroscedasticity.

3.2.1 Transformations on \(Y\)

The Box-Cox family provides a systematic approach to choosing a power transformation of the response:

\[y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \lambda \neq 0 \\[6pt] \ln(y_i) & \lambda = 0 \end{cases}\]

Estimate \(\lambda\) by maximizing the profile log-likelihood:

\[\ell(\lambda) = -\frac{n}{2}\ln\!\left(\frac{\text{RSS}(\lambda)}{n}\right) + (\lambda - 1)\sum_{i=1}^n \ln(y_i)\]

  • \(\lambda = 1\) (identity): data already well-behaved.
  • \(\lambda = 0.5\) (\(\sqrt{y}\)): count data, mild right skew.
  • \(\lambda = 0\) (\(\ln y\)): right-skewed data, multiplicative relationships.
  • \(\lambda = -0.5\) (\(1/\sqrt{y}\)): right skew too pronounced for the log.
  • \(\lambda = -1\) (\(1/y\)): severe right skew, rates.

Important: After a log transformation, \(\exp(\hat{y})\) estimates the conditional median on the original scale, not the mean. By Jensen’s inequality, \(E[Y] = E[\exp(\ln Y)] \neq \exp(E[\ln Y])\), so a smearing correction may be needed when back-transforming to obtain mean predictions.
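
One common correction is Duan’s nonparametric smearing estimator; an R sketch, assuming a hypothetical log-scale fit fit_log <- lm(log(y) ~ x, data = dat):

e         <- residuals(fit_log)
smear     <- mean(exp(e))              # Duan's smearing factor
yhat_med  <- exp(fitted(fit_log))      # median predictions on the original scale
yhat_mean <- smear * yhat_med          # approximately unbiased mean predictions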

3.2.2 Transformations on \(X\)

Transforming predictors can reduce leverage by pulling extreme \(X\) values toward the center of the distribution. Common transformations include \(\ln(x)\), \(\sqrt{x}\), and \(1/x\). The Box-Tidwell procedure estimates power transformations for predictors analogously to Box-Cox. Adding polynomial or spline terms can also address nonlinearity that makes observations appear as outliers only because the wrong functional form was assumed.

3.3 Robust Regression

Robust regression methods reduce the weight given to observations with large residuals, yielding estimates less sensitive to outliers without requiring explicit deletion.

3.3.1 M-Estimators

M-estimators replace the sum of squared residuals with a sum of a loss function \(\rho\) applied to scaled residuals:

\[\min_{\mathbf{b}} \sum_{i=1}^n \rho\!\left(\frac{e_i}{s}\right)\]

The first-order conditions are:

\[\mathbf{X}'\boldsymbol{\psi}\!\left(\frac{\mathbf{e}}{s}\right) = \mathbf{0}, \qquad \psi = \rho'\]

These are solved iteratively via IRLS (Iteratively Reweighted Least Squares):

\[w_i^{(t)} = \frac{\psi(e_i^{(t)}/s)}{e_i^{(t)}/s}\]

At each iteration, we solve the weighted least squares problem \((\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X})\mathbf{b}^{(t+1)} = \mathbf{X}'\mathbf{W}^{(t)}\mathbf{y}\) and update weights until convergence.
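
For intuition, a bare-bones IRLS implementation with Huber weights is sketched below (for real work use MASS::rlm); it assumes a design matrix X that already includes an intercept column and a response vector y:

irls_huber <- function(X, y, k = 1.345, tol = 1e-8, maxit = 50) {
  b <- qr.solve(X, y)                           # OLS starting values
  for (it in 1:maxit) {
    e <- as.vector(y - X %*% b)
    s <- median(abs(e)) / 0.6745                # robust scale via the MAD
    u <- e / s
    w <- ifelse(abs(u) <= k, 1, k / abs(u))     # Huber weights psi(u)/u
    b_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))   # WLS step
    if (max(abs(b_new - b)) < tol) break
    b <- b_new
  }
  drop(b)
}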

Common loss functions:

  • OLS: \(\rho(u) = u^2/2\). No robustness; all residuals weighted equally.
  • Huber: \(\rho(u) = u^2/2\) for \(|u| \leq k\), and \(k|u| - k^2/2\) for \(|u| > k\). Linear in the tails; \(k = 1.345\) gives 95% efficiency relative to OLS under normality.
  • Tukey bisquare: \(\rho(u) = \frac{c^2}{6}\left[1 - \left(1 - u^2/c^2\right)^3\right]\) for \(|u| \leq c\), and \(c^2/6\) for \(|u| > c\). Zero weight beyond \(c = 4.685\); excellent robustness, but non-convex.
  • LAD: \(\rho(u) = |u|\). Minimizes \(\sum|e_i|\); equivalent to median regression.

3.3.2 Least Trimmed Squares (LTS)

LTS minimizes the sum of the \(h\) smallest squared residuals:

\[\min_{\mathbf{b}} \sum_{i=1}^{h} e_{(i)}^2\]

where \(e_{(1)}^2 \leq e_{(2)}^2 \leq \cdots \leq e_{(n)}^2\) are the order statistics of squared residuals, and typically \(h = \lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor\). LTS has a 50% breakdown point but is computationally intensive; it is most useful as a diagnostic starting point for heavily contaminated data.
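
In R, LTS is available through the robustbase package; a sketch using the same hypothetical variables y, x1, x2 in dat as Section 5:

library(robustbase)
fit_lts <- ltsReg(y ~ x1 + x2, data = dat, alpha = 0.5)   # alpha = 0.5: maximal (~50%) trimming
summary(fit_lts)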

3.3.3 Quantile Regression

Quantile regression estimates the conditional \(\tau\)-th quantile of \(Y\) given \(\mathbf{X}\) by minimizing:

\[\min_{\mathbf{b}} \sum_{i=1}^n \rho_\tau(e_i), \qquad \rho_\tau(u) = u\bigl(\tau - \mathbf{1}(u < 0)\bigr)\]

Median regression (\(\tau = 0.5\)) is equivalent to LAD and is highly robust to outliers in \(Y\). Modeling multiple quantiles simultaneously provides a richer picture of the conditional distribution.

When to use: Asymmetric distributions, heteroscedasticity, or when the effect of \(X\) may differ across the distribution of \(Y\).
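
A sketch of multi-quantile fitting with the quantreg package (hypothetical variables, matching the notation in Section 5):

library(quantreg)
fit_q <- rq(y ~ x1 + x2, tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = dat)
summary(fit_q)    # one set of coefficients per requested quantile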

3.4 Weighted Least Squares (WLS)

If \(\text{Var}(\varepsilon_i) = \sigma^2 / w_i\) for known weights \(w_i\), WLS minimizes:

\[\sum_{i=1}^n w_i e_i^2\]

with solution:

\[\hat{\mathbf{b}}_{\text{WLS}} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}, \qquad \mathbf{W} = \text{diag}(w_1, \ldots, w_n)\]

Caution: WLS is not the same as robust regression. WLS requires specifying the variance function in advance; it does not adaptively downweight outliers based on their residuals.
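
In R, WLS is the weights argument of lm(). The sketch below assumes, purely for illustration, a variance model \(\text{Var}(\varepsilon_i) \propto x_1\) so that \(w_i = 1/x_1\):

fit_wls <- lm(y ~ x1 + x2, data = dat, weights = 1 / x1)   # minimizes sum of w_i * e_i^2
summary(fit_wls)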

3.5 Other Options

3.5.1 Model Respecification

Sometimes an observation appears influential because an important predictor is omitted or the functional form is incorrect. Adding interaction terms, quadratic terms, or a previously omitted variable may substantially reduce influence statistics across the board.
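
A brief sketch (hypothetical variables as in Section 5) of checking whether added structure absorbs an apparent outlier:

fit_quad <- lm(y ~ x1 + I(x1^2) + x2, data = dat)   # add a quadratic term
fit_int  <- lm(y ~ x1 * x2, data = dat)             # or an interaction
max(cooks.distance(fit_quad))                       # did maximum influence drop?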

3.5.2 Mixed Models and Random Effects

When outlying observations belong to identifiable groups (e.g., a particular school, hospital, or time period), a mixed model with random effects can partially pool information and naturally shrink extreme group-level estimates toward the grand mean.
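
A sketch with the lme4 package, assuming a hypothetical grouping factor group (e.g., school or hospital) in dat:

library(lme4)
fit_mm <- lmer(y ~ x1 + x2 + (1 | group), data = dat)   # random intercepts per group
ranef(fit_mm)                                            # shrunken group-level deviations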

3.5.3 Heavy-Tailed Error Distributions

Modeling the error term with a \(t\) distribution with \(\nu\) degrees of freedom:

\[\varepsilon_i \sim t_\nu(0,\, \sigma^2)\]

makes the likelihood less sensitive to large residuals without requiring explicit identification and deletion of outliers. As \(\nu \to \infty\) the model approaches normality; small \(\nu\) (e.g., \(\nu = 4\) to \(6\)) provides substantial robustness. This is equivalent to a form of adaptive robust regression from a Bayesian perspective.
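
As a rough sketch of how this works, the \(t\) likelihood (with \(\nu\) fixed at 5) can be maximized directly. The function below is illustrative only; it assumes a design matrix X with an intercept column and response y, and is not a substitute for purpose-built robust or Bayesian tooling:

t_reg <- function(X, y, nu = 5) {
  nll <- function(par) {                            # negative log-likelihood
    b <- par[1:ncol(X)]
    log_s <- par[ncol(X) + 1]                       # optimize log(sigma) so sigma > 0
    z <- (y - as.vector(X %*% b)) / exp(log_s)
    -sum(dt(z, df = nu, log = TRUE) - log_s)        # scaled-t log density
  }
  start <- c(qr.solve(X, y), 0)                     # OLS coefficients, log(sigma) = 0
  optim(start, nll, method = "BFGS")$par            # (b_1, ..., b_p, log sigma)
}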


4 Decision Framework

The following workflow summarizes the recommended sequence of steps when extreme observations are detected:

  1. Detect: Compute studentized residuals, leverage, Cook’s \(D\), DFFITS. Plot residuals vs. leverage.
  2. Investigate: Return to the original data for flagged observations. Check for data entry errors, recording anomalies, or known events.
  3. Classify: Is the observation an error? A legitimate but unusual case? Does it belong to a different subpopulation?
  4. Act appropriately: Correct errors. Delete with justification if truly non-representative. Consider transformations or robust methods if the phenomenon naturally produces occasional extreme values.
  5. Compare results: Report the full-sample and sensitivity analyses. Conclusions that change substantially upon deletion of a few observations are fragile and should be communicated as such.
  6. Document: Fully report which observations were flagged, what investigation revealed, and what action was taken.

Reporting Standard: When influential observations are identified, best practice is to report both the full-data results and the results with influential observations removed or downweighted. Transparency about sensitivity is more credible than silently deleting inconvenient data points.


5 Software Implementation

5.1 R

fit <- lm(y ~ x1 + x2, data = dat)

# Standard diagnostics
plot(fit)                   # 4 standard diagnostic plots
influence.measures(fit)     # hat, Cook's D, DFFITS, DFBETAS (all at once)
hatvalues(fit)              # leverage h_ii
rstandard(fit)              # internally studentized residuals r_i
rstudent(fit)               # externally studentized residuals t_i
cooks.distance(fit)         # Cook's D
dffits(fit)                 # DFFITS
dfbetas(fit)                # DFBETAS (one column per coefficient)

# Robust regression (MASS package)
library(MASS)
fit_huber <- rlm(y ~ x1 + x2, data = dat, psi = psi.huber)
fit_bisq  <- rlm(y ~ x1 + x2, data = dat, psi = psi.bisquare)

# Quantile / median regression (quantreg package)
library(quantreg)
fit_med <- rq(y ~ x1 + x2, tau = 0.5, data = dat)

# Box-Cox transformation (MASS)
bc <- boxcox(fit)           # plots profile log-likelihood for lambda
bc$x[which.max(bc$y)]      # MLE of lambda

5.2 Python (statsmodels)

import statsmodels.api as sm

model = sm.OLS(y, X).fit()
infl  = model.get_influence()

leverage  = infl.hat_matrix_diag          # h_ii
std_resid = infl.resid_studentized_internal   # r_i
ext_resid = infl.resid_studentized_external   # t_i
cooks_d   = infl.cooks_distance[0]        # D_i
dffits    = infl.dffits[0]
dfbetas   = infl.dfbetas

# Robust regression (M-estimator with Huber weights)
rlm_model = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

6 Practice Problems

  1. Conceptual: Construct a hypothetical scatterplot with \(n = 15\) observations. Sketch a point that is (a) an outlier but not influential; (b) high leverage but not influential; (c) both high leverage and influential. Explain why each configuration arises.

  2. Computation: For a simple linear regression with \(n = 20\) observations, the fitted model is \(\hat{y} = 3.2 + 1.8x\). Observation 7 has \(x_7 = 15\) (\(\bar{x} = 6\), \(SS_x = 200\)), \(e_7 = 4.1\), and \(\text{MSE} = 2.25\). Compute \(h_{77}\), the internally studentized residual \(r_7\), and Cook’s \(D_7\).

  3. Interpretation: A Cook’s \(D\) plot reveals that observation 3 has \(D_3 = 1.42\). Describe the steps you would take before deciding how to handle this observation. Under what circumstances would deletion be justified? Under what circumstances would robust regression be preferred?

  4. Robust vs. OLS: Explain in your own words why the Huber M-estimator provides a smooth transition between OLS (for small residuals) and LAD (for large residuals). What are the practical advantages over simply deleting outliers?

  5. Transformations: A Box-Cox analysis on a regression of housing prices on square footage yields \(\hat{\lambda} = 0.08\) with a 95% confidence interval of \([-0.05,\, 0.20]\). What transformation would you apply, and why? How does this affect the interpretation of coefficients?


7 Key Terms

  • Leverage \(h_{ii}\): diagonal of \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\); measures extremity in predictor space.
  • Hat matrix \(\mathbf{H}\): projection matrix mapping \(\mathbf{y}\) to \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).
  • Internally studentized residual \(r_i\): \(e_i / [s\sqrt{1-h_{ii}}]\).
  • Externally studentized residual \(t_i\): \(e_i / [s_{(i)}\sqrt{1-h_{ii}}]\); follows \(t_{n-p-1}\).
  • Cook’s distance \(D_i\): aggregate change in all fitted values upon deletion of observation \(i\).
  • DFFITS: change in \(\hat{y}_i\) upon deletion of \(i\), scaled by \(s_{(i)}\sqrt{h_{ii}}\).
  • DFBETAS: change in \(\hat{b}_j\) upon deletion of \(i\), scaled by the coefficient’s standard error.
  • Breakdown point: maximum fraction of contaminated observations an estimator can tolerate (OLS: 0%; LTS: up to 50%).
  • IRLS: Iteratively Reweighted Least Squares; the algorithm used to compute M-estimators.
  • Masking: one outlier hides another by pulling the fit toward it.

8 References

  • Cook, R. D. & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall.
  • Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models (3rd ed.). SAGE. [Chapters 11–13]
  • Huber, P. J. & Ronchetti, E. M. (2009). Robust Statistics (2nd ed.). Wiley.
  • Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to Linear Regression Analysis (6th ed.). Wiley. [Chapter 4]
  • Rousseeuw, P. J. & Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley.
  • Velleman, P. F. & Welsch, R. E. (1981). Efficient computing of regression diagnostics. The American Statistician, 35(4), 234–242.