Learning Objectives: By the end of this lecture, students will be able to: (1) distinguish outliers, high-leverage points, and influential observations; (2) compute and interpret standardized residuals, leverage, Cook’s \(D\), and DFFITS; (3) apply appropriate remedial measures including casewise deletion, transformations, and robust regression.
The ordinary least squares (OLS) estimator minimizes the sum of squared residuals. Because squaring amplifies large deviations, even a single atypical observation can pull the fitted line away from the bulk of the data and distort coefficient estimates, standard errors, and inference. Before accepting any regression output, we must diagnose whether extreme observations are driving our results.
Three distinct types of extreme observations must be carefully distinguished:
| Type | Definition | Primary Impact |
|---|---|---|
| Outlier | Observation with an unusually large residual — far from the fitted regression surface in the \(Y\) direction. | Inflates MSE; distorts \(R^2\) and hypothesis tests. |
| High-Leverage Point | Observation far from the center of the predictor space (unusual \(X\) values), regardless of its residual. | Can exert strong influence even with a small residual. |
| Influential Point | Observation whose removal substantially changes one or more estimated coefficients. | Directly biases parameter estimates and predictions. |
Key Insight: A high-leverage point is not necessarily influential — if it falls close to the regression line it is well-fitted. An outlier is not necessarily influential — if it has low leverage its effect on the fit may be small. Influence combines both extremity in \(X\) and extremity in \(Y\).
Residuals are the primary diagnostic tool. However, raw residuals have different variances and are not directly comparable across observations. Several standardized forms address this.
The raw residual for observation \(i\) is:
\[e_i = y_i - \hat{y}_i\]
Raw residuals have variance \(\text{Var}(e_i) = \sigma^2(1 - h_{ii})\), where \(h_{ii}\) is the leverage of observation \(i\) (defined below). To produce residuals with unit variance, we standardize:
Internally studentized (standardized) residual:
\[r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}\]
Externally studentized (studentized deleted) residual:
\[t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}\]
where \(s_{(i)}\) is the RMSE from the model with observation \(i\) deleted. Under the standard regression assumptions, \(t_i \sim t_{n-p-1}\).
Rule of Thumb: Flag observations with \(|r_i| > 2\) for further inspection, and treat \(|r_i| > 3\) as strong evidence of an outlier. With large samples some exceedances are expected by chance — apply a Bonferroni-adjusted threshold \(\alpha^* = \alpha / n\).
The hat matrix
\[\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\]
maps observed \(\mathbf{y}\) to fitted values: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\). Its diagonal elements \(h_{ii} \in [0, 1]\) measure how far observation \(i\) is from the centroid of the predictor space. For simple linear regression:
\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\]
Properties:
\[\sum_{i=1}^n h_{ii} = p, \qquad \bar{h} = \frac{p}{n}\]
where \(p\) is the number of parameters (including the intercept).
Threshold: Flag observations with \(h_{ii} > 2p/n\) (twice the average leverage). In small samples use \(h_{ii} > 3p/n\).
Conceptual Note: Leverage measures potential influence based on the design matrix \(\mathbf{X}\) alone — it says nothing about \(Y\). A high-leverage point has an unusual combination of predictor values and therefore exerts strong “pull” on the fitted line, but whether it actually distorts estimates depends on whether its \(Y\) value is consistent with the pattern of the other data.
Cook’s distance measures the aggregate change in all fitted values when observation \(i\) is deleted:
\[D_i = \frac{(\hat{\mathbf{b}} - \hat{\mathbf{b}}_{(i)})'(\mathbf{X}'\mathbf{X})(\hat{\mathbf{b}} - \hat{\mathbf{b}}_{(i)})}{p \cdot \text{MSE}}\]
An equivalent and more convenient computational form is:
\[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}\]
This second form is instructive: \(D_i\) is the product of the outlyingness of the observation (\(r_i^2/p\)) and its leverage (\(h_{ii}/(1-h_{ii})\)). A large \(D_i\) requires both an unusual \(Y\) and an unusual \(X\) position.
| \(D_i\) Value | Interpretation | Action |
|---|---|---|
| \(D_i < 0.5\) | Little influence | Generally safe to proceed |
| \(0.5 \leq D_i < 1.0\) | Moderate influence | Investigate the observation |
| \(D_i \geq 1.0\) | High influence | Serious concern; remediation needed |
Graphical approach: Plot \(D_i\) against observation index. Compare to the \(F(p,\, n-p)\) distribution: values beyond the 50th percentile suggest the observation is meaningfully influential.
Cook’s \(D\) aggregates influence across all fitted values. Occasionally it is more informative to measure influence on individual predictions or individual coefficients.
DFFITS — influence on the \(i\)-th predicted value (self-prediction):
\[\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}\]
Thresholds: \(|\text{DFFITS}_i| > 2\sqrt{p/n}\) for large \(n\); \(|\text{DFFITS}_i| > 1\) for small \(n\).
DFBETAS — influence on the \(j\)-th coefficient:
\[\text{DFBETAS}_{j,i} = \frac{\hat{b}_j - \hat{b}_{j(i)}}{s_{(i)}\sqrt{c_{jj}}}\]
where \(c_{jj} = [(\mathbf{X}'\mathbf{X})^{-1}]_{jj}\). Threshold: \(|\text{DFBETAS}_{j,i}| > 2/\sqrt{n}\).
DFBETAS is particularly useful because it identifies which specific coefficient an observation is distorting. An observation may be highly influential for one coefficient but negligible for others.
| Diagnostic | What It Measures | Threshold |
|---|---|---|
| Raw residual \(e_i\) | Vertical distance from fitted line | No standard threshold |
| Studentized residual \(r_i\) | Outlyingness in \(Y\) (variance-stabilized) | \(|r_i| > 2\) (flag); \(> 3\) (strong) |
| Externally studentized \(t_i\) | Outlyingness in \(Y\) (uses deleted MSE) | Bonferroni-corrected \(t\) critical value |
| Leverage \(h_{ii}\) | Extremity in predictor space | \(> 2p/n\) |
| Cook’s \(D_i\) | Aggregate influence on all fits | \(> 0.5\) (moderate); \(> 1.0\) (high) |
| DFFITS | Influence on own predicted value | \(> 2\sqrt{p/n}\) |
| DFBETAS\(_{j,i}\) | Influence on \(j\)-th coefficient | \(> 2/\sqrt{n}\) |
Numerical diagnostics should always be complemented by graphical examination. The following plots form a standard diagnostic battery:
After identifying extreme observations, the analyst must decide how to proceed. The appropriate action depends on the source of the anomaly — there is no one-size-fits-all answer.
Before Acting: Always investigate why an observation is extreme. Data entry errors should be corrected or deleted. Legitimate observations representing a genuinely different sub-population may call for a more complex model rather than deletion. Influential observations that are correct and representative should never be silently discarded.
When to use: The observation results from a data entry error, equipment malfunction, or a known event that makes it unrepresentative of the population of interest.
Procedure: Remove the observation(s), refit the model, and report both the complete-case and reduced-sample results. Sensitivity of conclusions to deletion is itself informative.
Cautions:
Transformations can reduce the influence of outliers by pulling extreme values toward the bulk of the distribution, and may simultaneously address non-linearity and heteroscedasticity.
The Box-Cox family provides a systematic approach to choosing a power transformation of the response:
\[y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \lambda \neq 0 \\[6pt] \ln(y_i) & \lambda = 0 \end{cases}\]
Estimate \(\lambda\) by maximizing the profile log-likelihood:
\[\ell(\lambda) = -\frac{n}{2}\ln\!\left(\frac{\text{RSS}(\lambda)}{n}\right) + (\lambda - 1)\sum_{i=1}^n \ln(y_i)\]
| \(\lambda\) | Transformation | Typical Use Case |
|---|---|---|
| \(1\) | Identity | Data already well-behaved |
| \(0.5\) | \(\sqrt{y}\) | Count data, mild right skew |
| \(0\) | \(\ln(y)\) | Right-skewed, multiplicative relationships |
| \(-0.5\) | \(1/\sqrt{y}\) | Moderate left-pull |
| \(-1\) | \(1/y\) | Strong right skew, rates |
Important: After a log transformation, the median prediction on the original scale is \(\exp(\hat{y})\). Note that \(E[\exp(\ln y)] \neq \exp(E[\ln y])\) due to Jensen’s inequality — a smearing correction may be needed when back-transforming means.
Transforming predictors can reduce leverage by pulling extreme \(X\) values toward the center of the distribution. Common transformations include \(\ln(x)\), \(\sqrt{x}\), and \(1/x\). The Box-Tidwell procedure estimates power transformations for predictors analogously to Box-Cox. Adding polynomial or spline terms can also address nonlinearity that makes observations appear as outliers only because the wrong functional form was assumed.
Robust regression methods reduce the weight given to observations with large residuals, yielding estimates less sensitive to outliers without requiring explicit deletion.
M-estimators replace the sum of squared residuals with a sum of a loss function \(\rho\) applied to scaled residuals:
\[\min_{\mathbf{b}} \sum_{i=1}^n \rho\!\left(\frac{e_i}{s}\right)\]
The first-order conditions are:
\[\mathbf{X}'\boldsymbol{\psi}\!\left(\frac{\mathbf{e}}{s}\right) = \mathbf{0}, \qquad \psi = \rho'\]
These are solved iteratively via IRLS (Iteratively Reweighted Least Squares):
\[w_i^{(t)} = \frac{\psi(e_i^{(t)}/s)}{e_i^{(t)}/s}\]
At each iteration, we solve the weighted least squares problem \((\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X})\mathbf{b}^{(t+1)} = \mathbf{X}'\mathbf{W}^{(t)}\mathbf{y}\) and update weights until convergence.
Common loss functions:
| Estimator | \(\rho(u)\) | Notes |
|---|---|---|
| OLS | \(u^2/2\) | No robustness; all residuals equally weighted |
| Huber | \(\begin{cases} u^2/2, & |u| \leq k \\ k|u| - k^2/2, & |u| > k \end{cases}\) | Linear in tails; \(k = 1.345\) gives 95% efficiency vs. OLS under normality |
| Tukey bisquare | \(\begin{cases} \frac{c^2}{6}\!\left[1 - \left(1 - \frac{u^2}{c^2}\right)^3\right], & |u| \leq c \\ c^2/6, & |u| > c \end{cases}\) | Zero weight beyond \(c = 4.685\); excellent robustness, non-convex |
| LAD | \(|u|\) | Minimizes \(\sum|e_i|\); equivalent to median regression |
LTS minimizes the sum of the \(h\) smallest squared residuals:
\[\min_{\mathbf{b}} \sum_{i=1}^{h} e_{(i)}^2\]
where \(e_{(1)}^2 \leq e_{(2)}^2 \leq \cdots \leq e_{(n)}^2\) are the order statistics of squared residuals, and typically \(h = \lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor\). LTS has a 50% breakdown point but is computationally intensive; it is most useful as a diagnostic starting point for heavily contaminated data.
Quantile regression estimates the conditional \(\tau\)-th quantile of \(Y\) given \(\mathbf{X}\) by minimizing:
\[\min_{\mathbf{b}} \sum_{i=1}^n \rho_\tau(e_i), \qquad \rho_\tau(u) = u\bigl(\tau - \mathbf{1}(u < 0)\bigr)\]
Median regression (\(\tau = 0.5\)) is equivalent to LAD and is highly robust to outliers in \(Y\). Modeling multiple quantiles simultaneously provides a richer picture of the conditional distribution.
When to use: Asymmetric distributions, heteroscedasticity, or when the effect of \(X\) may differ across the distribution of \(Y\).
If \(\text{Var}(\varepsilon_i) = \sigma^2 / w_i\) for known weights \(w_i\), WLS minimizes:
\[\sum_{i=1}^n w_i e_i^2\]
with solution:
\[\hat{\mathbf{b}}_{\text{WLS}} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}, \qquad \mathbf{W} = \text{diag}(w_1, \ldots, w_n)\]
Caution: WLS is not the same as robust regression. WLS requires specifying the variance function in advance; it does not adaptively downweight outliers based on their residuals.
Sometimes an observation appears influential because an important predictor is omitted or the functional form is incorrect. Adding interaction terms, quadratic terms, or a previously omitted variable may substantially reduce influence statistics across the board.
When outlying observations belong to identifiable groups (e.g., a particular school, hospital, or time period), a mixed model with random effects can partially pool information and naturally shrink extreme group-level estimates toward the grand mean.
Modeling the error term with a \(t\) distribution with \(\nu\) degrees of freedom:
\[\varepsilon_i \sim t_\nu(0,\, \sigma^2)\]
makes the likelihood less sensitive to large residuals without requiring explicit identification and deletion of outliers. As \(\nu \to \infty\) the model approaches normality; small \(\nu\) (e.g., \(\nu = 4\)–\(6\)) provides substantial robustness. This is equivalent to a form of adaptive robust regression from a Bayesian perspective.
The following workflow summarizes the recommended sequence of steps when extreme observations are detected:
Reporting Standard: When influential observations are identified, best practice is to report both the full-data results and the results with influential observations removed or downweighted. Transparency about sensitivity is more credible than silently deleting inconvenient data points.
fit <- lm(y ~ x1 + x2, data = dat)
# Standard diagnostics
plot(fit) # 4 standard diagnostic plots
influence.measures(fit) # hat, Cook's D, DFFITS, DFBETAS (all at once)
hatvalues(fit) # leverage h_ii
rstandard(fit) # internally studentized residuals r_i
rstudent(fit) # externally studentized residuals t_i
cooks.distance(fit) # Cook's D
dffits(fit) # DFFITS
dfbetas(fit) # DFBETAS (one column per coefficient)
# Robust regression (MASS package)
library(MASS)
fit_huber <- rlm(y ~ x1 + x2, data = dat, psi = psi.huber)
fit_bisq <- rlm(y ~ x1 + x2, data = dat, psi = psi.bisquare)
# Quantile / median regression (quantreg package)
library(quantreg)
fit_med <- rq(y ~ x1 + x2, tau = 0.5, data = dat)
# Box-Cox transformation (MASS)
bc <- boxcox(fit) # plots profile log-likelihood for lambda
bc$x[which.max(bc$y)] # MLE of lambdaimport statsmodels.api as sm
model = sm.OLS(y, X).fit()
infl = model.get_influence()
leverage = infl.hat_matrix_diag # h_ii
std_resid = infl.resid_studentized_internal # r_i
ext_resid = infl.resid_studentized_external # t_i
cooks_d = infl.cooks_distance[0] # D_i
dffits = infl.dffits[0]
dfbetas = infl.dfbetas
# Robust regression (M-estimator with Huber weights)
rlm_model = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()Conceptual: Construct a hypothetical scatterplot with \(n = 15\) observations. Sketch a point that is (a) an outlier but not influential; (b) high leverage but not influential; (c) both high leverage and influential. Explain why each configuration arises.
Computation: For a simple linear regression with \(n = 20\) observations, the fitted model is \(\hat{y} = 3.2 + 1.8x\). Observation 7 has \(x_7 = 15\) (\(\bar{x} = 6\), \(SS_x = 200\)), \(e_7 = 4.1\), and \(\text{MSE} = 2.25\). Compute \(h_{77}\), the internally studentized residual \(r_7\), and Cook’s \(D_7\).
Interpretation: A Cook’s \(D\) plot reveals that observation 3 has \(D_3 = 1.42\). Describe the steps you would take before deciding how to handle this observation. Under what circumstances would deletion be justified? Under what circumstances would robust regression be preferred?
Robust vs. OLS: Explain in your own words why the Huber M-estimator provides a smooth transition between OLS (for small residuals) and LAD (for large residuals). What are the practical advantages over simply deleting outliers?
Transformations: A Box-Cox analysis on a regression of housing prices on square footage yields \(\hat{\lambda} = 0.08\) with a 95% confidence interval of \([-0.05,\, 0.20]\). What transformation would you apply, and why? How does this affect the interpretation of coefficients?
| Term | Notation | Definition |
|---|---|---|
| Leverage | \(h_{ii}\) | Diagonal of \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\); measures extremity in predictor space |
| Hat matrix | \(\mathbf{H}\) | Projection matrix mapping \(\mathbf{y}\) to \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\) |
| Internally studentized residual | \(r_i\) | \(e_i / [s\sqrt{1-h_{ii}}]\) |
| Externally studentized residual | \(t_i\) | \(e_i / [s_{(i)}\sqrt{1-h_{ii}}]\); follows \(t_{n-p-1}\) |
| Cook’s distance | \(D_i\) | Aggregate change in all fitted values upon deletion of \(i\) |
| DFFITS | — | Change in \(\hat{y}_i\) upon deletion of \(i\), scaled by \(s_{(i)}\sqrt{h_{ii}}\) |
| DFBETAS | — | Change in \(\hat{b}_j\) upon deletion of \(i\), scaled by coefficient SE |
| Breakdown point | — | Maximum fraction of contaminated observations an estimator can tolerate; OLS: 0%, LTS: $$50% |
| IRLS | — | Iteratively Reweighted Least Squares; algorithm for computing M-estimators |
| Masking | — | One outlier hides another by pulling the fit toward it |