Cook’s distance or Cook’s D is a commonly used estimate of the influence of a data point when performing least squares regression analysis.
Cook’s distance is useful for identifying outliers in the X values (observations for predictor variables). It also shows the influence of each observation on the fitted response values.
Cook’s distance is a case deletion diagnostic: if predictions are the same with or without the observation in question, the observation has no influence on the regression model; if the predictions differ greatly when the observation is excluded from the analysis, the observation is influential.
Cook’s distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
Cook’s distance is the scaled change in fitted values: each element of the diagnostic measures the normalized change in the vector of coefficients caused by deleting a single observation.
In a practical ordinary least squares analysis, Cook’s distance can be used in several ways: to flag data points that are particularly worth checking for validity, and to indicate regions of the design space where it would be valuable to obtain more data points.
It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Points with a large Cook’s distance are considered to merit closer examination in the analysis.
Influential cases are not usually a problem when their removal from the dataset would leave the parameter estimates essentially unchanged: the ones we worry about are those whose presence really does change the results.
Cook’s D is a good measure of the influence of an observation and is proportional to the sum of the squared differences between predictions made with all observations in the analysis and predictions made leaving out the observation in question.
It is calculated as: \[D_i = \frac{ \sum_{j=1}^n (\hat Y_j\ - \hat Y_{j(i)})^2 }{p \ \mathrm{MSE}}, \]
where \(\hat Y_j\) is the \(j\)-th fitted value from the full model, \(\hat Y_{j(i)}\) is the \(j\)-th fitted value from the model refitted without observation \(i\), \(p\) is the number of estimated parameters, and \(\mathrm{MSE}\) is the mean squared error of the regression.
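As an illustration of this definition, the following sketch recomputes \(D_i\) in R by literally refitting the model without each observation in turn and comparing the result with the built-in cooks.distance(); it uses the same mtcars model that is fitted in the code further below, and the loop is purely illustrative.
# Illustrative only: Cook's distance computed directly from the definition above
fit <- lm(mpg ~ cyl + wt, data = mtcars)   # same model as in the R code below
p   <- length(coef(fit))                   # number of estimated parameters (3)
mse <- summary(fit)$sigma^2                # mean squared error of the full fit
D <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i  <- lm(mpg ~ cyl + wt, data = mtcars[-i, ])   # refit without observation i
  yhat_i <- predict(fit_i, newdata = mtcars)          # fitted values with obs. i left out
  sum((fitted(fit) - yhat_i)^2) / (p * mse)
})
all.equal(unname(D), unname(cooks.distance(fit)))     # TRUE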
For the case of simple linear regression, the following are algebraically equivalent expressions (they in fact hold for any ordinary least squares fit): \[D_i = \frac{e_i^2}{p \ \mathrm{MSE}}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right], \]
\[ D_i = \frac{ (\hat \beta - \hat {\beta}^{(-i)})^T(X^TX)(\hat \beta - \hat {\beta}^{(-i)}) } {(1+p)s^2}, \]
where \(e_i\) is the \(i\)-th residual, \(h_{ii}\) is the \(i\)-th leverage (the \(i\)-th diagonal element of the hat matrix), \(\hat \beta\) is the vector of estimated coefficients, \(\hat \beta^{(-i)}\) is the same vector estimated with observation \(i\) removed, \(X\) is the design matrix, \(s^2\) is the mean squared error, and \(p\) here denotes the number of explanatory variables, so that \(1+p\) equals the number of estimated parameters used above.
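Both forms can be checked numerically against cooks.distance() in R. The sketch below is illustrative only and again uses the mtcars model fitted in the next code block; note that length(coef(fit)) plays the role of \(1+p\) in the formula above.
# Illustrative check of the equivalent expressions (h = leverage, e = residual)
fit <- lm(mpg ~ cyl + wt, data = mtcars)
np  <- length(coef(fit))            # number of estimated parameters, i.e. 1 + p
mse <- summary(fit)$sigma^2
h   <- hatvalues(fit)
e   <- residuals(fit)
D_lev <- e^2 / (np * mse) * h / (1 - h)^2      # residual/leverage form
all.equal(D_lev, cooks.distance(fit))          # TRUE
# Coefficient-difference form for one observation, here i = 17 (Chrysler Imperial)
X  <- model.matrix(fit)
i  <- 17
db <- coef(fit) - coef(lm(mpg ~ cyl + wt, data = mtcars[-i, ]))
drop(t(db) %*% crossprod(X) %*% db) / (np * mse)   # matches D_lev[i]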
R code for computing Cook’s Distance
# Fit a linear model to mtcars and compute Cook's distance for every observation
fit <- lm(mpg ~ cyl + wt, data = mtcars)
cooks.distance(fit)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 0.0050772590 0.0004442585 0.0567764620 0.0018029260
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0.0235271472 0.0050205614 0.0178733213 0.0091033181
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 0.0065061176 0.0004643600 0.0075293380 0.0116847953
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 0.0102875723 0.0005228914 0.0035498738 0.0001501537
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0.3189363624 0.1592990291 0.0276449872 0.2233281268
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 0.0913548207 0.0040263378 0.0120218543 0.0165559199
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 0.0569730451 0.0001790454 0.0033281614 0.0216355209
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 0.0237336584 0.0105550987 0.0072685192 0.0727399065
# Observation with the largest Cook's distance
cooks.distance(fit)[which.max(cooks.distance(fit))]
## Chrysler Imperial
## 0.3189364
plot(fit, which = 4)   # built-in Cook's distance plot for lm objects
# Index plot of Cook's distance with a rule-of-thumb cutoff line
plot(cooks.distance(fit), type = "b", pch = 18, col = "red")
N <- 32                      # number of observations
k <- 2                       # number of explanatory variables
cutoff <- 4 / (N - k - 1)    # common 4/(N - k - 1) cutoff
abline(h = cutoff, lty = 2)
# Per-observation diagnostics (including .cooksd) via broom
library(broom)
augment(fit)
## # A tibble: 32 x 10
## .rownames mpg cyl wt .fitted .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 2.62 22.3 -1.28 0.0548 2.60 5.08e-3 -0.512
## 2 Mazda RX4 ~ 21 6 2.88 21.5 -0.465 0.0376 2.61 4.44e-4 -0.185
## 3 Datsun 710 22.8 4 2.32 26.3 -3.45 0.0798 2.52 5.68e-2 -1.40
## 4 Hornet 4 D~ 21.4 6 3.22 20.4 1.02 0.0321 2.61 1.80e-3 0.404
## 5 Hornet Spo~ 18.7 8 3.44 16.6 2.05 0.0912 2.58 2.35e-2 0.839
## 6 Valiant 18.1 6 3.46 19.6 -1.50 0.0407 2.60 5.02e-3 -0.596
## 7 Duster 360 14.3 8 3.57 16.2 -1.93 0.0801 2.59 1.79e-2 -0.785
## 8 Merc 240D 24.4 4 3.19 23.5 0.924 0.152 2.61 9.10e-3 0.391
## 9 Merc 230 22.8 4 3.15 23.6 -0.804 0.146 2.61 6.51e-3 -0.339
## 10 Merc 280 19.2 6 3.44 19.7 -0.463 0.0396 2.61 4.64e-4 -0.184
## # ... with 22 more rows
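As a possible follow-up (assuming the dplyr package is available), the augmented data frame makes it easy to list only the observations whose .cooksd exceeds the 4/(N-k-1) cutoff used in the plot above:
library(dplyr)
augment(fit) %>%
  filter(.cooksd > 4 / (nrow(mtcars) - 2 - 1)) %>%   # same cutoff as above
  select(.rownames, mpg, cyl, wt, .cooksd)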
Some texts tell you that points for which Cook’s distance is higher than 1 are to be considered influential.
Other texts give you a threshold of \(4/N\) or \(\frac{4}{N-k-1}\), where N is the number of observations and k the number of explanatory variables.
The R help file advises that an observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.
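For example, the two rules of thumb just mentioned could be applied to the model above along these lines (illustrative only):
cd <- cooks.distance(fit)
which(cd > 4 / (nrow(mtcars) - 2 - 1))   # 4/(N - k - 1) rule
which(cd > 3 * mean(cd))                 # three times the mean rule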
John Fox (mentioned above), in his booklet on regression diagnostics, is rather cautious when it comes to giving numerical thresholds. He advises the use of graphics and examining in closer detail the points with “values of D that are substantially larger than the rest”. According to Fox, thresholds should just be used to enhance graphical displays.
A common rule of thumb is that an observation with a value of Cook’s D over 1.0 has too much influence. As with all rules of thumb, this rule should be applied judiciously and not thoughtlessly.
Cook’s distance refers to how far, on average, predicted y-values will move if the observation in question is dropped from the data set.
DFBETA, by contrast, refers to how much a parameter estimate changes if the observation in question is dropped from the data set.
Cook’s distance is arguably more important if you are doing predictive modeling, whereas DFBETA is more important in explanatory modeling.
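In R, these per-coefficient measures are available alongside Cook’s distance; a brief illustrative look at the model fitted above:
head(dfbeta(fit))           # change in each coefficient when a row is dropped
head(dfbetas(fit))          # the same changes, standardized
head(cooks.distance(fit))   # one overall influence value per observation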
A related measure is DFFITS: although the raw values resulting from the equations are different, Cook’s distance and DFFITS are conceptually identical, and there is a closed-form formula to convert one value to the other.
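A sketch of that conversion for the model above, assuming the usual identity \(D_i = \mathrm{DFFITS}_i^2 \, s_{(i)}^2 / (p\, s^2)\), where \(p\) is the number of estimated parameters and \(s_{(i)}\) is the residual standard error with observation \(i\) removed:
# Illustrative conversion between DFFITS and Cook's distance
np   <- length(coef(fit))             # number of estimated parameters
s2   <- summary(fit)$sigma^2          # full-model mean squared error
s2_i <- lm.influence(fit)$sigma^2     # leave-one-out residual variances s_(i)^2
all.equal(dffits(fit)^2 * s2_i / (np * s2), cooks.distance(fit))   # TRUE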