Cook’s distance or Cook’s D is a commonly used estimate of the influence of a data point when performing least squares regression analysis.
Cook’s distance is useful for identifying outliers in the X values (observations for predictor variables). It also shows the influence of each observation on the fitted response values.
Cook’s distance is a case deletion diagnostic: if predictions are the same with or without the observation in question, the observation has no influence on the regression model; if the predictions differ greatly when the observation is excluded from the analysis, the observation is influential.
Cook’s distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
Cook’s distance is the scaled change in fitted values: each element of the diagnostic measures the normalized change in the vector of coefficients caused by deleting a single observation.
In a practical ordinary least squares analysis, Cook’s distance can be used in several ways: to flag data points that are particularly worth checking for validity, and to indicate regions of the design space where it would be valuable to obtain more data points.
It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Points with a large Cook’s distance are considered to merit closer examination in the analysis.
Influential cases are not usually a problem when their removal from the dataset would leave the parameter estimates essentially unchanged: the ones we worry about are those whose presence really does change the results.
Cook’s D is a good measure of the influence of an observation and is proportional to the sum of the squared differences between predictions made with all observations in the analysis and predictions made leaving out the observation in question.
It is calculated as: \[D_i = \frac{ \sum_{j=1}^n (\hat Y_j\ - \hat Y_{j(i)})^2 }{p \ \mathrm{MSE}}, \]
where \(\hat Y_j\) is the \(j\)-th fitted value from the full model, \(\hat Y_{j(i)}\) is the \(j\)-th fitted value from the model refitted without observation \(i\), \(p\) is the number of estimated parameters, and \(\mathrm{MSE}\) is the mean squared error of the regression.
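As an illustration of this definition, the following sketch recomputes \(D_i\) in R by literally refitting the model without each observation in turn and comparing the result with the built-in cooks.distance(); it uses the same mtcars model that is fitted in the code further below, and the loop is purely illustrative.
# Illustrative only: Cook's distance computed directly from the definition above
fit <- lm(mpg ~ cyl + wt, data = mtcars)   # same model as in the R code below
p   <- length(coef(fit))                   # number of estimated parameters (3)
mse <- summary(fit)$sigma^2                # mean squared error of the full fit
D <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i  <- lm(mpg ~ cyl + wt, data = mtcars[-i, ])   # refit without observation i
  yhat_i <- predict(fit_i, newdata = mtcars)          # fitted values with obs. i left out
  sum((fitted(fit) - yhat_i)^2) / (p * mse)
})
all.equal(unname(D), unname(cooks.distance(fit)))     # TRUE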
For the case of simple linear regression, the following are algebraically equivalent expressions (they in fact hold for any ordinary least squares fit): \[D_i = \frac{e_i^2}{p \ \mathrm{MSE}}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right], \]
\[ D_i = \frac{ (\hat \beta - \hat {\beta}^{(-i)})^T(X^TX)(\hat \beta - \hat {\beta}^{(-i)}) } {(1+p)s^2}, \]
where \(e_i\) is the \(i\)-th residual, \(h_{ii}\) is the \(i\)-th leverage (the \(i\)-th diagonal element of the hat matrix), \(\hat \beta\) is the vector of estimated coefficients, \(\hat \beta^{(-i)}\) is the same vector estimated with observation \(i\) removed, \(X\) is the design matrix, \(s^2\) is the mean squared error, and \(p\) here denotes the number of explanatory variables, so that \(1+p\) equals the number of estimated parameters used above.
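Both forms can be checked numerically against cooks.distance() in R. The sketch below is illustrative only and again uses the mtcars model fitted in the next code block; note that length(coef(fit)) plays the role of \(1+p\) in the formula above.
# Illustrative check of the equivalent expressions (h = leverage, e = residual)
fit <- lm(mpg ~ cyl + wt, data = mtcars)
np  <- length(coef(fit))            # number of estimated parameters, i.e. 1 + p
mse <- summary(fit)$sigma^2
h   <- hatvalues(fit)
e   <- residuals(fit)
D_lev <- e^2 / (np * mse) * h / (1 - h)^2      # residual/leverage form
all.equal(D_lev, cooks.distance(fit))          # TRUE
# Coefficient-difference form for one observation, here i = 17 (Chrysler Imperial)
X  <- model.matrix(fit)
i  <- 17
db <- coef(fit) - coef(lm(mpg ~ cyl + wt, data = mtcars[-i, ]))
drop(t(db) %*% crossprod(X) %*% db) / (np * mse)   # matches D_lev[i]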
R code for computing Cook’s Distance
# Fit a linear model to mtcars and compute Cook's distance for every observation
fit <- lm(mpg ~ cyl + wt, data = mtcars)
cooks.distance(fit)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 0.0050772590 0.0004442585 0.0567764620 0.0018029260
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0.0235271472 0.0050205614 0.0178733213 0.0091033181
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 0.0065061176 0.0004643600 0.0075293380 0.0116847953
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 0.0102875723 0.0005228914 0.0035498738 0.0001501537
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0.3189363624 0.1592990291 0.0276449872 0.2233281268
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 0.0913548207 0.0040263378 0.0120218543 0.0165559199
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 0.0569730451 0.0001790454 0.0033281614 0.0216355209
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 0.0237336584 0.0105550987 0.0072685192 0.0727399065
# Observation with the largest Cook's distance
cooks.distance(fit)[which.max(cooks.distance(fit))]
## Chrysler Imperial
## 0.3189364
plot(fit, which = 4)   # built-in Cook's distance plot for lm objects
# Index plot of Cook's distance with a rule-of-thumb cutoff line
plot(cooks.distance(fit), type = "b", pch = 18, col = "red")
N <- 32                      # number of observations
k <- 2                       # number of explanatory variables
cutoff <- 4 / (N - k - 1)    # common 4/(N - k - 1) cutoff
abline(h = cutoff, lty = 2)
# Per-observation diagnostics (including .cooksd) via broom
library(broom)
augment(fit)
## # A tibble: 32 x 10
## .rownames mpg cyl wt .fitted .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 2.62 22.3 -1.28 0.0548 2.60 5.08e-3 -0.512
## 2 Mazda RX4 ~ 21 6 2.88 21.5 -0.465 0.0376 2.61 4.44e-4 -0.185
## 3 Datsun 710 22.8 4 2.32 26.3 -3.45 0.0798 2.52 5.68e-2 -1.40
## 4 Hornet 4 D~ 21.4 6 3.22 20.4 1.02 0.0321 2.61 1.80e-3 0.404
## 5 Hornet Spo~ 18.7 8 3.44 16.6 2.05 0.0912 2.58 2.35e-2 0.839
## 6 Valiant 18.1 6 3.46 19.6 -1.50 0.0407 2.60 5.02e-3 -0.596
## 7 Duster 360 14.3 8 3.57 16.2 -1.93 0.0801 2.59 1.79e-2 -0.785
## 8 Merc 240D 24.4 4 3.19 23.5 0.924 0.152 2.61 9.10e-3 0.391
## 9 Merc 230 22.8 4 3.15 23.6 -0.804 0.146 2.61 6.51e-3 -0.339
## 10 Merc 280 19.2 6 3.44 19.7 -0.463 0.0396 2.61 4.64e-4 -0.184
## # ... with 22 more rows
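As a possible follow-up (assuming the dplyr package is available), the augmented data frame makes it easy to list only the observations whose .cooksd exceeds the 4/(N-k-1) cutoff used in the plot above:
library(dplyr)
augment(fit) %>%
  filter(.cooksd > 4 / (nrow(mtcars) - 2 - 1)) %>%   # same cutoff as above
  select(.rownames, mpg, cyl, wt, .cooksd)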
Some texts tell you that points for which Cook’s distance is higher than 1 are to be considered influential.
Other texts give you a threshold of \(4/N\) or \(\frac{4}{N-k-1}\), where N is the number of observations and k the number of explanatory variables.
The R help file advises that an observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.
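For example, the two rules of thumb just mentioned could be applied to the model above along these lines (illustrative only):
cd <- cooks.distance(fit)
which(cd > 4 / (nrow(mtcars) - 2 - 1))   # 4/(N - k - 1) rule
which(cd > 3 * mean(cd))                 # three times the mean rule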
John Fox (mentioned above), in his booklet on regression diagnostics, is rather cautious when it comes to giving numerical thresholds. He advises the use of graphics and examining in closer detail the points with “values of D that are substantially larger than the rest”. According to Fox, thresholds should just be used to enhance graphical displays.
A common rule of thumb is that an observation with a value of Cook’s D over 1.0 has too much influence. As with all rules of thumb, this rule should be applied judiciously and not thoughtlessly.
Cook’s distance refers to how far, on average, predicted y-values will move if the observation in question is dropped from the data set.
DFBETA, by contrast, refers to how much a parameter estimate changes if the observation in question is dropped from the data set.
Cook’s distance is arguably more important if you are doing predictive modeling, whereas DFBETA is more important in explanatory modeling.
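In R, these per-coefficient measures are available alongside Cook’s distance; a brief illustrative look at the model fitted above:
head(dfbeta(fit))           # change in each coefficient when a row is dropped
head(dfbetas(fit))          # the same changes, standardized
head(cooks.distance(fit))   # one overall influence value per observation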
A related measure is DFFITS: although the raw values resulting from the equations are different, Cook’s distance and DFFITS are conceptually identical, and there is a closed-form formula to convert one value to the other.
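A sketch of that conversion for the model above, assuming the usual identity \(D_i = \mathrm{DFFITS}_i^2 \, s_{(i)}^2 / (p\, s^2)\), where \(p\) is the number of estimated parameters and \(s_{(i)}\) is the residual standard error with observation \(i\) removed:
# Illustrative conversion between DFFITS and Cook's distance
np   <- length(coef(fit))             # number of estimated parameters
s2   <- summary(fit)$sigma^2          # full-model mean squared error
s2_i <- lm.influence(fit)$sigma^2     # leave-one-out residual variances s_(i)^2
all.equal(dffits(fit)^2 * s2_i / (np * s2), cooks.distance(fit))   # TRUE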