Cook’s Distance


Cook’s Distance Formula

Cook’s D measures the influence of a single observation on a fitted regression. It is proportional to the sum of the squared differences between the predictions made with all observations in the analysis and the predictions made with the observation in question left out.

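It is calculated as: \[D_i = \frac{ \sum_{j=1}^n (\hat Y_j - \hat Y_{j(i)})^2 }{p \ \mathrm{MSE}}, \]

where:

\(\hat Y_j\) is the prediction for observation \(j\) from the model fitted to all \(n\) observations;

\(\hat Y_{j(i)}\) is the prediction for observation \(j\) from the model refitted with observation \(i\) left out;

\(\mathrm{MSE}\) is the mean squared error of the full fit;

\(p\) is the number of fitted parameters in the model, including the intercept.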

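For ordinary least-squares regression (including the simple linear case), the following expressions are algebraically equivalent: \[D_i = \frac{e_i^2}{p \ \mathrm{MSE}}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right], \]

\[ D_i = \frac{ (\hat \beta - \hat {\beta}^{(-i)})^T(X^TX)(\hat \beta - \hat {\beta}^{(-i)}) } {p \ \mathrm{MSE}}, \]

where:

\(e_i\) is the residual of observation \(i\) from the full fit;

\(h_{ii}\) is the leverage of observation \(i\), that is, the \(i\)-th diagonal element of the hat matrix \(X(X^TX)^{-1}X^T\);

\(\hat \beta\) and \(\hat {\beta}^{(-i)}\) are the coefficient estimates with and without observation \(i\);

\(p\) and \(\mathrm{MSE}\) are the same quantities as in the first formula.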
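The equivalence of the three expressions can be checked numerically. The sketch below is only an illustration: it assumes a stand-in model `mpg ~ wt + hp` on the built-in `mtcars` data (32 observations, 2 predictors), which is not necessarily the model behind `Fit_4`. It takes \(p\) to be the number of fitted coefficients including the intercept and uses the same \(p \ \mathrm{MSE}\) denominator in every form.

```r
# Hypothetical stand-in model (an assumption, not the document's Fit_4)
form <- mpg ~ wt + hp
fit  <- lm(form, data = mtcars)

n   <- nobs(fit)
p   <- length(coef(fit))                 # fitted parameters, intercept included
e   <- residuals(fit)
h   <- hatvalues(fit)                    # leverages h_ii
mse <- sum(e^2) / (n - p)                # mean squared error of the full fit
X   <- model.matrix(fit)

# (1) Definition: refit without observation i, compare all n predictions
d_def <- sapply(seq_len(n), function(i) {
  fit_i <- lm(form, data = mtcars[-i, ])
  sum((fitted(fit) - predict(fit_i, newdata = mtcars))^2) / (p * mse)
})

# (2) Residual-and-leverage form: no refitting needed
d_lev <- e^2 / (p * mse) * h / (1 - h)^2

# (3) Coefficient-change form
d_beta <- sapply(seq_len(n), function(i) {
  delta <- coef(fit) - coef(lm(form, data = mtcars[-i, ]))
  drop(t(delta) %*% crossprod(X) %*% delta) / (p * mse)
})

# Each comparison should report TRUE (agreement up to numerical tolerance)
d_ref <- unname(cooks.distance(fit))
all.equal(unname(d_def),  d_ref)
all.equal(unname(d_lev),  d_ref)
all.equal(unname(d_beta), d_ref)
```

The leverage form is the practical one, since it needs only the residuals and hat values of the full fit rather than \(n\) separate refits.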

R code for computing Cook’s Distance

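# Cook's distance for every observation in the fitted model Fit_4
cooks.distance(Fit_4) |> round(3)

# Index plot of Cook's distance
plot(cooks.distance(Fit_4), type = "b", pch = 18, col = "red")

# Rule-of-thumb cutoff 4 / (N - k - 1), drawn as a dashed horizontal line
N <- 32   # number of observations
k <- 2    # number of predictors
cutoff <- 4 / (N - k - 1)
abline(h = cutoff, lty = 2)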



Interpreting Cook’s Distance

Interpretation

Cook’s Distance in relation to other measures



Detecting highly influential observations

There are different opinions regarding what cut-off values to use for spotting highly influential points. A simple operational guideline of \(D_i > 1\) has been suggested.[2] Others have indicated that \(D_i > 4/n\), where \(n\) is the number of observations, might be used.[3]

A conservative approach relies on the fact that Cook’s distance has the form \(W/p\), where \(W\) is formally identical to the Wald statistic that one uses for testing \(H_0: \beta_i = \beta_0\) using some \(\hat \beta_{[-i]}\). Recalling that \(W/p\) has an \(F_{p,n-p}\) distribution (with \(p\) and \(n-p\) degrees of freedom), we see that Cook’s distance is equivalent to the F statistic for testing this hypothesis, and we can thus use \(F_{p,n-p,1-\alpha}\) as a threshold.

Interpretation

Specifically, \(D_i\) can be interpreted as the distance one’s estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook’s distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis.
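The cut-offs discussed in this section can be compared side by side. The sketch below is a rough illustration under stated assumptions: the model `mpg ~ wt + hp` on `mtcars` is a hypothetical stand-in for the fitted model, and `alpha = 0.5` is an illustrative choice for the F-quantile threshold rather than a value given in the text.

```r
# Hypothetical stand-in model (an assumption, not the document's Fit_4)
fit <- lm(mpg ~ wt + hp, data = mtcars)

d <- cooks.distance(fit)
n <- nobs(fit)
p <- length(coef(fit))     # fitted parameters, intercept included
alpha <- 0.5               # illustrative choice, not taken from the text

# The three thresholds: D_i > 1, D_i > 4/n, and the F quantile
cutoffs <- c(
  one         = 1,
  four_over_n = 4 / n,
  f_quantile  = qf(1 - alpha, df1 = p, df2 = n - p)
)

# Names of the observations exceeding each threshold
flagged <- lapply(cutoffs, function(cut) names(d)[d > cut])
flagged
```

Because the thresholds differ considerably in size, they will generally flag different numbers of observations; flagged points are candidates for closer inspection rather than automatic removal.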