Removing Outliers with Cook’s Distance

Christian Thieme

5/14/2021


Understanding How to Use Cook’s Distance to Remove Outliers

There are many techniques for identifying and removing outliers from a dataset. One that is often used in regression settings is Cook’s Distance. Cook’s distance is an estimate of the influence of a data point: it takes into account both the leverage and the residual of each observation, and summarizes how much the fitted regression model changes when the \(i^{th}\) observation is removed.
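Formally, for a regression with \(p\) coefficients and residual variance estimate \(s^2\), Cook’s distance for the \(i^{th}\) observation combines its residual \(e_i\) and leverage \(h_{ii}\):

\[
D_i \;=\; \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,s^2} \;=\; \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{\left(1-h_{ii}\right)^2},
\]

where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is left out of the fit. Large residuals and high leverage both push \(D_i\) up, which is why it captures both ingredients of influence.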

When looking for observations that may be outliers, a general rule of thumb is to investigate any point whose Cook’s distance is more than 3x the mean Cook’s distance (note: several other cutoffs, such as \(D_i > 4/n\), are also in regular use). I’ll show an example of how this works with a well-known dataset called Hitters from the ISLR library. The Hitters dataset contains season and career statistics, along with salary, for 322 baseball players (263 once rows with missing salaries are removed).

First, we’ll load the packages we need and import the dataset, dropping any rows with missing values:

library(ISLR)   # Hitters dataset
library(dplyr)  # glimpse() and anti_join()
Hitters <- na.omit(Hitters)
glimpse(Hitters)
## Rows: 263
## Columns: 20
## $ AtBat     <int> 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 202, 41...
## $ Hits      <int> 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, 60, ...
## $ HmRun     <int> 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 20, 2, 8,...
## $ Runs      <int> 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, 29, ...
## $ RBI       <int> 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27, 75...
## $ Walks     <int> 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30, 73,...
## $ Years     <int> 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 15, 5, 8,...
## $ CAtBat    <int> 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 4631, 1...
## $ CHits     <int> 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 467, ...
## $ CHmRun    <int> 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, 177...
## $ CRuns     <int> 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 205, 3...
## $ CRBI      <int> 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204, 10...
## $ CWalks    <int> 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203, 20...
## $ League    <fct> N, A, N, N, A, N, A, N, A, A, N, N, A, N, N, A, N, N, A, ...
## $ Division  <fct> W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, E, W, ...
## $ PutOuts   <int> 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, 211, ...
## $ Assists   <int> 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 151, ...
## $ Errors    <int> 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 10, 16, 2...
## $ Salary    <dbl> 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 100.0...
## $ NewLeague <fct> N, A, N, N, A, A, A, N, A, A, N, N, A, N, N, A, N, N, N, ...

Next, we’ll fit a multiple linear regression model using all the available features, with the goal of predicting a player’s salary.

model <- lm(Salary ~ ., data = Hitters)
summary(model)
## 
## Call:
## lm(formula = Salary ~ ., data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907.62 -178.35  -31.11  139.09 1877.04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  163.10359   90.77854   1.797 0.073622 .  
## AtBat         -1.97987    0.63398  -3.123 0.002008 ** 
## Hits           7.50077    2.37753   3.155 0.001808 ** 
## HmRun          4.33088    6.20145   0.698 0.485616    
## Runs          -2.37621    2.98076  -0.797 0.426122    
## RBI           -1.04496    2.60088  -0.402 0.688204    
## Walks          6.23129    1.82850   3.408 0.000766 ***
## Years         -3.48905   12.41219  -0.281 0.778874    
## CAtBat        -0.17134    0.13524  -1.267 0.206380    
## CHits          0.13399    0.67455   0.199 0.842713    
## CHmRun        -0.17286    1.61724  -0.107 0.914967    
## CRuns          1.45430    0.75046   1.938 0.053795 .  
## CRBI           0.80771    0.69262   1.166 0.244691    
## CWalks        -0.81157    0.32808  -2.474 0.014057 *  
## LeagueN       62.59942   79.26140   0.790 0.430424    
## DivisionW   -116.84925   40.36695  -2.895 0.004141 ** 
## PutOuts        0.28189    0.07744   3.640 0.000333 ***
## Assists        0.37107    0.22120   1.678 0.094723 .  
## Errors        -3.36076    4.39163  -0.765 0.444857    
## NewLeagueN   -24.76233   79.00263  -0.313 0.754218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 315.6 on 243 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5106 
## F-statistic: 15.39 on 19 and 243 DF,  p-value: < 2.2e-16

We see that our baseline model has an Adjusted \(R^2\) of 0.5106. Now, let’s look at the diagnostic plots:

par(mfrow = c(2, 2))
plot(model)

Looking at the diagnostic plots, we see that there are indeed some outliers (among other issues, such as heteroscedasticity). If you look at the plot on the bottom right, Residuals vs Leverage, you’ll see that some of the outliers have significant leverage as well. Let’s say, for the sake of example, that we wanted to remove these outliers from our dataset so that we could have a better-fitting model. How could we do that? We can run the cooks.distance function on the model we just fit and then filter out any observations whose value is greater than 3x the mean. Let’s first take a look at how many observations meet this criterion:

# Cook's distance for each observation in the fitted model
cooksD <- cooks.distance(model)
# Keep only the observations whose Cook's distance exceeds 3x the mean
influential <- cooksD[(cooksD > (3 * mean(cooksD, na.rm = TRUE)))]
names_of_influential <- names(influential)
influential
##     -Andre Dawson     -Bill Buckner    -Darrell Evans    -Don Mattingly 
##        0.02470675        0.04312019        0.04419006        0.02758502 
##      -Dale Murphy     -Eddie Murray      -Gary Carter    -Graig Nettles 
##        0.03663747        0.05908460        0.04216671        0.03421525 
##         -Jim Rice     -Jim Sundberg     -Mike Schmidt      -Ozzie Smith 
##        0.04803117        0.04271726        0.15287842        0.12734079 
##        -Pete Rose -Rickey Henderson   -Reggie Jackson   -Rafael Ramirez 
##        0.35918447        0.05635707        0.20012923        0.02313200 
##        -Steve Sax    -Terry Kennedy 
##        0.03654366        0.03188851

We see that 18 players have a Cook’s distance greater than 3x the mean. Let’s exclude these players and rerun the model to see if we have a better fit.

# Influential rows (the row names carry the player names)
df <- Hitters[names_of_influential, ]
# Drop those rows from the full dataset (anti_join matches on all shared columns)
hitters_without_outliers <- Hitters %>% anti_join(df)

model2 <- lm(Salary ~ ., data = hitters_without_outliers)
summary(model2)
## 
## Call:
## lm(formula = Salary ~ ., data = hitters_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.27 -137.97   -4.58  112.51  575.28 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.78811   67.33285  -0.041 0.967007    
## AtBat        -1.08775    0.49317  -2.206 0.028422 *  
## Hits          6.25552    1.93935   3.226 0.001444 ** 
## HmRun        -0.64803    4.72412  -0.137 0.891016    
## Runs         -3.61168    2.19010  -1.649 0.100524    
## RBI          -0.16825    1.98430  -0.085 0.932504    
## Walks         5.33193    1.36745   3.899 0.000127 ***
## Years         5.57046    9.03971   0.616 0.538372    
## CAtBat       -0.19421    0.10849  -1.790 0.074773 .  
## CHits         0.65510    0.54946   1.192 0.234417    
## CHmRun        2.52817    1.31093   1.929 0.055048 .  
## CRuns         0.93499    0.56677   1.650 0.100406    
## CRBI         -0.38499    0.55105  -0.699 0.485498    
## CWalks       -0.49660    0.26829  -1.851 0.065480 .  
## LeagueN      59.98186   58.52210   1.025 0.306490    
## DivisionW   -71.47544   29.09752  -2.456 0.014790 *  
## PutOuts       0.19302    0.05954   3.242 0.001368 ** 
## Assists       0.15762    0.15848   0.995 0.320999    
## Errors       -2.73880    3.15543  -0.868 0.386339    
## NewLeagueN  -30.98821   58.40277  -0.531 0.596223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 225 degrees of freedom
## Multiple R-squared:  0.6721, Adjusted R-squared:  0.6445 
## F-statistic: 24.28 on 19 and 225 DF,  p-value: < 2.2e-16

Our model fit has improved substantially: the Adjusted \(R^2\) rose from 0.5106 to 0.6445 with the removal of only 18 observations. This demonstrates how influential outliers can be in regression models. Let’s look at the diagnostic plots for our new model.
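If you prefer to pull these figures programmatically rather than reading them off the printed summaries, the adjusted \(R^2\) lives on the summary object and the residual standard error can be retrieved with sigma() (a quick aside, not part of the original write-up):

# Adjusted R-squared for each fit
summary(model)$adj.r.squared
summary(model2)$adj.r.squared
# Residual standard error for each fit
sigma(model)
sigma(model2)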

par(mfrow = c(2, 2))
plot(model2)

Compared with our previous diagnostic plots, these are greatly improved. Looking again at the Residuals vs Leverage plot, we see that no points with significant leverage remain, which contributes to the better fit of our model.

The example above was for demonstration purposes only. One should never just remove outliers without doing a deep and thorough analysis of the points in question. Among other things, doing so might lead to a good fit on the training data, but poor predictions on unseen data.
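If you do want to gauge that risk, one simple check is to compare performance on data the model never saw. The sketch below is my own illustration rather than part of the original analysis (the 80/20 split, the seed, and the rmse helper are arbitrary choices), but it shows how you might compare test-set error for a model fit on all training rows against one fit after trimming high Cook’s distance rows:

set.seed(42)  # arbitrary seed for reproducibility
train_idx <- sample(seq_len(nrow(Hitters)), size = floor(0.8 * nrow(Hitters)))
train <- Hitters[train_idx, ]
test  <- Hitters[-train_idx, ]

# Fit on all training rows, then refit after trimming high Cook's distance rows
fit_all     <- lm(Salary ~ ., data = train)
cd          <- cooks.distance(fit_all)
keep        <- cd <= 3 * mean(cd, na.rm = TRUE)
fit_trimmed <- lm(Salary ~ ., data = train[keep, ])

# Compare out-of-sample RMSE; if trimming only helps in-sample, the gap shows up here
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$Salary, predict(fit_all, newdata = test))
rmse(test$Salary, predict(fit_trimmed, newdata = test))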

Cook’s Distance is an excellent tool to add to your regression analysis toolbox! You now have a meaningful way to investigate outliers in your models. Happy modeling!