There are many techniques for removing outliers from a dataset. One method that is often used in regression settings is Cook’s Distance. Cook’s distance is an estimate of the influence of a data point: it takes into account both the leverage and the residual of each observation, and it summarizes how much the fitted regression model changes when the \(i^{th}\) observation is removed.
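For reference, Cook’s distance for the \(i^{th}\) observation combines exactly those two pieces, the residual \(e_i\) and the leverage \(h_{ii}\):

\[ D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]

where \(p\) is the number of estimated coefficients and \(\hat{\sigma}^2\) is the model’s mean squared error, so a large residual, high leverage, or both will push \(D_i\) up. This is the quantity that R’s cooks.distance() function returns for each observation.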
When looking to see which observations may be outliers, a general rule of thumb is to investigate any point whose Cook’s distance is more than 3 times the mean Cook’s distance (note: several other cutoffs are in regular use as well). I’ll show an example of how this works with a well-known dataset called Hitters from the ISLR library. The Hitters dataset contains information on more than 250 baseball players, along with their career statistics and salaries.
First, we’ll load the libraries we need, import the dataset, and drop the rows with missing values:
library(ISLR)   # provides the Hitters dataset
library(dplyr)  # provides glimpse(), %>%, and anti_join()
Hitters <- na.omit(Hitters)
glimpse(Hitters)
## Rows: 263
## Columns: 20
## $ AtBat <int> 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 202, 41...
## $ Hits <int> 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, 60, ...
## $ HmRun <int> 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 20, 2, 8,...
## $ Runs <int> 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, 29, ...
## $ RBI <int> 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27, 75...
## $ Walks <int> 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30, 73,...
## $ Years <int> 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 15, 5, 8,...
## $ CAtBat <int> 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 4631, 1...
## $ CHits <int> 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 467, ...
## $ CHmRun <int> 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, 177...
## $ CRuns <int> 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 205, 3...
## $ CRBI <int> 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204, 10...
## $ CWalks <int> 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203, 20...
## $ League <fct> N, A, N, N, A, N, A, N, A, A, N, N, A, N, N, A, N, N, A, ...
## $ Division <fct> W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, E, W, ...
## $ PutOuts <int> 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, 211, ...
## $ Assists <int> 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 151, ...
## $ Errors <int> 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 10, 16, 2...
## $ Salary <dbl> 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 100.0...
## $ NewLeague <fct> N, A, N, N, A, A, A, N, A, A, N, N, A, N, N, A, N, N, N, ...
Next, we’ll fit a multiple linear regression model using all of the available features, with the goal of predicting a player’s salary.
model <- lm(Salary ~ ., data = Hitters)
summary(model)
##
## Call:
## lm(formula = Salary ~ ., data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.62 -178.35 -31.11 139.09 1877.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.10359 90.77854 1.797 0.073622 .
## AtBat -1.97987 0.63398 -3.123 0.002008 **
## Hits 7.50077 2.37753 3.155 0.001808 **
## HmRun 4.33088 6.20145 0.698 0.485616
## Runs -2.37621 2.98076 -0.797 0.426122
## RBI -1.04496 2.60088 -0.402 0.688204
## Walks 6.23129 1.82850 3.408 0.000766 ***
## Years -3.48905 12.41219 -0.281 0.778874
## CAtBat -0.17134 0.13524 -1.267 0.206380
## CHits 0.13399 0.67455 0.199 0.842713
## CHmRun -0.17286 1.61724 -0.107 0.914967
## CRuns 1.45430 0.75046 1.938 0.053795 .
## CRBI 0.80771 0.69262 1.166 0.244691
## CWalks -0.81157 0.32808 -2.474 0.014057 *
## LeagueN 62.59942 79.26140 0.790 0.430424
## DivisionW -116.84925 40.36695 -2.895 0.004141 **
## PutOuts 0.28189 0.07744 3.640 0.000333 ***
## Assists 0.37107 0.22120 1.678 0.094723 .
## Errors -3.36076 4.39163 -0.765 0.444857
## NewLeagueN -24.76233 79.00263 -0.313 0.754218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 315.6 on 243 degrees of freedom
## Multiple R-squared: 0.5461, Adjusted R-squared: 0.5106
## F-statistic: 15.39 on 19 and 243 DF, p-value: < 2.2e-16
We see that our baseline model has an Adjusted \(R^2\) of 0.5106. Now, let’s look at the diagnostic plots:
par(mfrow = c(2, 2))
plot(model)
Looking at the diagnostic plots, we see that there are indeed some outliers (among other issues, such as heteroscedasticity). In the bottom-right plot, Residuals vs Leverage, you’ll see that some of the outliers carry significant leverage as well. Let’s say, for the sake of example, that we want to remove these outliers from our dataset so that we can fit a better model. How could we do that? We can apply the cooks.distance() function to the model we just ran and then flag any observation whose value is greater than 3 times the mean. Let’s first take a look at how many observations meet this criterion:
cooksD <- cooks.distance(model)
influential <- cooksD[(cooksD > (3 * mean(cooksD, na.rm = TRUE)))]
names_of_influential <- names(influential)
influential
## -Andre Dawson -Bill Buckner -Darrell Evans -Don Mattingly
## 0.02470675 0.04312019 0.04419006 0.02758502
## -Dale Murphy -Eddie Murray -Gary Carter -Graig Nettles
## 0.03663747 0.05908460 0.04216671 0.03421525
## -Jim Rice -Jim Sundberg -Mike Schmidt -Ozzie Smith
## 0.04803117 0.04271726 0.15287842 0.12734079
## -Pete Rose -Rickey Henderson -Reggie Jackson -Rafael Ramirez
## 0.35918447 0.05635707 0.20012923 0.02313200
## -Steve Sax -Terry Kennedy
## 0.03654366 0.03188851
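A quick way to put these values in context is to plot every observation’s Cook’s distance and mark the 3x-mean cutoff; anything above the dashed line is a candidate for a closer look:
# Plot each observation's Cook's distance and mark the 3x-mean cutoff
plot(cooksD, pch = 20, xlab = "Observation", ylab = "Cook's distance")
abline(h = 3 * mean(cooksD, na.rm = TRUE), col = "red", lty = 2)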
We see that 18 players have a Cook’s distance greater than 3x the mean. Let’s exclude these players and rerun the model to see if we have a better fit.
df <- Hitters[names_of_influential, ]
hitters_without_outliers <- Hitters %>% anti_join(df)
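(A more direct, base-R alternative is to drop the influential rows by name, since they were identified by player name; this should produce the same result as the anti_join above.)
# Alternative: drop the influential observations directly by row name
hitters_without_outliers <- Hitters[!(rownames(Hitters) %in% names_of_influential), ]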
model2 <- lm(Salary ~ ., data = hitters_without_outliers)
summary(model2)
##
## Call:
## lm(formula = Salary ~ ., data = hitters_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.27 -137.97 -4.58 112.51 575.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.78811 67.33285 -0.041 0.967007
## AtBat -1.08775 0.49317 -2.206 0.028422 *
## Hits 6.25552 1.93935 3.226 0.001444 **
## HmRun -0.64803 4.72412 -0.137 0.891016
## Runs -3.61168 2.19010 -1.649 0.100524
## RBI -0.16825 1.98430 -0.085 0.932504
## Walks 5.33193 1.36745 3.899 0.000127 ***
## Years 5.57046 9.03971 0.616 0.538372
## CAtBat -0.19421 0.10849 -1.790 0.074773 .
## CHits 0.65510 0.54946 1.192 0.234417
## CHmRun 2.52817 1.31093 1.929 0.055048 .
## CRuns 0.93499 0.56677 1.650 0.100406
## CRBI -0.38499 0.55105 -0.699 0.485498
## CWalks -0.49660 0.26829 -1.851 0.065480 .
## LeagueN 59.98186 58.52210 1.025 0.306490
## DivisionW -71.47544 29.09752 -2.456 0.014790 *
## PutOuts 0.19302 0.05954 3.242 0.001368 **
## Assists 0.15762 0.15848 0.995 0.320999
## Errors -2.73880 3.15543 -0.868 0.386339
## NewLeagueN -30.98821 58.40277 -0.531 0.596223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 225 degrees of freedom
## Multiple R-squared: 0.6721, Adjusted R-squared: 0.6445
## F-statistic: 24.28 on 19 and 225 DF, p-value: < 2.2e-16
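Rather than reading the fit statistics off the printed summaries, you can also pull them out programmatically, since summary() on an lm object stores the adjusted \(R^2\) directly:
# Compare adjusted R-squared before and after removing the influential points
summary(model)$adj.r.squared
summary(model2)$adj.r.squared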
Our model fit has improved substantially: the Adjusted \(R^2\) rose from 0.5106 to 0.6445 with the removal of only 18 observations. This demonstrates just how influential outliers can be in regression models. Let’s look at the diagnostic plots for our new model.
par(mfrow = c(2, 2))
plot(model2)
Compared with our previous diagnostic plots, these are greatly improved. Looking again at the Residuals vs Leverage plot, we see that no remaining points have significant leverage, which translates into a better fit for our model.
The example above was for demonstration purposes only. You should never simply remove outliers without a deep and thorough analysis of the points in question; among other things, doing so might yield a good fit on the training data but poor predictions on unseen data.
Cook’s Distance is an excellent tool to add to your regression analysis toolbox! You now have a meaningful way to investigate outliers in your models. Happy modeling!