Outliers, do we care?

In this, our fourth post, we will take another look at outliers in a dataset. We did this in post 2, where we discussed cross-validation as a technique to handle the presence of outliers when comparing different models. Here we will look at how to handle outliers at the moment of designing our model.

So let’s frame the situation: we are given a dataset which might contain outliers. For starters we need to determine what exactly an outlier is. If a point sits outside the cloud formed by the majority of the dataset’s points, is it an outlier, or does it carry information we should consider as part of the model? We also need to be aware of the influence of the data point in question. Depending on its leverage, a given point may exert its influence differently and affect our model to a greater or lesser degree. So what really constitutes an outlier?

There are two basic approaches to handling so-called outliers. The first is to identify them and delete them from the dataset. The second is to assign a weight to each data point used in the regression, so that each point’s contribution to the fit is scaled by its weight. We will look at both approaches and compare results. A variant of the first approach is to replace outliers with a given value such as the mean or median of the data. That variant will not be presented here, but it will become obvious from this discussion that the same issues we have with eliminating points also apply to replacing them with a given value.

We will use the same synthetic dataset we built for our cross-validation analysis in our second blog. To add to our analysis and discussion we will add a few more outlier points with high influence and different levels of leverage.
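
As a rough sketch of how such points could be appended (assuming df holds only the X2 and Y2 columns used in the models below; the coordinates are purely illustrative, not the values used in the original dataset):

# illustrative only: append a few hypothetical high-influence points to df
outlier_pts <- data.frame(
  X2 = c(40, 95, 150),       # assumed x positions giving low and high leverage
  Y2 = c(1900, 2500, 2800)   # assumed y values far above the main trend
)
df <- rbind(df, outlier_pts)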

All points welcomed

To start, we can simply model the data with a linear regression that includes all points in the dataset. We can use this model as a reference for our two approaches to handling outliers. Here we see a model that is highly affected by the influence and leverage of the outliers.
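
A minimal sketch of this baseline fit, assuming (as the summary below confirms) a data frame df with response Y2 and predictor X2; fit_all is just the name used in these sketches:

# baseline linear regression using every point, outliers included
fit_all <- lm(Y2 ~ X2, data = df)
summary(fit_all)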

## 
## Call:
## lm(formula = Y2 ~ X2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -498.37 -258.99 -136.39  -23.75 1956.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -137.70     173.55  -0.793    0.431    
## X2             12.27       1.95   6.293 8.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 552.3 on 48 degrees of freedom
## Multiple R-squared:  0.4521, Adjusted R-squared:  0.4407 
## F-statistic: 39.61 on 1 and 48 DF,  p-value: 8.92e-08

From the model summary we notice that the X2 predictor is in fact significant, as expected, as is the model overall. But we see straight away that the \(R^2\) is rather low, telling us the model explains only about 45% of the variance in our data. This too is expected, since we know the dataset contains several outlying points.

Outliers not invited

For our second set of models, let’s simply eliminate the outliers. But to do this simple delete, we first need to identify which points are outliers. Subject knowledge can often help at this step, but without it here, we need to look at the data itself. And this is why deleting outlier points is so dangerous to modeling: no technique that looks at the data in isolation will really guarantee that the selected points are truly outliers. But for the sake of our discussion, let’s look at some options.

We can start with a simple box plot of our data.

From the plot we see at least one outlier. With this approach, any point more than 1.5 times the interquartile range (IQR) beyond the quartiles is considered an outlier, and is plotted on the boxplot outside the horizontal bar representing this limit. So we proceed with eliminating points outside the defined 1.5 × IQR range. After doing this we look at the box plot once again and notice that no points lie outside the range anymore.
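
A sketch of this step under the 1.5 × IQR rule, applied here to the response Y2 (the trimmed data frame name df_trim is an assumption of this sketch, not necessarily how the trimmed dataset was built in the original post):

# boxplot flags points beyond 1.5 * IQR from the quartiles
boxplot(df$Y2, main = "Y2 with outliers")

# compute the 1.5 * IQR fences and keep only points inside them
q    <- quantile(df$Y2, c(0.25, 0.75))
iqr  <- q[2] - q[1]
keep <- df$Y2 >= q[1] - 1.5 * iqr & df$Y2 <= q[2] + 1.5 * iqr
df_trim <- df[keep, ]

boxplot(df_trim$Y2, main = "Y2 after the IQR trim")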

We can now plot our scatter plot again and assess which points have been eliminated.

After doing this we see that only the point with both high influence and high leverage has been eliminated. Other points that also have high influence, but do not exert it as strongly and have low leverage, are still part of the dataset. So when we build our model, these points are taken into account.

## 
## Call:
## lm(formula = Y2 ~ X3, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -294.32 -219.76 -163.56  -85.41 2004.96 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   84.252    188.843   0.446 0.657581    
## X3             8.885      2.313   3.842 0.000372 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 529.4 on 46 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2429, Adjusted R-squared:  0.2265 
## F-statistic: 14.76 on 1 and 46 DF,  p-value: 0.0003722

The resulting regression model now actually shows a lower \(R^2\). We have eliminated high-leverage points that we are not sure are outliers, and left points with high influence and low leverage in the dataset which might in truth be outliers.

We used the IQR to identify outliers; another technique uses standard deviations. Here we define outliers as points lying more than 2 or 3 standard deviations from the mean. Results with this technique are similar to the IQR approach, as both simply define an inclusion range using either mean-based or median-based statistics.
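
A comparable sketch with a 3-standard-deviation rule, again applied to Y2 (names assumed):

# flag points lying more than 3 standard deviations from the mean of Y2
m <- mean(df$Y2)
s <- sd(df$Y2)
sd_outlier <- abs(df$Y2 - m) > 3 * s
which(sd_outlier)   # indices of the points this rule would eliminate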

Weighted Regression

So what if we didn’t eliminate outliers, but kept them in the model and simply treated them differently from other points? To do this we define a weight for each point. These weights can be informed by a distance measure, such as Cook’s distance. With this approach, each point gets a measure of its “proximity” to the bulk of the data, such that outliers are assigned lower weights. With these weights in hand, each point’s contribution to the regression fit is scaled by its weight, so outlying points pull the model less.

We calculate and plot the Cook’s distance for each data point.
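
A sketch of that calculation, based on the baseline fit from the earlier sketch (fit_all); the 4/n cutoff drawn here is only a common rule of thumb, not a threshold used in the original post:

# Cook's distance for each observation of the baseline model
cd <- cooks.distance(fit_all)
plot(cd, type = "h", ylab = "Cook's distance", xlab = "observation")
abline(h = 4 / nrow(df), lty = 2)   # rule-of-thumb cutoff, for reference only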

On the plot we see that, as expected, the points we identified earlier as outliers do show high Cook’s distance. But we also see other points with different values of the distance. We can use these values to inform the weights in our regression. A common robust regression technique is the Huber method, which can easily be run in R. From it we obtain our model’s intercept and X2 coefficient.

library(MASS)                        # rlm() comes from the MASS package
rr.huber <- rlm(Y2 ~ X2, data = df)  # robust regression with Huber weighting
summary(rr.huber)
## 
## Call: rlm(formula = Y2 ~ X2, data = df)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198.08  -82.23  -27.01   80.05 2135.79 
## 
## Coefficients:
##             Value     Std. Error t value  
## (Intercept) -106.9175   46.0995    -2.3193
## X2             9.6391    0.5179    18.6123
## 
## Residual standard error: 123 on 48 degrees of freedom

As can be seen, these coefficients differ from those of our previous models. From this model in R we can also extract the weights used in the regression, which we can plot and inspect.
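
A sketch of extracting and plotting those weights (rlm stores the final IRLS weights in the w component of the fitted object):

# weights assigned by the Huber fit: values near 1 are ordinary points,
# low values mark observations the robust fit has down-weighted
plot(rr.huber$w, ylab = "Huber weight", xlab = "observation")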

As expected, we see many weights clustered around 1, the maximum weight a point can get in the regression. We also find our outlier points with low weights, still contributing to the model, just less. And we see other points with intermediate weights, points we did not account for in our earlier outlier-elimination approach.

What did we learn?

From this analysis we can see that identifying outliers is a tricky proposition. Surely some subject knowledge would help in selecting which points to keep and which to delete, but even then, points which might still be relevant to the model could be eliminated. Using a regression technique that includes all points in the dataset but accounts for their differences with weights is a much more straightforward approach. The resulting models are more robust and less subject to subjective judgment by the analyst.

A final plot with all the models in our discussion shows the differences, and how the way we handle outliers can change our final result.
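
A sketch of such a comparison, overlaying the fitted lines from the earlier sketches (fit_all, df_trim and rr.huber are the names assumed there; the trimmed fit is refit here for the overlay):

# scatter of all points with the competing fits overlaid
plot(df$X2, df$Y2, xlab = "X2", ylab = "Y2")
abline(fit_all, col = "red")               # all points included
abline(rr.huber, col = "blue")             # Huber weighted regression
fit_trim <- lm(Y2 ~ X2, data = df_trim)    # refit after the IQR trim
abline(fit_trim, col = "darkgreen")        # outliers eliminated
legend("topleft", legend = c("all points", "Huber weights", "IQR trimmed"),
       col = c("red", "blue", "darkgreen"), lty = 1)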

In hindsight

With this discussion in mind, one might be tempted to simply say: well, we could eliminate outliers by simple observation. A subject expert might be willing to stand behind the selection of which points to eliminate. So with this in mind, let’s build a final model eliminating all points that appear distant from all our models. With that we obtain the following result.

The result is basically the same model as the one where we used the IQR rule to eliminate outliers. It is not the final weighted model, and it requires manual intervention to delete the points. This kind of selection by the analyst can be very complicated and prone to error, and in multivariate cases, where the model has many predictors, it becomes very difficult or even impossible.