A Strategy for Dealing with Outliers

Motivation:

Outliers (unusual data points) can strongly influence a regression analysis by changing the slope and/or y-intercept of the regression equation. A single point can change the analysis enough to produce a misleading conclusion. We must decide how to handle these outlying observations, and this lesson discusses a strategy for doing so.

What you need to know from this lesson: After completing this lesson, you should be able to:

decide if a data point is an outlier
explain when an outlier is influential
decide what strategy to use with outliers

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 2.6 in the text, which has some discussion about outliers (Regression Caution #3 on page 128).
3. Do the Lesson 15 questions at the end of the lesson notes.

The Lesson

The Long-Jump Example:

The long jump is a track and field event in which a competitor attempts to jump a maximum distance into a sand pit after a running start. At the edge of the sand pit is a takeoff board. Jumpers usually try to plant their toes at the front edge of this board to maximize jumping distance. The absolute distance between the front edge of the takeoff board and the spot where the toe actually lands on the board prior to jumping is called “takeoff error.” Is takeoff error in the long jump linearly related to best jumping distance? To answer this question, let’s use the data below collected on 18 novice long jumpers at a high school track meet. The average takeoff error and best jumping distance (out of 3 jumps) for each jumper are recorded.

### LONG JUMP DATA ###
best_jump<-c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22, 
             5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff<-c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
               .09, .13, .16, .03, .50, .13, .04, .07, .04)

1. What is the response variable? What is the explanatory variable?

Response: Best jump distance (meters)
Explanatory: Average takeoff error (meters)

2. Below is the scatterplot of best jump distance vs. average takeoff error. Describe the relationship between a jumper’s best jump distance and their average takeoff error. What do you notice about one of the data points?

### SCATTERPLOT ###
plot(avg_takeoff, best_jump,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Takeoff Error vs Jump Distance")

[Figure: scatterplot of best jump distance (meters) against average takeoff error (meters)]

3. Below is the Residual Plot. Do you notice the outlier?

### MODEL AND RESIDUAL PLOT ###
jump_mod<-lm(best_jump~avg_takeoff)

plot(avg_takeoff, resid(jump_mod),pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Residual", 
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot of residuals against average takeoff error (meters), with a dashed reference line at zero]

4. The scatterplot with the least-squares regression line drawn in and R output from the simple linear regression analysis are shown below.

### SUMMARY ###
summary(jump_mod)

## 
## Call:
## lm(formula = best_jump ~ avg_takeoff)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7116 -0.2464 -0.0604  0.1833  0.8561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.3480     0.1635  32.711 4.39e-16 ***
## avg_takeoff   0.5299     0.9254   0.573    0.575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4115 on 16 degrees of freedom
## Multiple R-squared:  0.02008,    Adjusted R-squared:  -0.04116 
## F-statistic: 0.3279 on 1 and 16 DF,  p-value: 0.5749

### FITTED LINE ###
plot(avg_takeoff, best_jump,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Takeoff Error vs Jump Distance")
abline(coefficients(jump_mod),lwd=2, lty=2, 
       col="red" )

[Figure: scatterplot of best jump distance against average takeoff error with the least-squares regression line]

a. What is the slope of the line?

0.5299

b. Explain what the slope means in terms of this problem.

For every one-meter increase in average takeoff error, the predicted best jump distance increases by 0.5299 meters.
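
To see this interpretation numerically, the minimal sketch below refits the model on the lesson's data and compares the predictions at two takeoff errors that are 0.10 meters apart. The values 0.10 and 0.20 are chosen only for illustration: the observed takeoff errors run from 0.03 to 0.50 meters, so a 0.10-meter step is easier to picture than a full meter.

```r
best_jump <- c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
               5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff <- c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                 .09, .13, .16, .03, .50, .13, .04, .07, .04)
jump_mod <- lm(best_jump ~ avg_takeoff)

# Predicted best jump at two illustrative takeoff errors (0.10 m and 0.20 m)
new_errors <- data.frame(avg_takeoff = c(0.10, 0.20))
predicted <- predict(jump_mod, newdata = new_errors)

# The change in prediction over a 0.10 m step equals slope * 0.10
pred_diff <- unname(predicted[2] - predicted[1])
slope <- unname(coef(jump_mod)["avg_takeoff"])
print(round(c(slope = slope, change_per_tenth_meter = pred_diff), 4))
```

The second number is one tenth of the slope, which is the same statement as the interpretation above on a smaller scale.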

c. Does this interpretation of the slope seem to be representative of the majority of the data points? (That is, does this regression line seem to “fit” the data well?)

No. Most of the points suggest a negative relationship, so the positive slope does not represent the majority of the data.

d. Calculate the correlation coefficient.

### CORRELATION ###
cor(avg_takeoff, best_jump)

## [1] 0.141707
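
In simple linear regression, the squared correlation equals the Multiple R-squared reported by summary(), and the correlation has the same sign as the slope. The short sketch below checks this on the lesson's data. It is an added illustration, not part of the original worksheet.

```r
best_jump <- c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
               5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff <- c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                 .09, .13, .16, .03, .50, .13, .04, .07, .04)
jump_mod <- lm(best_jump ~ avg_takeoff)

# Correlation between the two variables
r <- cor(avg_takeoff, best_jump)

# R-squared as reported by the fitted model
r_squared <- summary(jump_mod)$r.squared

print(c(r = r, r_squared_from_r = r^2, r_squared_from_model = r_squared))
```

Squaring 0.141707 gives about 0.0201, which matches the Multiple R-squared of 0.02008 in the regression output above. With the outlier included, takeoff error accounts for only about 2% of the variation in best jump distance.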

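5. When is an outlier “influential”?

When removing it would substantially change the regression line (for example, its slope). This tends to happen when the point has an unusual x-value and does not follow the pattern of the remaining points.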

6. Are all outliers influential? If not, draw a scatterplot with an outlier that is not influential.

No. If the outlier falls in line with the trend of the remaining data points, it will not be influential.
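
The idea in this answer can be sketched with made-up numbers. The data below are invented for illustration and are not from the long-jump study: ten ordinary points lie close to the line y = 2 + 0.5x, and one extra point has a very unusual x-value (30) but sits on that same line.

```r
# Made-up data: ten ordinary points scattered closely around y = 2 + 0.5 * x
x <- 1:10
wiggle <- c(0.2, -0.1, 0.1, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.0)
y <- 2 + 0.5 * x + wiggle

# One extra point with an unusual x-value that still follows the same trend
x_all <- c(x, 30)
y_all <- c(y, 2 + 0.5 * 30)

# Slope of the least-squares line without and with the extra point
slope_without <- unname(coef(lm(y ~ x))[2])
slope_with <- unname(coef(lm(y_all ~ x_all))[2])
print(round(c(without_outlier = slope_without, with_outlier = slope_with), 3))

# Scatterplot with the fitted line through all eleven points
plot(x_all, y_all, pch = 16, xlab = "x", ylab = "y",
     main = "An outlier that is not influential")
abline(lm(y_all ~ x_all), lwd = 2, lty = 2, col = "red")
```

The two slopes are nearly identical, so the far-right point is an outlier in x but not an influential one.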

7. Let’s see if the outlier in the Long Jump example is influential by removing it from the analysis.

a. The scatterplot and residual plot of the data without the outlier are given below. Describe the relationship between a jumper’s best jump distance and their average takeoff error without the outlier.

### REMOVE OUTLIER ###
avg_takeoff_no<-avg_takeoff[-14]
best_jump_no<-best_jump[-14]

mod_no<-lm(best_jump_no~avg_takeoff_no)

par(mfrow=c(1,2))
### SCATTERPLOT (OUTLIER REMOVED)###
plot(avg_takeoff_no, best_jump_no,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Takeoff Error vs Jump Distance (OUTLIER REMOVED)")
abline(coefficients(mod_no),lwd=2, lty=2, 
       col="red" )

### RESIDUAL ####
plot(avg_takeoff_no, resid(mod_no),pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Residual", 
     main="Residual Plot (OUTLIER REMOVED)")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: side-by-side scatterplot with fitted line and residual plot, both with the outlier removed]

Is your answer the same as in number 2? If not, how has your answer changed? Why? Attempt to draw the least-squares regression line without looking at the output on the next page.

b. Below is the output from a simple linear regression without the outlier. Write the equation of the least-squares regression line. How has the slope changed?

### SUMMARY ###
summary(mod_no)

## 
## Call:
## lm(formula = best_jump_no ~ avg_takeoff_no)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7009 -0.2435 -0.0394  0.1857  0.7533 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.4887     0.2246  24.440  1.7e-13 ***
## avg_takeoff_no  -0.7318     1.6583  -0.441    0.665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 15 degrees of freedom
## Multiple R-squared:  0.01282,    Adjusted R-squared:  -0.053 
## F-statistic: 0.1947 on 1 and 15 DF,  p-value: 0.6653

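Predicted best jump distance = 5.4887 - 0.7318 × (average takeoff error). The slope changed sign: it went from 0.5299 with the outlier to -0.7318 without it.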

c. Is the outlier influential? Explain.

Yes. Removing it reversed the sign of the slope, which completely changes our understanding of the direction of the relationship between takeoff error and jump distance.
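
The same check can be repeated for every jumper, not only the one that stands out. The sketch below is a leave-one-out comparison, assuming the lesson's data are placed in a data frame. The names jumps and slope_without are hypothetical and introduced here only for illustration. For each observation it refits the line without that point and records how much the slope moves.

```r
jumps <- data.frame(
  best_jump = c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
                5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42),
  avg_takeoff = c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                  .09, .13, .16, .03, .50, .13, .04, .07, .04)
)

# Slope using all 18 jumpers
full_slope <- unname(coef(lm(best_jump ~ avg_takeoff, data = jumps))["avg_takeoff"])

# Hypothetical helper: slope of the line refit without observation i
slope_without <- function(i) {
  fit <- lm(best_jump ~ avg_takeoff, data = jumps[-i, ])
  unname(coef(fit)["avg_takeoff"])
}

# Refit once per observation and measure the change in slope
loo_slopes <- sapply(seq_len(nrow(jumps)), slope_without)
slope_change <- loo_slopes - full_slope

# The observation whose removal moves the slope the most
most_influential <- which.max(abs(slope_change))
print(most_influential)
print(round(c(full = full_slope, without = loo_slopes[most_influential]), 4))
```

On these data the largest change comes from dropping observation 14 (takeoff error 0.50 meters), the same point removed above with avg_takeoff[-14].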

8. Do you think it is legitimate to remove the outlier from the analysis? Why or why not?

If there was a data entry error, then yes, it is legitimate to remove it. However, if no error was made, it is not legitimate to remove the data point just to fit our understanding.

9. Strategy for dealing with outliers

There are two points to this exercise:

a. Outliers are unusual data points. They can be unusual because they have large residuals (unusual response, or y, values) or because their explanatory (or x) values are much different from the rest. (A point can be unusual in both ways.) What makes an outlier a concern is whether it is influential. An influential observation is a point whose removal would substantially change the regression line (for example, change its slope substantially). An outlier may or may not be influential, but outliers with unusual values of the explanatory variable have more potential to be influential.
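
The two kinds of "unusual" described here can be screened for separately in R. The sketch below goes slightly beyond the lesson: it uses hatvalues(), which measures how far each x-value sits from the others (often called leverage), and rstandard(), which gives standardized residuals. The cutoffs are common rules of thumb assumed here for illustration, not rules stated in this lesson: leverage above 2 × (number of coefficients) / n, and a standardized residual beyond 2 in absolute value.

```r
best_jump <- c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
               5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff <- c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                 .09, .13, .16, .03, .50, .13, .04, .07, .04)
jump_mod <- lm(best_jump ~ avg_takeoff)

# Unusual x-values: leverage of each observation
leverage <- hatvalues(jump_mod)

# Unusual y-values: residuals scaled to be comparable across points
std_resid <- rstandard(jump_mod)

# Rule-of-thumb cutoffs (assumptions, not from the lesson)
leverage_cutoff <- 2 * length(coef(jump_mod)) / length(best_jump)
unusual_x <- unname(which(leverage > leverage_cutoff))
unusual_y <- unname(which(abs(std_resid) > 2))

print(unusual_x)
print(unusual_y)
```

On the long-jump data, observation 14 is flagged for its x-value, while its standardized residual stays below 2. The point pulls the line toward itself, which is why it does not stand out strongly in the residual plot.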

b. So, the question is what should we do with these outliers? This brings us to the second point of how to deal with outliers.

First, check whether there was a data entry error. If so, change the value to the correct value and proceed with the analysis.
Second, check for measurement error, that is, an error made in “measuring” the observation. It may be hard to prove that an actual measurement error was made, but an unusual observation may be an indication of one. If there is strong evidence that the outlier is a result of measurement error, you have a valid reason to remove it from the analysis.
Third, if no data entry error or measurement error was made, we might consider removing the outlier from the analysis. However, outliers should be removed from the analysis only if there is a NON-STATISTICAL reason to do so. That is, does the outlying point come from an individual that is not representative of the population? For example, can you think of any explanation of how this individual in the long-jump example would be from a different population than all the rest of the jumpers? If so, removal of the outlying data point is justified.
If not, you have no justification for removing the outlying data point, no matter how influential the observation is! In that case, run two separate analyses, one with the outlier and one with the outlier removed. Then report both analyses and let the reader decide which one should be used.
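
A minimal sketch of reporting both analyses side by side is shown below, using the lesson's data. The helper describe_fit and the table layout are hypothetical choices made for illustration, not a prescribed reporting format.

```r
best_jump <- c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
               5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff <- c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                 .09, .13, .16, .03, .50, .13, .04, .07, .04)

# Analysis 1: all jumpers. Analysis 2: observation 14 removed.
fit_all <- lm(best_jump ~ avg_takeoff)
fit_no <- lm(best_jump[-14] ~ avg_takeoff[-14])

# Hypothetical helper: pull the key numbers out of one fitted model
describe_fit <- function(fit) {
  est <- summary(fit)$coefficients
  data.frame(n = nobs(fit),
             intercept = est[1, "Estimate"],
             slope = est[2, "Estimate"],
             slope_p_value = est[2, "Pr(>|t|)"])
}

# One row per analysis so the reader can compare them directly
report <- rbind(describe_fit(fit_all), describe_fit(fit_no))
rownames(report) <- c("with_outlier", "without_outlier")
print(round(report, 4))
```

Presenting the two rows together lets the reader see that the sign of the slope depends on a single jumper, without the analyst having to decide which fit is "right."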