KITADA

Lesson #15

A Strategy for Dealing with Outliers

Motivation:

Outliers (unusual data points) can strongly influence a regression analysis by changing the slope and/or y-intercept of the regression equation. A single point can change the analysis enough to produce a conclusion that may not be accurate. We must decide how to handle these outlying observations. This lesson discusses a strategy for dealing with outliers.

What you need to know from this lesson: After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

The Long-Jump Example:

The long jump is a track and field event in which a competitor attempts to jump a maximum distance into a sand pit after a running start. At the edge of the sand pit is a takeoff board. Jumpers usually try to plant their toes at the front edge of this board to maximize jumping distance. The absolute distance between the front edge of the takeoff board and the spot where the toe actually lands on the board prior to jumping is called “takeoff error.” Is takeoff error in the long jump linearly related to best jumping distance? To answer this question, let’s use the data below collected on 18 novice long jumpers at a high school track meet. The average takeoff error and best jumping distance (out of 3 jumps) for each jumper are recorded.

### LONG JUMP DATA ###
best_jump<-c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22, 
             5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42)
avg_takeoff<-c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
               .09, .13, .16, .03, .50, .13, .04, .07, .04)

1. What is the response variable? What is the explanatory variable?

2. Below is the scatterplot of best jump distance vs. average takeoff error. Describe the relationship between a jumper’s best jump distance and their average takeoff error. What do you notice about one of the data points?

### SCATTERPLOT ###
plot(avg_takeoff, best_jump,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Jump Distance vs Takeoff Error")

[Figure: scatterplot of best jump distance vs. average takeoff error]

3. Below is the Residual Plot. Do you notice the outlier?

### MODEL AND RESIDUAL PLOT ###
jump_mod<-lm(best_jump~avg_takeoff)

plot(avg_takeoff, resid(jump_mod),pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Residual", 
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot]

4. The scatterplot with the least-squares regression line drawn in and R output from the simple linear regression analysis are shown below.

### SUMMARY ###
summary(jump_mod)
## 
## Call:
## lm(formula = best_jump ~ avg_takeoff)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7116 -0.2464 -0.0604  0.1833  0.8561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.3480     0.1635  32.711 4.39e-16 ***
## avg_takeoff   0.5299     0.9254   0.573    0.575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4115 on 16 degrees of freedom
## Multiple R-squared:  0.02008,    Adjusted R-squared:  -0.04116 
## F-statistic: 0.3279 on 1 and 16 DF,  p-value: 0.5749

### FITTED LINE ###
plot(avg_takeoff, best_jump,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Jump Distance vs Takeoff Error")
abline(coefficients(jump_mod),lwd=2, lty=2, 
       col="red" )

[Figure: scatterplot with the least-squares regression line]

a. What is the slope of the line?

0.5299

b. Explain what the slope means in terms of this problem.

For every one-meter increase in average takeoff error, the predicted best jump distance increases by 0.5299 meters.
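Because the observed takeoff errors only range from 0.03 to 0.50 meters, a one-meter increase lies far outside the data, so it can help to restate the slope per 0.1 meter. The sketch below does this with the lesson's data. The data frame `jumps` and the names `fit_all` and `new_errors` are introduced here only for illustration and are not part of the lesson's code.

```r
# Same data as in the lesson, stored in a data frame (an illustrative choice).
jumps <- data.frame(
  best_jump = c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
                5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42),
  avg_takeoff = c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                  .09, .13, .16, .03, .50, .13, .04, .07, .04)
)
fit_all <- lm(best_jump ~ avg_takeoff, data = jumps)

# Predicted best jump at average takeoff errors of 0.10 m and 0.20 m.
new_errors <- data.frame(avg_takeoff = c(0.10, 0.20))
pred <- predict(fit_all, newdata = new_errors)

# The difference between the two predictions is one tenth of the slope.
slope_per_tenth <- unname(diff(pred))
slope_per_tenth
```

With the slope of 0.5299 reported above, this difference works out to about 0.053 meters of predicted jump distance per 0.1 meter of takeoff error.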

c. Does this interpretation of the slope seem to be representative of the majority of the data points? (That is, does this regression line seem to “fit” the data well?)

No. Most of the points suggest a weak negative relationship; the positive slope appears to be driven by the single jumper with an average takeoff error of 0.50 meters.

d. Calculate the correlation coefficient.

### CORRELATION ###
cor(avg_takeoff, best_jump)
## [1] 0.141707
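The correlation can be tied back to the regression output: in simple linear regression, squaring r gives the Multiple R-squared, and the sign of r matches the sign of the slope. The following is a small check of that relationship on the lesson's data; the names `jumps`, `fit_all`, `r`, and `r_squared` are chosen here for illustration.

```r
# Same data as in the lesson, stored in a data frame (an illustrative choice).
jumps <- data.frame(
  best_jump = c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
                5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42),
  avg_takeoff = c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                  .09, .13, .16, .03, .50, .13, .04, .07, .04)
)
fit_all <- lm(best_jump ~ avg_takeoff, data = jumps)

# Correlation coefficient and the R-squared reported by summary().
r <- with(jumps, cor(avg_takeoff, best_jump))
r_squared <- summary(fit_all)$r.squared

# r^2 and R-squared should agree; both describe the same linear fit.
c(r = r, r_squared_from_r = r^2, r_squared_from_lm = r_squared)

# The sign of r matches the sign of the fitted slope.
sign(r) == sign(coef(fit_all)[["avg_takeoff"]])
```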

5. When is an outlier “influential”?

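When removing it would substantially change the regression line. This tends to happen when the point deviates greatly from the overall pattern of the other points.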

6. Are all outliers influential? If not, draw a scatterplot with an outlier that is not influential.

No. If the outlier is in line with the trend of the rest of the data points, it will not be influential.
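One way to draw such a scatterplot is sketched below. The data are made up purely for illustration (they are not from the lesson): ten points scattered around a line, plus one point with an extreme x value that falls on the same line. All variable names are hypothetical.

```r
# Made-up data for illustration only (not from the lesson):
# ten points scattered around the line y = 2 + 0.5x.
x <- 1:10
noise <- c(0.2, -0.1, 0.1, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.0)
y <- 2 + 0.5 * x + noise

# Add one point with an extreme x value that lies exactly on the same line.
x_out <- c(x, 30)
y_out <- c(y, 2 + 0.5 * 30)

fit_without <- lm(y ~ x)
fit_with <- lm(y_out ~ x_out)

# Compare the slopes fitted without and with the extreme point.
slopes <- c(without = unname(coef(fit_without)[2]),
            with = unname(coef(fit_with)[2]))
slopes

# Scatterplot with both fitted lines; they nearly coincide.
plot(x_out, y_out, pch = 16, xlab = "x", ylab = "y",
     main = "Outlier in x That Follows the Trend")
abline(fit_without, lwd = 2, lty = 2, col = "blue")
abline(fit_with, lwd = 2, lty = 3, col = "red")
```

The two slopes differ only slightly, so the extreme point is an outlier in x but is not influential.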

7. Let’s see if the outlier in the Long Jump example is influential by removing it from the analysis.

a. The scatterplot and residual plot of the data without the outlier are given below. Describe the relationship between a jumper’s best jump distance and their average takeoff error without the outlier.

### REMOVE OUTLIER ###
avg_takeoff_no<-avg_takeoff[-14]
best_jump_no<-best_jump[-14]

mod_no<-lm(best_jump_no~avg_takeoff_no)

par(mfrow=c(1,2))
### SCATTERPLOT (OUTLIER REMOVED)###
plot(avg_takeoff_no, best_jump_no,pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Best Jump Distance (meters)", 
     main="Scatterplot of Jump Distance vs Takeoff Error (OUTLIER REMOVED)")
abline(coefficients(mod_no),lwd=2, lty=2, 
       col="red" )

### RESIDUAL ####
plot(avg_takeoff_no, resid(mod_no),pch=16, 
     xlab="Average Takeoff Error (meters)", 
     ylab="Residual", 
     main="Residual Plot (OUTLIER REMOVED)")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: scatterplot with regression line and residual plot, outlier removed]

b. Below is the output from a simple linear regression without the outlier. Write the equation of the least-squares regression line. How has the slope changed?

### SUMMARY ###
summary(mod_no)
## 
## Call:
## lm(formula = best_jump_no ~ avg_takeoff_no)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7009 -0.2435 -0.0394  0.1857  0.7533 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.4887     0.2246  24.440  1.7e-13 ***
## avg_takeoff_no  -0.7318     1.6583  -0.441    0.665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 15 degrees of freedom
## Multiple R-squared:  0.01282,    Adjusted R-squared:  -0.053 
## F-statistic: 0.1947 on 1 and 15 DF,  p-value: 0.6653

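The equation is: predicted best jump distance = 5.4887 - 0.7318 × (average takeoff error). The slope changed direction, from 0.5299 with the outlier to -0.7318 without it.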

c. Is the outlier influential? Explain.

Yes. Removing this single point reverses the sign of the slope (from 0.5299 to -0.7318), which completely changes our understanding of the direction of the effect of takeoff error.
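The comparison behind this answer can be made explicit by fitting the model with and without observation 14 and placing the coefficients side by side. This is a minimal sketch using the lesson's data; the data frame `jumps` and the names `fit_all`, `fit_no14`, and `comparison` are assumptions made here for illustration.

```r
# Same data as in the lesson, stored in a data frame (an illustrative choice).
jumps <- data.frame(
  best_jump = c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
                5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42),
  avg_takeoff = c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                  .09, .13, .16, .03, .50, .13, .04, .07, .04)
)

# Fit once with all 18 jumpers and once without observation 14.
fit_all <- lm(best_jump ~ avg_takeoff, data = jumps)
fit_no14 <- lm(best_jump ~ avg_takeoff, data = jumps[-14, ])

# One row of coefficients per fit.
comparison <- rbind(all_points = coef(fit_all),
                    without_14 = coef(fit_no14))
comparison

# Change in slope caused by dropping observation 14.
slope_change <- comparison["without_14", "avg_takeoff"] -
  comparison["all_points", "avg_takeoff"]
slope_change
```

A change in slope that is larger in size than the original slope itself is what makes this point influential.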

8. Do you think it is legitimate to remove the outlier from the analysis? Why or why not?

If there was an error in data entry, then yes, it is legitimate to remove it. However, if no error was made, it is not legitimate to remove the data point just to fit our understanding.

9. Strategy for dealing with outliers

There are two points to this exercise:

a. Outliers are unusual data points. They can be unusual in the sense that they have large residuals (unusual response, or y, values), or in the sense that their explanatory (or x) values are much different from the rest. (A point can be unusual in both ways.) What makes an outlier a concern, however, is whether it is influential. An influential observation is a point whose removal would substantially change the regression line (for example, its slope). An outlier may or may not be influential, but outliers with unusual values of the explanatory variable have more potential to be influential.

b. So the question is: what should we do with these outliers? This brings us to the second point, how to deal with outliers.
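Before turning to that question, the two kinds of unusualness described in point (a) can be checked numerically. The sketch below uses standard regression diagnostics from base R (`hatvalues()` for unusual x values, `rstandard()` for unusual residuals, and `cooks.distance()` for overall influence). These diagnostics are not part of the lesson itself; they are offered as one possible way to quantify the ideas above, and the names `jumps`, `fit_all`, `diagnostics`, and `high_leverage` are illustrative.

```r
# Same data as in the lesson, stored in a data frame (an illustrative choice).
jumps <- data.frame(
  best_jump = c(5.30, 5.55, 5.47, 5.45, 5.07, 5.32, 6.15, 4.70, 5.22,
                5.77, 5.12, 5.77, 6.22, 5.82, 5.15, 4.92, 5.20, 5.42),
  avg_takeoff = c(.09, .17, .19, .24, .16, .22, .09, .12, .09,
                  .09, .13, .16, .03, .50, .13, .04, .07, .04)
)
fit_all <- lm(best_jump ~ avg_takeoff, data = jumps)

diagnostics <- data.frame(
  avg_takeoff = jumps$avg_takeoff,
  leverage = hatvalues(fit_all),     # large when x is far from the other x values
  std_resid = rstandard(fit_all),    # large in size when y is far from the line
  cooks_d = cooks.distance(fit_all)  # combines both into a measure of influence
)

# Show the three observations with the largest Cook's distance first.
head(diagnostics[order(-diagnostics$cooks_d), ], 3)

# Rule-of-thumb leverage cutoff 2p/n (a convention, not a strict test),
# with p = 2 coefficients and n = 18 jumpers.
high_leverage <- which(diagnostics$leverage > 2 * 2 / nrow(jumps))
high_leverage
```

In this data set, the jumper with the 0.50-meter takeoff error stands out on leverage rather than on residual size, which matches the point above that unusual explanatory values carry the most potential for influence.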