Residuals and Residual Plots

Motivation:

In Lesson 13, we learned how to describe the linear relationship between the response and explanatory variable with an equation. The least-squares regression line fits between the points, but not all of the data points fall directly on the line. The vertical distance between a point and the least-squares regression line is called a residual. In this lesson, we’ll learn how to calculate residuals and make a graph that examines the relationship between the residuals and the values of the explanatory variable, which is called a residual plot. The residual plot is used extensively throughout regression to make sure certain conditions exist in the data to ensure that conclusions made from the sample data are valid to a population of interest.

What you need to know from this lesson: After completing this lesson, you should be able to

explain what a residual is
calculate any case’s residual
construct a residual plot by hand and using R
interpret a residual plot

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 2.6 in the text (focusing on the discussion related to residuals)
3. Do the Lesson 14 questions at the end of the lesson notes

The Lesson

The “Clear-Cut” Example:

Landslides are common events in tree-growing regions of the Pacific Northwest, so their effect on timber growth is of special concern to foresters. The article “Effects of Landslide Erosion on Subsequent Douglas Fir Growth and Stocking Levels in the Western Cascades, Oregon” (Soil Science Society of American Journals (1984)) reported on the results of a study in which growth in a landslide area was compared with growth in a previously clear-cut area. Here we consider clear-cut growth areas only. Data on the age of the tree and its most recent 5-year height growth (in cm) are given below:

### CLEAR CUT EXAMPLE ###
age<-c(5, 9, 9, 10, 10, 11, 11, 12,
       13, 13, 14, 14, 15, 15, 18, 18)
five_year<-c(70, 150, 260, 230, 255, 165, 225, 340, 
             305, 335, 290, 340, 225, 300, 380, 400)

mod<-lm(five_year~age)

1. Residuals

a. What was the 5-year growth of the 12-year old tree in this sample?

340 cm

b. Using the equation of the least-squares regression line from Lesson 13 (\( \hat{y}=4.352+21.322 \times 10 \)) predict the most recent 5-year growth of the 12-year tree.

### PREDICT FOR 12 YEARS ###
4.352+21.322*12 # CM

## [1] 260.216

c. The difference between the observed growth and the predicted growth (or “expected” growth) for any tree is called the residual. That is:

            residual = observed value – predicted value

Calculate the residual for the 12-year old tree.

### RESIDUAL FOR 12 YEAR OLD TREE ###
340-(4.352+21.322*12)

## [1] 79.784

d. Residuals can be calculated for each of the trees in the sample. Below is a table of the age of the tree (years), the 5-year growth of the tree (cm), the predicted 5-year growth (“Fit”), and their residuals (also in cm).

### TABLE OF PREDICTED AND RESIDUALS ###
tree_table<-cbind(age, 
                  five_year, 
                  predicted=round(fitted(mod),2),
                  residual=round(resid(mod),2))

tree_table

##    age five_year predicted residual
## 1    5        70    110.96   -40.96
## 2    9       150    196.25   -46.25
## 3    9       260    196.25    63.75
## 4   10       230    217.57    12.43
## 5   10       255    217.57    37.43
## 6   11       165    238.89   -73.89
## 7   11       225    238.89   -13.89
## 8   12       340    260.21    79.79
## 9   13       305    281.53    23.47
## 10  13       335    281.53    53.47
## 11  14       290    302.86   -12.86
## 12  14       340    302.86    37.14
## 13  15       225    324.18   -99.18
## 14  15       300    324.18   -24.18
## 15  18       380    388.14    -8.14
## 16  18       400    388.14    11.86

e. What would you expect the sum of all 16 residuals to be? Therefore, what would you expect the average of the residuals to be? Verify this.

We would expect for this sum to be zero (or very close to it). So the average residual would be zero.

### SUM OF RESID ###
mean(resid(mod))

## [1] 3.663736e-15

### MEAN OF RESID ###
sum(resid(mod))

## [1] 5.861978e-14

2. Residual plot

a. A residual plot is a scatter plot of the values of the explanatory variable and their residuals, with the residuals on the y-axis and the explanatory variable (age) on the x-axis. Because the average of the residuals is 0 (see #1 part (e) above), we place 0 in the middle of the y-axis.

Using the data on the previous page, construct the residual plot by hand.

b. Compare your residual plot to the residual plot from R:

### RESIDUAL PLOT ###
plot(age, resid(mod),pch=16, 
     xlab="Age (years)",
     ylab="Residuals",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

plot of chunk unnamed-chunk-6

A “good” residual plot will look as if the points (residuals) are randomly scattered. That is, there will be no patterns to the residuals. Does that appear to be the case here?

c. The uses of the Residual Plot

1) To check to see if a linear relationship exists between the explanatory and response variable. If the residuals form a non-linear pattern, the relationship between the explanatory and response variable will be non-linear.
2) Identify possible outliers.
3) To check to see if the variation in the residuals remains constant over the entire range of the explanatory variable – more on this in Lesson 17.

d. Other examples. Comment on each of the following residual plots.

Is there any curved pattern among the residuals, or are they “randomly scattered”? Therefore, is the relationship between the response and explanatory variables linear?
Are there any outliers?
Are any there any other “deviations from the pattern”?

1) The baseball example (wins versus batting average for all Major League baseball teams in 2011)

### BASEBALL ###
base_mod<-lm(BASEBALL$WINS~BASEBALL$AVG)

par(mfrow=c(1,2))
plot(BASEBALL$AVG, BASEBALL$WINS, pch=16, 
     xlab="Team Batting Average",
     ylab="Games Won", 
     main="Batting Average vs Games Won")

abline(coefficients(base_mod), lwd=2, lty=2, 
       col="red")

plot(BASEBALL$AVG, resid(base_mod), pch=16, 
     xlab="Team Batting Average",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

plot of chunk unnamed-chunk-8

2) The orange juice example (sweetness index versus pectin)

### ORANGE JUICE ###
oj_mod<- with(OJUICE, lm(sweetness.index~pectin.ppm))

par(mfrow=c(1,2))
plot(OJUICE$pectin.ppm, OJUICE$sweetness.index, pch=16, 
       xlab="Pectin (ppm)",
     ylab="Sweetness", 
     main="Pectin vs Sweetness")

abline(coefficients(oj_mod), lwd=2, lty=2, 
       col="red")

plot(OJUICE$pectin.ppm, resid(oj_mod), pch=16, 
     xlab="Pectin (ppm)",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

plot of chunk unnamed-chunk-10

3) The Old Faithful Geyser example (interval between eruptions versus duration of last eruption)

### OLD FAITHFUL ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))

par(mfrow=c(1,2))
plot(OLDFAITHFUL$duration, OLDFAITHFUL$interval, pch=16, 
       xlab="Duration of Eruption",
     ylab="Interval", 
     main="Duration vs Interval")

abline(coefficients(of_mod), lwd=2, lty=2, 
       col="red")

plot(OLDFAITHFUL$duration, resid(of_mod), pch=16, 
     xlab="Duration of Eruption",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

plot of chunk unnamed-chunk-12

4) Y versus X example (the non-linear pattern)

### QUADRATIC ###
quad_mod <- with(XYQUADRATIC, lm(y~x))

par(mfrow=c(1,2))
plot(XYQUADRATIC$x, XYQUADRATIC$y, pch=16, 
       xlab="X",
     ylab="Y", 
     main="X vs Y")

abline(coefficients(quad_mod), lwd=2, lty=2, 
       col="red")

plot(XYQUADRATIC$x, resid(quad_mod), pch=16, 
     xlab="X",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

plot of chunk unnamed-chunk-14