KITADA
Lesson #14
Motivation:
In Lesson 13, we learned how to describe the linear relationship between the response and explanatory variable with an equation. The least-squares regression line fits between the points, but not all of the data points fall directly on the line. The vertical distance between a point and the least-squares regression line is called a residual. In this lesson, we’ll learn how to calculate residuals and make a graph that examines the relationship between the residuals and the values of the explanatory variable, which is called a residual plot. The residual plot is used extensively throughout regression to make sure certain conditions exist in the data to ensure that conclusions made from the sample data are valid to a population of interest.
What you need to know from this lesson: After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
The “Clear-Cut” Example:
Landslides are common events in tree-growing regions of the Pacific Northwest, so their effect on timber growth is of special concern to foresters. The article “Effects of Landslide Erosion on Subsequent Douglas Fir Growth and Stocking Levels in the Western Cascades, Oregon” (Soil Science Society of American Journals (1984)) reported on the results of a study in which growth in a landslide area was compared with growth in a previously clear-cut area. Here we consider clear-cut growth areas only. Data on the age of the tree and its most recent 5-year height growth (in cm) are given below:
### CLEAR CUT EXAMPLE ###
age<-c(5, 9, 9, 10, 10, 11, 11, 12,
13, 13, 14, 14, 15, 15, 18, 18)
five_year<-c(70, 150, 260, 230, 255, 165, 225, 340,
305, 335, 290, 340, 225, 300, 380, 400)
mod<-lm(five_year~age)
1. Residuals
a. What was the 5-year growth of the 12-year old tree in this sample?
340 cm
b. Using the equation of the least-squares regression line from Lesson 13 (\( \hat{y}=4.352+21.322 \times 10 \)) predict the most recent 5-year growth of the 12-year tree.
### PREDICT FOR 12 YEARS ###
4.352+21.322*12 # CM
## [1] 260.216
c. The difference between the observed growth and the predicted growth (or “expected” growth) for any tree is called the residual. That is:
residual = observed value – predicted value
Calculate the residual for the 12-year old tree.
### RESIDUAL FOR 12 YEAR OLD TREE ###
340-(4.352+21.322*12)
## [1] 79.784
d. Residuals can be calculated for each of the trees in the sample. Below is a table of the age of the tree (years), the 5-year growth of the tree (cm), the predicted 5-year growth (“Fit”), and their residuals (also in cm).
### TABLE OF PREDICTED AND RESIDUALS ###
tree_table<-cbind(age,
five_year,
predicted=round(fitted(mod),2),
residual=round(resid(mod),2))
tree_table
## age five_year predicted residual
## 1 5 70 110.96 -40.96
## 2 9 150 196.25 -46.25
## 3 9 260 196.25 63.75
## 4 10 230 217.57 12.43
## 5 10 255 217.57 37.43
## 6 11 165 238.89 -73.89
## 7 11 225 238.89 -13.89
## 8 12 340 260.21 79.79
## 9 13 305 281.53 23.47
## 10 13 335 281.53 53.47
## 11 14 290 302.86 -12.86
## 12 14 340 302.86 37.14
## 13 15 225 324.18 -99.18
## 14 15 300 324.18 -24.18
## 15 18 380 388.14 -8.14
## 16 18 400 388.14 11.86
e. What would you expect the sum of all 16 residuals to be? Therefore, what would you expect the average of the residuals to be? Verify this.
We would expect for this sum to be zero (or very close to it). So the average residual would be zero.
### SUM OF RESID ###
mean(resid(mod))
## [1] 3.663736e-15
### MEAN OF RESID ###
sum(resid(mod))
## [1] 5.861978e-14
2. Residual plot
a. A residual plot is a scatter plot of the values of the explanatory variable and their residuals, with the residuals on the y-axis and the explanatory variable (age) on the x-axis. Because the average of the residuals is 0 (see #1 part (e) above), we place 0 in the middle of the y-axis.
Using the data on the previous page, construct the residual plot by hand.
b. Compare your residual plot to the residual plot from R:
### RESIDUAL PLOT ###
plot(age, resid(mod),pch=16,
xlab="Age (years)",
ylab="Residuals",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
A “good” residual plot will look as if the points (residuals) are randomly scattered. That is, there will be no patterns to the residuals. Does that appear to be the case here?
c. The uses of the Residual Plot
1) To check to see if a linear relationship exists between the explanatory and response variable. If the residuals form a non-linear pattern, the relationship between the explanatory and response variable will be non-linear.
2) Identify possible outliers.
3) To check to see if the variation in the residuals remains constant over the entire range of the explanatory variable – more on this in Lesson 17.
d. Other examples. Comment on each of the following residual plots.
1) The baseball example (wins versus batting average for all Major League baseball teams in 2011)
### BASEBALL ###
base_mod<-lm(BASEBALL$WINS~BASEBALL$AVG)
par(mfrow=c(1,2))
plot(BASEBALL$AVG, BASEBALL$WINS, pch=16,
xlab="Team Batting Average",
ylab="Games Won",
main="Batting Average vs Games Won")
abline(coefficients(base_mod), lwd=2, lty=2,
col="red")
plot(BASEBALL$AVG, resid(base_mod), pch=16,
xlab="Team Batting Average",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
2) The orange juice example (sweetness index versus pectin)
### ORANGE JUICE ###
oj_mod<- with(OJUICE, lm(sweetness.index~pectin.ppm))
par(mfrow=c(1,2))
plot(OJUICE$pectin.ppm, OJUICE$sweetness.index, pch=16,
xlab="Pectin (ppm)",
ylab="Sweetness",
main="Pectin vs Sweetness")
abline(coefficients(oj_mod), lwd=2, lty=2,
col="red")
plot(OJUICE$pectin.ppm, resid(oj_mod), pch=16,
xlab="Pectin (ppm)",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
3) The Old Faithful Geyser example (interval between eruptions versus duration of last eruption)
### OLD FAITHFUL ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))
par(mfrow=c(1,2))
plot(OLDFAITHFUL$duration, OLDFAITHFUL$interval, pch=16,
xlab="Duration of Eruption",
ylab="Interval",
main="Duration vs Interval")
abline(coefficients(of_mod), lwd=2, lty=2,
col="red")
plot(OLDFAITHFUL$duration, resid(of_mod), pch=16,
xlab="Duration of Eruption",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
4) Y versus X example (the non-linear pattern)
### QUADRATIC ###
quad_mod <- with(XYQUADRATIC, lm(y~x))
par(mfrow=c(1,2))
plot(XYQUADRATIC$x, XYQUADRATIC$y, pch=16,
xlab="X",
ylab="Y",
main="X vs Y")
abline(coefficients(quad_mod), lwd=2, lty=2,
col="red")
plot(XYQUADRATIC$x, resid(quad_mod), pch=16,
xlab="X",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")