KITADA

Lesson #13

The Least-Squares Regression Line and Equation

Motivation:

In the past two lessons, we’ve mentioned fitting a line between the points. In this lesson, we’ll discuss how to best “fit” a line between the points when the relationship between the response and explanatory variables is linear. This “best-fitting” line is called the least-squares regression line and can be described by an equation. We’ll learn how to write the equation of the least-squares regression line, how to interpret the equation, and how to use it to make predictions.

What you need to know from this lesson: After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

The “Clear-Cut” Example:

Landslides are common events in tree-growing regions of the Pacific Northwest, so their effect on timber growth is of special concern to foresters. The article “Effects of Landslide Erosion on Subsequent Douglas Fir Growth and Stocking Levels in the Western Cascades, Oregon” (Soil Science Society of America Journal (1984)) reported on the results of a study in which growth in a landslide area was compared with growth in a previously clear-cut area. Here we consider clear-cut growth areas only. Data on the age of each tree (in years) and its most recent 5-year height growth (in cm) are given below:

### CLEAR CUT EXAMPLE ###
age<-c(5, 9, 9, 10, 10, 11, 11, 12,
       13, 13, 14, 14, 15, 15, 18, 18)
five_year<-c(70, 150, 260, 230, 255, 165, 225, 340, 
             305, 335, 290, 340, 225, 300, 380, 400)

1. Which of the above variables is the response variable and which is the explanatory variable?

2. Based on the scatterplot below, describe the relationship between age and 5-year growth:

### SCATTERPLOT ###
plot(age, five_year,pch=16, 
     xlab="Age (years)", 
     ylab="Five Year Growth (cm)", 
     main="Scatterplot of Age vs Five Year Growth")

[Figure: scatterplot of age vs. five-year growth]

3. A straight line can be fit between the points using the least-squares method. To see how the “best-fitting” line is chosen, suppose a scatterplot had four points:

a. Which line is the “best-fitting” line?

The line that minimizes the sum of the squared vertical distances between the points and the line (aka the least-squares regression line).
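
As a minimal sketch of this criterion (not part of the original handout), the R code below compares the sum of squared vertical distances for the least-squares line with that of one other candidate line, using the clear-cut data. The helper `sse()` and the candidate line (intercept 0, slope 22) are hypothetical and chosen only for illustration.

```r
### SUM OF SQUARED VERTICAL DISTANCES (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

# Hypothetical helper: sum of squared vertical distances between the
# observed growths and the line with intercept a and slope b
sse <- function(a, b) sum((five_year - (a + b * age))^2)

fit <- coef(lm(five_year ~ age))   # least-squares intercept and slope
sse_ls    <- sse(fit[1], fit[2])   # criterion for the least-squares line
sse_other <- sse(0, 22)            # criterion for an arbitrary candidate line

sse_ls < sse_other                 # TRUE: the least-squares line has the smaller sum
```

Trying other values in `sse()` should never produce a sum smaller than `sse_ls`, which is what “best-fitting” means here.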

b. This line is therefore called the least-squares regression line. The figure below shows the same scatterplot as above, with the least-squares regression line “fit” to the data. It is sometimes called the “Plot of the Fitted Model.”

### FIT LSRL ####
mod<-lm(five_year~age)

plot(age, five_year,pch=16, 
               xlab="Age (years)", 
               ylab="Five Year Growth (cm)", 
               main="Scatterplot of Age vs Five Year Growth")
abline(coefficients(mod), lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of age vs. five-year growth with the least-squares regression line]

4. Every line has an equation. Two pieces of information need to be known in order to write the equation of a line: the slope and the y-intercept.

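The equation of the least-squares regression line is: \( \hat{y}=a+bx \), where \( \hat{y} \) is the predicted value of the response variable, \( a \) is the y-intercept, \( b \) is the slope, and \( x \) is the value of the explanatory variable.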

5. Determining the values for the slope and y-intercept

There are formulas to determine the values of the slope and y-intercept. We will not use these formulas. Rather, we will rely on obtaining and interpreting output from R to determine the values of the slope and y-intercept. Even so, the formulas are included as an “appendix” to this lesson so that you are aware of how R determines these values.
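
For readers without the appendix at hand, here is a hedged sketch of one standard form of those formulas: \( b = r \, s_y / s_x \) and \( a = \bar{y} - b\bar{x} \), where \( r \) is the correlation coefficient, \( s_x \) and \( s_y \) are the sample standard deviations, and \( \bar{x} \) and \( \bar{y} \) are the sample means. This may not be the exact form the appendix uses, and the variable names below are assumptions made for illustration.

```r
### SLOPE AND Y-INTERCEPT "BY HAND" (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

r <- cor(age, five_year)               # correlation coefficient
b <- r * sd(five_year) / sd(age)       # slope
a <- mean(five_year) - b * mean(age)   # y-intercept

c(a = a, b = b)                        # hand-computed values
coef(lm(five_year ~ age))              # values computed by lm(), for comparison
```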

Below is some of the R output from the regression analysis:

### SUMMARY ###
summary(mod)
## 
## Call:
## lm(formula = five_year ~ age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.177 -28.373   1.858  37.216  79.788 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.352     49.511   0.088    0.931    
## age           21.322      3.883   5.491 7.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.43 on 14 degrees of freedom
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6602 
## F-statistic: 30.15 on 1 and 14 DF,  p-value: 7.954e-05

Using the R output, write the equation of the least-squares regression line. Explain the meaning of the “terms” in the equation.

\( \widehat{\text{five-year growth}} = 4.352 + 21.322 \times \text{age} \)
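
The two numbers in the equation come from the Estimate column of the coefficient table. As a small sketch, they can also be pulled out of the fitted model with `coef()`; the name `est` is an assumption used only here.

```r
### EXTRACT THE ESTIMATES (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

mod <- lm(five_year ~ age)
est <- coef(mod)    # named vector: "(Intercept)" is a, "age" is b
round(est, 3)       # rounded to match the summary() table
```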

6. Interpreting the slope and y-intercept

a. The slope of a line:

The slope tells us how much the predicted value of y changes for a one-unit increase in x.

For each additional year of age, the predicted most recent 5-year height growth increases by 21.322 cm, on average.
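
One way to check this reading of the slope is to compare predictions for two ages that are one year apart. The sketch below uses the rounded equation from the R output; the function name `growth_hat` is hypothetical.

```r
### SLOPE AS A ONE-YEAR DIFFERENCE (ILLUSTRATIVE SKETCH) ###
# Hypothetical helper: predicted 5-year growth (cm) from the rounded equation
growth_hat <- function(age) 4.352 + 21.322 * age

growth_hat(11) - growth_hat(10)   # difference between ages 10 and 11
growth_hat(15) - growth_hat(14)   # same difference for any one-year step
```

Both differences equal the slope (up to floating-point rounding), because the line rises by the same amount for every one-year step.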

b. The y-intercept:

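The y-intercept tells us the predicted value of y when x=0.

A tree that is 0 years old is predicted to have a five-year growth of 4.352 cm.

This does not make sense: a tree that is 0 years old cannot have any 5-year growth. An age of 0 is also far below the youngest tree in the data (5 years), so the y-intercept has no practical interpretation here.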

7. Prediction

One of the uses of a regression analysis is prediction. That is, we might be interested in predicting the value of the response variable for a given value of the explanatory variable. Prediction is what we expect to happen on average! The least-squares regression line can be thought of as what is happening on average (which is why the least-squares regression line is sometimes called a prediction line). Therefore, to predict the value of the response variable for a particular value of the explanatory variable, simply substitute a value of the explanatory variable into the least-squares regression equation and solve for \( \hat{y} \).

a. Using the least-squares regression equation, predict the most recent 5-year height growth of a 10-year-old tree grown in this clear-cut area. (Don’t forget units!!!)

### PREDICT FOR 10 YEARS ###
4.352+21.322*10 # CM
## [1] 217.572
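
As an alternative sketch, the same prediction can be obtained from the fitted model with `predict()`. Because `predict()` uses the unrounded coefficients, its result may differ from the hand calculation in the later decimal places. The name `pred_10` is an assumption used only here.

```r
### PREDICT FOR 10 YEARS WITH predict() (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

mod <- lm(five_year ~ age)

# newdata must be a data frame whose column name matches the explanatory variable
pred_10 <- predict(mod, newdata = data.frame(age = 10))
pred_10   # predicted 5-year growth, in cm
```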

b. Visualizing the predicted value. Where on the plot of the fitted model is the predicted value?

Draw a vertical line up from 10 on the x-axis and see where it intersects the fitted line. The height of that intersection point on the y-axis is the predicted value.
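
The construction just described can be sketched in R by adding a vertical segment and a marker to the plot of the fitted model. The plot title, colors, and marker style are arbitrary choices, not taken from the handout.

```r
### VISUALIZE THE PREDICTED VALUE (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

mod <- lm(five_year ~ age)
pred_10 <- predict(mod, newdata = data.frame(age = 10))

plot(age, five_year, pch = 16,
     xlab = "Age (years)",
     ylab = "Five Year Growth (cm)",
     main = "Predicted Five Year Growth at Age 10")
abline(mod, lwd = 2, lty = 2, col = "red")           # fitted line

# Vertical line from the bottom of the plot region up to the fitted line at age 10
segments(x0 = 10, y0 = par("usr")[3], x1 = 10, y1 = pred_10, lty = 3)
points(10, pred_10, pch = 4, cex = 2, lwd = 2, col = "blue")   # predicted value
```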

c. The least-squares regression line was calculated based on values of the age of a tree and the tree’s corresponding most recent 5-year growth from the sample data given at the beginning of this handout. Do you think you can use the above least-squares regression equation to predict the most recent 5-year growth of a 25-year-old tree in this clear-cut area (assuming that the area was clear-cut more than 25 years ago)? Why not? (This idea is called extrapolation.)

No. The data include no trees older than 18 years, so we cannot say whether the linear pattern continues out to 25 years.
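
A quick way to spot extrapolation is to compare the age of interest with the range of ages in the sample. The sketch below does this; the helper `is_extrapolation()` is hypothetical and not part of the handout.

```r
### CHECK FOR EXTRAPOLATION (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)

range(age)   # the observed ages run from 5 to 18 years

# Hypothetical helper: TRUE when x lies outside the observed ages
is_extrapolation <- function(x) x < min(age) | x > max(age)

is_extrapolation(25)   # TRUE: older than any sampled tree
is_extrapolation(10)   # FALSE: within the observed ages
is_extrapolation(0)    # TRUE: the y-intercept is also an extrapolation
```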

8. R-square

The data and the scatterplot show that not all of the 5-year growths are the same, so there is some variation in the response variable. The hope is that the least-squares regression line will fit between the data points in a manner that will “explain” quite a bit of that variation. The closer the data points are to the regression line, the higher the proportion of the variation in the response variable that is explained by the regression line.

What percent (or proportion) of the variation in the most recent 5-year tree growth is explained by the regression of tree growth on age? (This is called R-square, or \( R^2 \).) (Understand how to get R-square from the R output.)

### R-SQUARED ###
R2<-summary(mod)$r.squared
R2
## [1] 0.682883
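
To connect this number to the idea of “explained” variation, here is a hedged sketch that computes R-square from its usual definition, one minus the unexplained variation divided by the total variation. The names `sse`, `sst`, and `r2_by_hand` are assumptions used only for illustration.

```r
### R-SQUARED FROM ITS DEFINITION (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

mod <- lm(five_year ~ age)

sse <- sum(residuals(mod)^2)                   # variation left unexplained by the line
sst <- sum((five_year - mean(five_year))^2)    # total variation in 5-year growth

r2_by_hand <- 1 - sse / sst
r2_by_hand                                     # compare with summary(mod)$r.squared
```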

Use the value of R-square to determine the correlation coefficient.

### CORRELATION ###
sqrt(R2)
## [1] 0.8263674
cor(five_year, age)
## [1] 0.8263674
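
Note that the square root of R-square gives only the size of the correlation coefficient, never its sign. In simple linear regression the correlation has the same sign as the slope, which is positive here, so the two values above agree. A minimal sketch that handles either sign (the name `r_from_R2` is an assumption):

```r
### CORRELATION WITH THE SIGN OF THE SLOPE (ILLUSTRATIVE SKETCH) ###
age <- c(5, 9, 9, 10, 10, 11, 11, 12,
         13, 13, 14, 14, 15, 15, 18, 18)
five_year <- c(70, 150, 260, 230, 255, 165, 225, 340,
               305, 335, 290, 340, 225, 300, 380, 400)

mod <- lm(five_year ~ age)
R2  <- summary(mod)$r.squared

# Attach the sign of the slope to the square root of R-square
r_from_R2 <- unname(sign(coef(mod)["age"]) * sqrt(R2))

r_from_R2              # correlation recovered from R-square
cor(five_year, age)    # correlation computed directly
```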