Regression

Goal

  • Our analyses up to now have concerned one variable.
  • Even our two-variable tests were, in a sense, about one random variable, since they examined, for example, a difference in calories.
  • Now we want to look for a way to relate two variables.

Example

The following data set contains carbon dioxide levels for the years from 1990 to 2010. The two variables are years and CO\( _2 \) levels. The question is: Is there a relationship between year and CO\( _2 \) level?

library(resampledata)
head(Maunaloa)
  ID Year  Level
1  1 1990 357.08
2  2 1991 359.00
3  3 1992 359.45
4  4 1993 360.07
5  5 1994 361.48
6  6 1995 363.62

First

Plot CO\( _2 \) as a function of year:

plot(Level ~ Year, data = Maunaloa)

There appears to be a “strong positive linear relationship”.

Language

  • strong positive
  • weak positive
  • negative (if the slope is negative)

Another example

  • From the data set NBA1617, plot PercFG as a function of OffReb.
plot(PercFG ~ OffReb, data = NBA1617)

There is a positive relationship, but not strong.

Quantify

  • We'd like to quantify this relationship
    • to make “strong” precise
    • to be able to predict

Correlation

  • Correlation measures the strength and direction of a linear relationship
  • correlation is the covariance divided by the product of the two standard deviations - see pages 301 to 304
  • correlation is always a number between -1 and 1.
    • closer to 1, stronger positive
    • closer to -1, stronger negative
cor(Maunaloa$Year, Maunaloa$Level)
[1] 0.9951825
  • So this correlation is quite strong.

  • the correlation is usually denoted with \( r \)
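
To make the link between covariance and correlation concrete, here is a minimal sketch. It assumes only the first six rows of Maunaloa (copied from the head() output), so the value differs from the full-data correlation; the name co2 is just for illustration.

```r
# First six rows of Maunaloa, copied from the head() output,
# so the example runs without the resampledata package.
co2 <- data.frame(
  Year  = 1990:1995,
  Level = c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)
)

# Correlation = covariance scaled by both standard deviations.
r_by_hand <- cov(co2$Year, co2$Level) / (sd(co2$Year) * sd(co2$Level))

# Built-in correlation for comparison.
r_builtin <- cor(co2$Year, co2$Level)

# The two agree up to floating-point error.
print(all.equal(r_by_hand, r_builtin))

# Correlation is symmetric: swapping the arguments gives the same value.
print(all.equal(cor(co2$Level, co2$Year), r_builtin))
```

Because the scaling divides out the units, \( r \) is the same whether Level is measured in ppm or any rescaled unit.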

Basketball correlation

  • Find the correlation between PercFG and OffReb
cor(NBA1617$PercFG, NBA1617$OffReb)
[1] 0.4898055

Fit a line

  • The next natural step is to fit a line to the data.

  • It's easiest to see with the basketball plot.

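[Figure: scatterplot of PercFG against OffReb with a line through the points and a vertical segment joining each point to the line]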

  • Imagine that the vertical line segments are adjustable,
  • but each has one end (open circle) on a common line.
  • Find the line that minimizes the sum of the squares of the lengths of the segments.
  • The resulting line is called the least-squares regression line.
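
The minimization described above can be sketched by hand. This is an illustration, not the slides' own code: it assumes only the first six Maunaloa rows, uses the standard one-predictor formulas (slope = covariance / variance of x, intercept = mean of y minus slope times mean of x), and the helper sse is a hypothetical name.

```r
# First six rows of Maunaloa, copied from the head() output.
co2 <- data.frame(
  Year  = 1990:1995,
  Level = c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)
)

# Hypothetical helper: sum of squared vertical distances from the
# points to the line y = intercept + slope * x.
sse <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Closed-form least-squares solution for a single predictor.
slope     <- cov(co2$Year, co2$Level) / var(co2$Year)
intercept <- mean(co2$Level) - slope * mean(co2$Year)

best <- sse(intercept, slope, co2$Year, co2$Level)

# Nudging the slope or the intercept in either direction
# makes the sum of squares larger.
nudged <- c(
  sse(intercept, slope + 0.05, co2$Year, co2$Level),
  sse(intercept, slope - 0.05, co2$Year, co2$Level),
  sse(intercept + 0.5, slope, co2$Year, co2$Level),
  sse(intercept - 0.5, slope, co2$Year, co2$Level)
)
print(all(nudged > best))
```

Squaring the lengths keeps positive and negative misses from cancelling and penalizes large misses more heavily than small ones.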

R makes it easy:

lm(Level~Year, data=Maunaloa)

Call:
lm(formula = Level ~ Year, data = Maunaloa)

Coefficients:
(Intercept)         Year  
  -3279.593        1.826  

So the least-squares regression line is \[ \text{Level} = -3279.6 + 1.826\cdot \text{Year}. \]

Basketball example

Find the regression line for the basketball example.

lm(PercFG~OffReb, data=NBA1617)

Call:
lm(formula = PercFG ~ OffReb, data = NBA1617)

Coefficients:
(Intercept)       OffReb  
   42.82337      0.05762  

So, the least-squares regression line is \[ \text{PercFG} = 42.823 + 0.058\cdot \text{OffReb}. \]

Prediction

  • We can use regression lines to predict
levelregression <- lm(Level~Year, data=Maunaloa)
predict(levelregression, newdata=data.frame(Year=2015))
       1 
400.6084 
  • As with any model, be careful about the limits of the model. Predictions far outside the range of the observed data (extrapolation) are unreliable.
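
As a rough sketch of what predict() does for a simple regression, the prediction is just the intercept plus the slope times the new value. The example assumes only the first six Maunaloa rows, so its numbers differ from the full-data fit; 1996 is an arbitrary year chosen for illustration, and is_extrapolation is a hypothetical helper, not part of R.

```r
# First six rows of Maunaloa, copied from the head() output.
co2 <- data.frame(
  Year  = 1990:1995,
  Level = c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)
)

fit <- lm(Level ~ Year, data = co2)

# predict() evaluates intercept + slope * Year at the new value.
b <- coef(fit)
by_hand    <- unname(b["(Intercept)"] + b["Year"] * 1996)
by_predict <- unname(predict(fit, newdata = data.frame(Year = 1996)))
print(all.equal(by_hand, by_predict))

# Hypothetical helper: flag new values outside the observed range.
is_extrapolation <- function(new_x, observed_x) {
  new_x < min(observed_x) | new_x > max(observed_x)
}

# FALSE for 1993 (inside 1990-1995), TRUE for 2015 (outside).
print(is_extrapolation(c(1993, 2015), co2$Year))
```

A check like this does not say how wrong an extrapolated prediction is; it only flags that the model has no data to support it there.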

Prediction in the NBA

  • The top 5 offensive rebounders in this data set are:
library(dplyr)
NBA1617 %>% 
  select(Name, PercFG, OffReb) %>% 
  top_n(n=5, OffReb) %>% 
  arrange(desc(OffReb))
                    Name PercFG OffReb
1       Tristan Thompson   60.0    286
2      LaMarcus Aldridge   47.7    172
3 Michael Kidd-Gilchrist   47.7    156
4              David Lee   59.0    149
5             Kevin Love   42.7    148
  • What would the predicted field goal percentage be for a player with 300 offensive rebounds?
bbregression <- lm(PercFG~OffReb, data=NBA1617)
predict(bbregression, newdata=data.frame(OffReb=300))
      1 
60.1099 
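  • What are some contextual problems with this prediction?
  • A couple of the top offensive rebounders are not high-percentage field goal shooters.
  • 300 offensive rebounds is beyond the largest value in the data (286), so this is an extrapolation.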
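
One way to see the problem is to compare the line's predictions with the actual percentages for the five players listed above. This sketch assumes only the five rows and the rounded coefficients printed in the lm() output, so it runs without the NBA1617 data set; the names top5, Predicted, and Residual are illustrative.

```r
# The five players from the table above.
top5 <- data.frame(
  Name   = c("Tristan Thompson", "LaMarcus Aldridge",
             "Michael Kidd-Gilchrist", "David Lee", "Kevin Love"),
  PercFG = c(60.0, 47.7, 47.7, 59.0, 42.7),
  OffReb = c(286, 172, 156, 149, 148)
)

# Coefficients as printed by lm(PercFG ~ OffReb, data = NBA1617).
intercept <- 42.82337
slope     <- 0.05762

# Predicted percentage from the line, and actual minus predicted.
top5$Predicted <- intercept + slope * top5$OffReb
top5$Residual  <- top5$PercFG - top5$Predicted

print(top5)
```

Three of the five residuals are negative, some by several percentage points, which fits the moderate correlation of about 0.49: the line captures a trend, not individual players.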

\( r^2 \)

  • Often \( r^2 \) is reported instead of \( r \).
  • \( r^2 \) is the proportion of the variance (of the observed values) explained by the regression line
  • that is:
    • the observed values have some variance (the standard deviation squared)
    • a good model would explain/predict a high proportion of this variance
    • \( r^2 \) is a value between 0 and 1. The closer it is to 1, the better the line fits the data.
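
The "proportion of variance explained" reading can be checked numerically. This is a minimal sketch assuming only the first six Maunaloa rows; the names sst and sse are illustrative, and the value differs from the full-data \( r^2 \).

```r
# First six rows of Maunaloa, copied from the head() output.
co2 <- data.frame(
  Year  = 1990:1995,
  Level = c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)
)

fit <- lm(Level ~ Year, data = co2)

# Total variation of the observed values around their mean.
sst <- sum((co2$Level - mean(co2$Level))^2)

# Variation left unexplained by the regression line.
sse <- sum(residuals(fit)^2)

# Proportion of the variation explained by the line.
r2_by_hand <- 1 - sse / sst

# For simple linear regression this equals the squared correlation,
# and it is what summary() reports as r.squared.
r2_from_cor <- cor(co2$Year, co2$Level)^2
r2_from_lm  <- summary(fit)$r.squared

print(all.equal(r2_by_hand, r2_from_cor))
print(all.equal(r2_by_hand, r2_from_lm))
```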