Regression

Goal

Our analyses up to now have concerned one variable.
Even the two-variable tests were, in a sense, about a single random variable, since they examined, for example, the difference in calories.
Now we want to look for a way to relate two variables.

Example

The following data set contains carbon dioxide levels for the years 1990 to 2010. The two variables are year and CO$_2$ level. The question is: is there a relationship between year and CO$_2$ level?

library(resampledata)
head(Maunaloa)

  ID Year  Level
1  1 1990 357.08
2  2 1991 359.00
3  3 1992 359.45
4  4 1993 360.07
5  5 1994 361.48
6  6 1995 363.62

First

Plot CO$_2$ level as a function of year:

plot(Level~Year, data=Maunaloa)


There appears to be a “strong positive linear relationship”.

Language

strong positive
weak positive
negative (if the slope is negative)

Another example

From the data set NBA1617, plot PercFG (field goal percentage) as a function of OffReb (offensive rebounds).

plot(PercFG~OffReb, data=NBA1617)


There is a positive relationship, but it is not strong.

Quantify

We'd like to quantify this relationship
- in order to make “strong” precise
- in order to be able to predict

Correlation

Correlation measures the strength and direction of the linear relationship between two variables.
Correlation is a function of covariance (see pages 301 to 304).
Correlation is always a number between -1 and 1.
- closer to 1, stronger positive
- closer to -1, stronger negative
- closer to 0, weaker

cor(Maunaloa$Year, Maunaloa$Level)

[1] 0.9951825

So this correlation is quite strong.
The correlation is usually denoted by $r$.
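
To make the link between correlation and covariance concrete, here is a minimal sketch that computes $r$ by hand as the covariance divided by the product of the two standard deviations, then compares it with cor(). It uses only the six rows printed by head(Maunaloa) above, so the value will differ from the full-data correlation. The names year, level, and r_by_hand are ours, chosen for illustration.

```r
# First six rows of Maunaloa, copied from the head() output above
year  <- c(1990, 1991, 1992, 1993, 1994, 1995)
level <- c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)

# Correlation is the covariance rescaled by both standard deviations
r_by_hand <- cov(year, level) / (sd(year) * sd(level))

r_by_hand          # computed from the definition
cor(year, level)   # built-in function; should agree up to floating-point error
```

Dividing by the standard deviations removes the units, which is why $r$ always lands between -1 and 1 no matter how the variables are measured.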

Basketball correlation

Find the correlation between PercFG and OffReb.

cor(NBA1617$PercFG, NBA1617$OffReb)

[1] 0.4898055

Fit a line

The next natural step is to fit a line to the data.


It's easiest to see with the basketball plot.

(Figure: the basketball scatterplot with a line through the points and a vertical segment joining each point to the line.)

Imagine that the vertical line segments are adjustable, but each has one end (the open circle) on a common line.
Find the line that minimizes the sum of the squares of the lengths of the segments.
The resulting line is called the least-squares regression line.
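
With a single predictor, the minimizing line has a closed form: the slope is $r \cdot s_y / s_x$ and the line passes through the point of means $(\bar{x}, \bar{y})$. The sketch below applies that formula to the six rows printed by head(Maunaloa), so the coefficients will differ from those of the full data set. It then checks that nudging the slope makes the sum of squares larger. The helper sum_sq and the variable names are ours, not part of any package.

```r
# First six rows of Maunaloa, copied from the head() output above
year  <- c(1990, 1991, 1992, 1993, 1994, 1995)
level <- c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)

# Closed-form least-squares estimates for one predictor
slope     <- cor(year, level) * sd(level) / sd(year)
intercept <- mean(level) - slope * mean(year)

# Sum of squared vertical distances from the points to the line a + b * x
sum_sq <- function(a, b) sum((level - (a + b * year))^2)

sum_sq(intercept, slope)          # the smallest achievable value
sum_sq(intercept, slope + 0.01)   # a slightly different line does worse
coef(lm(level ~ year))            # lm() finds the same intercept and slope
```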

R makes it easy:

lm(Level~Year, data=Maunaloa)


Call:
lm(formula = Level ~ Year, data = Maunaloa)

Coefficients:
(Intercept)         Year  
  -3279.593        1.826

So the least-squares regression line is \[ Level = -3279.6 + 1.826 \cdot Year. \]

Basketball example

Find the regression line for the basketball example.

lm(PercFG~OffReb, data=NBA1617)


Call:
lm(formula = PercFG ~ OffReb, data = NBA1617)

Coefficients:
(Intercept)       OffReb  
   42.82337      0.05762

So, the least-squares regression line is \[ PercFG = 42.823 + 0.058 \cdot OffReb. \]

Prediction

We can use regression lines to predict:

levelregression <- lm(Level~Year, data=Maunaloa)
predict(levelregression, newdata=data.frame(Year=2015))

       1 
400.6084

As with any model, be careful about its limits. The data end in 2010, so a prediction for 2015 already extrapolates; predictions far outside the range of the data are unreliable.
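
A prediction is just the fitted line evaluated at the new year. As a rough check, the sketch below plugs 2015 into the coefficients exactly as lm() printed them above. Because the printed slope is rounded, the hand calculation gives about 399.8 rather than the 400.61 returned by predict(): a rounding error of up to 0.0005 in the slope is multiplied by 2015. The variable names are ours.

```r
# Coefficients as printed by lm() above (rounded for display)
intercept_printed <- -3279.593
slope_printed     <- 1.826

# Evaluate the line at Year = 2015
by_hand <- intercept_printed + slope_printed * 2015
by_hand

# Largest error that rounding the slope to three decimals can cause at Year = 2015
max_rounding_error <- 0.0005 * 2015
max_rounding_error
```

To avoid this, use predict() as above, or take the unrounded values from coef(levelregression).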

Prediction in the NBA

The top 5 offensive rebounders in this data set are:

library(dplyr)
NBA1617 %>% 
  select(Name, PercFG, OffReb) %>% 
  top_n(n=5, OffReb) %>% 
  arrange(desc(OffReb))

                    Name PercFG OffReb
1       Tristan Thompson   60.0    286
2      LaMarcus Aldridge   47.7    172
3 Michael Kidd-Gilchrist   47.7    156
4              David Lee   59.0    149
5             Kevin Love   42.7    148

What would the predicted field goal percentage be for a player with 300 offensive rebounds?

bbregression <- lm(PercFG~OffReb, data=NBA1617)
predict(bbregression, newdata=data.frame(OffReb=300))

      1 
60.1099

What are some contextual problems with this prediction?

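A couple of the top offensive rebounders are not high-percentage field goal shooters: Kevin Love, for example, shoots 42.7% with 148 offensive rebounds. Also, 300 offensive rebounds is more than any player in the data set has (the maximum is 286), so this prediction extrapolates.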

$r^2$

Often $r^2$ is reported instead of $r$.
$r^2$ is the proportion of the variance of the observed values that is explained by the regression line.
That is:
- the observed values have some variance (the standard deviation squared)
- a good model explains/predicts a high proportion of this variance
- $r^2$ is a value between 0 and 1; the closer to 1, the better the line fits the data.
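
The "proportion of variance explained" reading can be checked directly. The sketch below, again using only the six rows printed by head(Maunaloa), computes $r^2$ as one minus the ratio of leftover variation to total variation and compares it with the squared correlation and with the value R reports. The names ss_total, ss_residual, and r_squared are ours.

```r
# First six rows of Maunaloa, copied from the head() output above
year  <- c(1990, 1991, 1992, 1993, 1994, 1995)
level <- c(357.08, 359.00, 359.45, 360.07, 361.48, 363.62)

fit <- lm(level ~ year)

ss_total    <- sum((level - mean(level))^2)   # total variation in the observed values
ss_residual <- sum(residuals(fit)^2)          # variation left unexplained by the line

r_squared <- 1 - ss_residual / ss_total
r_squared                  # proportion of variation explained
cor(year, level)^2         # the correlation, squared
summary(fit)$r.squared     # the value reported by summary()
```

All three numbers should agree up to floating-point error, which is why $r^2$ can be read either as the square of the correlation or as a share of variance.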