Regression in R

This is a simple overview of regression using R. It is intended as a quick intro to regression in R or a refresher, and it is not a comprehensive review. The intended audience should have some basic understanding of R. See the end of this document for more resources.

Also, I am not a trained statistician, so I rely heavily on people smarter than myself. See the end of the document for a list of some of these smart people and sources.

Finally, since this is a simple overview of regression I be dealing with ordinary least squares regression, which is the most common form of regression (although not always the best) and not with time series data (which can be much more complicated).

What is regression?

Simple version:

A method to fit a set of outcomes to a set of observations.

The outcome is commonly known as the y variable, dependent variable, response or regressand, while the observations are known as the x variables, independent variables, explanatory variable or regressors. Regression establishes a relationship to fit these outcomes to the observations.

Visual version

(from The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman)

hyperplane regression

Math version

(from Wikipedia on Linear Regression)

regression formula

Advantages and disadvantages of linear regression:

Advantages

Disadvantages

Every advantage above can be a disadvantage:

And of course:

correlation

Running a regression in R

This section will walk you through a simple example of running a regression in R. This example will focus on a topic of great intrigue, thought and importance: fantasy football.

Step 0: Identify the question, the data and the desired outcome

Before setting about running a regression in R, it is important to identify the question at hand, the data available and what your final outcome will be.

The question:

Can we predict a player's performance, as measured in fantasy points, given some information about their performance in the prior season?

The data:

We will use a very basic data set, but a common one, that includes player performance metrics on receptions, yards, touch downs and games played as well as information on where that player is most frequently being drafted for that season.

The outcome:

Ideally we will have a model, which will highlight which performance variables are of most importance in predicting next season's performance. We do not need a model that accurately predicts a single player's performance, but are more concerned with aggregate performance

Step 1: packages

Load your packages: most of R's functionality is due to its packages, and there are many for regression. The basic linear model is implemented in the base stats package but you might want to load a few more:

library(lmtest)  ## diagnostic checking in regressions
library(car)  ## hypothesis testing
library(sandwich)  ## for robust regressions
library(AER)  ## loads lots of data sets and some functions
library(dynlm)  ## nice package for some time series regressions
library(ggplot2)  ## for better graphics
library(plyr)  ## for running lots of regressions

Step 2: load the data, look at it, and transform if necessary

Now load some data! You might want to use your own data or use some of the excellent data sets in those packages you loaded. Take a look at the data too.

The data we are using is from http://www.pro-football-reference.com/ with the performance data from 2011 (our independent variables) and the fantasy points from 2012 (our dependent variable, or outcome) as well as Average Draft Position in 2012 from http://fantasyfootballcalculator.com/adp.php?year=2012. This data is compiled and available for educational purposes only here: https://www.dropbox.com/s/pbv9h3ld71c3ft1/fantasy_football.csv

Finally, we will restrict our model to data on wide receivers (WR) and those who started at least 4 games (25% of season)

ff <- read.csv(file = "fantasy_football.csv", header = TRUE)
str(ff)
## 'data.frame':    589 obs. of  24 variables:
##  $ Fpoints.2012 : int  344 346 340 222 276 220 143 323 145 62 ...
##  $ Name         : Factor w/ 587 levels "A.J. Feeley",..: 5 194 557 472 410 87 487 88 374 411 ...
##  $ ProBowl      : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ Team         : Factor w/ 34 levels "2TM","3TM","ARI",..: 14 21 22 5 13 13 22 7 26 17 ...
##  $ Traded       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age          : int  28 32 34 24 23 26 22 22 23 26 ...
##  $ GamesPlayed  : int  15 16 16 16 16 16 16 16 15 16 ...
##  $ GamesStarted : int  15 16 16 16 16 16 16 16 15 16 ...
##  $ Passing.Cmp  : int  343 468 401 1 421 0 0 310 0 0 ...
##  $ Passing.Att  : int  502 657 611 1 663 0 0 517 0 0 ...
##  $ Passing.Yds  : int  4643 5476 5235 1 5038 0 0 4051 0 0 ...
##  $ Passing.TD   : int  45 46 39 1 41 0 0 21 0 0 ...
##  $ Passing.Int  : int  6 14 12 0 16 0 0 17 0 0 ...
##  $ Rushing.Att  : int  60 21 43 291 22 1 1 126 273 343 ...
##  $ Rushing.Yds  : int  257 86 109 1364 78 11 2 706 1309 1606 ...
##  $ Rushing.Y.A  : num  4.28 4.1 2.53 4.69 3.55 11 2 5.6 4.79 4.68 ...
##  $ Rushing.TD   : int  3 1 3 12 0 0 1 14 17 8 ...
##  $ Receiving.Rec: int  0 0 0 76 0 96 90 1 48 43 ...
##  $ Receiving.Yds: int  0 0 0 704 0 1681 1327 27 315 374 ...
##  $ Receiving.Y.R: num  NA NA NA 9.26 NA ...
##  $ Receiving.TD : int  0 0 0 3 0 16 17 0 3 3 ...
##  $ Fantasy.Pos  : Factor w/ 4 levels "QB","RB","TE",..: 1 1 1 2 1 4 3 1 2 2 ...
##  $ DraftOverall : num  2.1 7.3 4.6 2.3 12 5.9 19.7 16.5 5.3 14.9 ...
##  $ NotDrafted   : int  0 0 0 0 0 0 0 0 0 0 ...
summary(ff)
##   Fpoints.2012                Name        ProBowl            Team    
##  Min.   : -2.0   Alex Smith     :  2   Min.   :0.0000   2TM    : 24  
##  1st Qu.:  0.0   Zach Miller    :  2   1st Qu.:0.0000   WAS    : 21  
##  Median : 12.0   A.J. Feeley    :  1   Median :0.0000   ARI    : 20  
##  Mean   : 45.3   A.J. Green     :  1   Mean   :0.0645   NYG    : 20  
##  3rd Qu.: 69.0   Aaron Brown    :  1   3rd Qu.:0.0000   OAK    : 20  
##  Max.   :346.0   Aaron Hernandez:  1   Max.   :1.0000   STL    : 20  
##                  (Other)        :581                    (Other):464  
##      Traded           Age        GamesPlayed    GamesStarted  
##  Min.   :0.000   Min.   :21.0   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.:0.000   1st Qu.:24.0   1st Qu.: 7.0   1st Qu.: 0.00  
##  Median :0.000   Median :26.0   Median :13.0   Median : 3.00  
##  Mean   :0.465   Mean   :26.4   Mean   :11.3   Mean   : 5.24  
##  3rd Qu.:1.000   3rd Qu.:28.0   3rd Qu.:16.0   3rd Qu.:10.00  
##  Max.   :1.000   Max.   :41.0   Max.   :16.0   Max.   :16.00  
##                                                               
##   Passing.Cmp     Passing.Att     Passing.Yds     Passing.TD   
##  Min.   :  0.0   Min.   :  0.0   Min.   :   0   Min.   : 0.00  
##  1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:   0   1st Qu.: 0.00  
##  Median :  0.0   Median :  0.0   Median :   0   Median : 0.00  
##  Mean   : 17.8   Mean   : 29.5   Mean   : 213   Mean   : 1.26  
##  3rd Qu.:  0.0   3rd Qu.:  0.0   3rd Qu.:   0   3rd Qu.: 0.00  
##  Max.   :468.0   Max.   :663.0   Max.   :5476   Max.   :46.00  
##                                                                
##   Passing.Int      Rushing.Att     Rushing.Yds    Rushing.Y.A   
##  Min.   : 0.000   Min.   :  0.0   Min.   :  -8   Min.   :-8.00  
##  1st Qu.: 0.000   1st Qu.:  0.0   1st Qu.:   0   1st Qu.: 2.60  
##  Median : 0.000   Median :  1.0   Median :   0   Median : 3.90  
##  Mean   : 0.859   Mean   : 23.7   Mean   : 102   Mean   : 4.13  
##  3rd Qu.: 0.000   3rd Qu.: 14.0   3rd Qu.:  56   3rd Qu.: 5.05  
##  Max.   :23.000   Max.   :343.0   Max.   :1606   Max.   :25.00  
##                                                  NA's   :282    
##    Rushing.TD     Receiving.Rec   Receiving.Yds  Receiving.Y.R  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  -8   Min.   :-8.00  
##  1st Qu.: 0.000   1st Qu.:  1.0   1st Qu.:   0   1st Qu.: 7.46  
##  Median : 0.000   Median :  7.0   Median :  62   Median :10.27  
##  Mean   : 0.676   Mean   : 17.7   Mean   : 212   Mean   :10.62  
##  3rd Qu.: 0.000   3rd Qu.: 27.0   3rd Qu.: 291   3rd Qu.:13.22  
##  Max.   :17.000   Max.   :122.0   Max.   :1681   Max.   :52.50  
##                                                  NA's   :147    
##   Receiving.TD   Fantasy.Pos  DraftOverall   NotDrafted   
##  Min.   : 0.00   QB: 78      Min.   :  2   Min.   :0.000  
##  1st Qu.: 0.00   RB:170      1st Qu.:200   1st Qu.:1.000  
##  Median : 0.00   TE:123      Median :200   Median :1.000  
##  Mean   : 1.26   WR:218      Mean   :172   Mean   :0.795  
##  3rd Qu.: 1.00               3rd Qu.:200   3rd Qu.:1.000  
##  Max.   :17.00               Max.   :200   Max.   :1.000  
## 
ff[1:4, ]
##   Fpoints.2012          Name ProBowl Team Traded Age GamesPlayed
## 1          344 Aaron Rodgers       1  GNB      0  28          15
## 2          346    Drew Brees       1  NOR      0  32          16
## 3          340     Tom Brady       1  NWE      0  34          16
## 4          222      Ray Rice       1  BAL      0  24          16
##   GamesStarted Passing.Cmp Passing.Att Passing.Yds Passing.TD Passing.Int
## 1           15         343         502        4643         45           6
## 2           16         468         657        5476         46          14
## 3           16         401         611        5235         39          12
## 4           16           1           1           1          1           0
##   Rushing.Att Rushing.Yds Rushing.Y.A Rushing.TD Receiving.Rec
## 1          60         257        4.28          3             0
## 2          21          86        4.10          1             0
## 3          43         109        2.53          3             0
## 4         291        1364        4.69         12            76
##   Receiving.Yds Receiving.Y.R Receiving.TD Fantasy.Pos DraftOverall
## 1             0            NA            0          QB          2.1
## 2             0            NA            0          QB          7.3
## 3             0            NA            0          QB          4.6
## 4           704          9.26            3          RB          2.3
##   NotDrafted
## 1          0
## 2          0
## 3          0
## 4          0
# restrict to only WR and GamesStarted at least 4
ff <- subset(ff, Fantasy.Pos == "WR" & GamesStarted >= 4)

Step 3: training and test set

For many problems you might want to split your data into a training and test test, but we will not for this example since we are looking simply for aggregate relationships.

It is important to partition your data into a training and test set if you want to evaluate the performance of your model in comparison to other models. This also helps reduce the risk of over fitting.

We won't run this, but it might be helpful to use in other applications

### not run
set.seed(1)
ff.length <- length(ff[, 1])
sample.rows <- sample(ff.length, 0.8 * ff.length, replace = FALSE)
ff.train <- ff[sample.rows, ]
ff.test <- ff[-sample.rows, ]

Step 4: setup a model–and run!

Which variable will be your outcome (dependent variable) and what will be your independent variables? How will you want to structure your model?

For this example, we'll pick Fpoints.2012 as the main dependent variable and see how a few simple performance variables like Receiving.Rec, Receiving.Yds, Receiving.TD and Age relate to Fpoints.2012

## call basic linear model with lm function and store to an object
ff.fit1 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
    Age, data = ff)

Step 5: now the work begins!

Interpret

You've gone through a lot of steps and now you have a model stored in object ff.fit1. This is a linear model object and the object is stored as a list.

You can now access the results of this model in a variety of ways. The model is stored as a list and has many different stored values including the original observations, the fitted values and the residuals. We can access these later.

One of the most common ways to inspect the model is to run a summary

summary(ff.fit1)
## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.52 -34.78  -3.11  33.39 143.74 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   105.9973    44.5212    2.38   0.0194 * 
## Receiving.Rec  -0.3333     0.6074   -0.55   0.5846   
## Receiving.Yds   0.1245     0.0451    2.76   0.0069 **
## Receiving.TD    1.8856     2.6044    0.72   0.4709   
## Age            -3.9261     1.6596   -2.37   0.0201 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.8 on 90 degrees of freedom
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.421 
## F-statistic: 18.1 on 4 and 90 DF,  p-value: 6.33e-11

If you have never encountered this type of output before you should be sure to check out some of the references below, but here's the 101:

anscome_quartet

Analyze

This model indicates that Age and Yards are important explanatory variables–but there's too much variability in Rec and TD to make any meaningful conclusions. This model would help me in choosing a player for the following year: I would prefer younger players with more yards and I would not be very concerned with the number of TDs they scored or how many catches they had.

Improve (or try to improve)

One of the advantages of regressions in R is that they are easy to change. Perhaps I would want to throw in another variable like, “ProBowl” to indicate if they made the Pro Bowl in 2011.

ff.fit2 <- update(ff.fit1, . ~ . + ProBowl)
summary(ff.fit2)
## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age + ProBowl, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.88 -34.38  -3.96  30.16 145.81 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   118.3801    45.2566    2.62    0.010 *
## Receiving.Rec  -0.2427     0.6084   -0.40    0.691  
## Receiving.Yds   0.1084     0.0464    2.34    0.022 *
## Receiving.TD    1.7996     2.5933    0.69    0.490  
## Age            -4.2371     1.6681   -2.54    0.013 *
## ProBowl        24.8501    18.3950    1.35    0.180  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.6 on 89 degrees of freedom
## Multiple R-squared:  0.457,  Adjusted R-squared:  0.426 
## F-statistic:   15 on 5 and 89 DF,  p-value: 1.22e-10

No big change here, although my my R2 improved on this model, I should check the AIC or BIC of both my models and pick the one with the lowest score. The AIC/BIC are information measures that offer a balance of model fit with model simplicity.

AIC(ff.fit1, ff.fit2)
##         df  AIC
## ff.fit1  6 1015
## ff.fit2  7 1015
BIC(ff.fit1, ff.fit2)
##         df  BIC
## ff.fit1  6 1031
## ff.fit2  7 1033

It appears that my second model worse overall as measured by the BIC, so I should prefer the first model.

I've got one final model to test, this one incorporates the average draft position in mock drafts for 2012. This data is readily available before the fantasy football season. Perhaps that is an important variable as well.

ff.fit3 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
    Age + DraftOverall, data = ff)
summary(ff.fit3)
## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age + DraftOverall, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84.82 -29.84  -0.49  28.24 107.15 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.2861    40.7866    5.23  1.1e-06 ***
## Receiving.Rec  -0.0646     0.5083   -0.13     0.90    
## Receiving.Yds   0.0309     0.0404    0.77     0.45    
## Receiving.TD   -1.8798     2.2513   -0.83     0.41    
## Age            -2.0419     1.4155   -1.44     0.15    
## DraftOverall   -0.6522     0.1026   -6.36  8.6e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.7 on 89 degrees of freedom
## Multiple R-squared:  0.619,  Adjusted R-squared:  0.597 
## F-statistic: 28.9 on 5 and 89 DF,  p-value: <2e-16
AIC(ff.fit1, ff.fit3)
##         df    AIC
## ff.fit1  6 1015.2
## ff.fit3  7  981.6

Wow, this model makes a big difference. My R2 has improved, as has my BIC, but also now all of my independent variables, except DraftOverall are now insignificant. This changes a lot for me, but is very helpful. It indicates that I should pay attention to the mock drafts before the season since seem to correspond very well to performance in 2012. The interpretation is, for every point lower draft value, the expected performance in 2012 falls by 0.65 points. This is a pretty big number.

Let's try one more model, much simpler this time that only uses DraftOverall

ff.fit4 <- lm(Fpoints.2012 ~ DraftOverall, data = ff)
summary(ff.fit4)
## 
## Call:
## lm(formula = Fpoints.2012 ~ DraftOverall, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -95.40 -31.92  -3.42  27.55 119.58 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   176.255      9.152    19.3   <2e-16 ***
## DraftOverall   -0.704      0.059   -11.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.5 on 93 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:   0.6 
## F-statistic:  142 on 1 and 93 DF,  p-value: <2e-16
AIC(ff.fit3, ff.fit4)
##         df   AIC
## ff.fit3  7 981.6
## ff.fit4  3 977.0

Inspect & evaluate

I have a simple model that I'm pretty happy with, however I should do some inspection before I consider this problem solved.

Actual vs Fitted

What does my model “look like”? It's important to evaluate how the model fits the predicted values to the actuals. If your model doesn't pass the basic smell test at this level you might need to reconsider your model construction, the data and even if regression is the right tool for your questions.

# model fit
ggplot(data = ff, aes(DraftOverall, Fpoints.2012)) + geom_point() + geom_smooth(method = lm) + 
    ggtitle("DraftOverall & Fantasy Points 2012")

plot of chunk actual_fitted


# actual vs fitted values
qplot(ff.fit4$fitted.values, ff.fit4$model$Fpoints.2012, main = "Fitted vs Actual") + 
    geom_smooth(method = lm, se = FALSE)

plot of chunk actual_fitted

All of this looks 'okay' to me. Not perfect because there's a lot of values where DraftOverall = 200 and my model isn't helping with that. But this is acceptable for my purposes. Let's move forward:

Residuals and Regression Diagnostics

Regressions have a number of important assumptions and my residuals are good indicators if I have violated any of these assumptions. Below is code for some of the areas to evaluate.

This part of model evaluation can be tricky, but it's worth spending some time learning these functions.

## standard views to evaluate residuals
par(mfrow = c(2, 2))
plot(ff.fit4)

plot of chunk evaluate_residuals


## what are some of my biggest errors from above?
ff[row.names(ff) %in% c("41", "134", "197"), 1:2]
##     Fpoints.2012          Name
## 41            61 Greg Jennings
## 134          155 Mike Williams
## 197            1 Austin Collie
# Jennings and Collie were injured in 2012 Mike Williams highly
# outperformed his draft position, 2011 was his second year and he
# regressed from his 2010 rookie season, so people probably weren't
# expecing much from him in 2012, but he performed very well

## do my residuals display heteroskedasticity? Not much.
qplot(ff.fit4$fitted.values, ff.fit4$residuals)

plot of chunk evaluate_residuals

# p values below 0.05 are worth investigating
bptest(ff.fit4)
## 
##  studentized Breusch-Pagan test
## 
## data:  ff.fit4
## BP = 0.3745, df = 1, p-value = 0.5406

## autocorrelation isn't an issue in this setup, but if so we could test
## with dwtest values that depart significantly from 2 are worth
## investigating
dwtest(ff.fit4)
## 
##  Durbin-Watson test
## 
## data:  ff.fit4
## DW = 2.08, p-value = 0.6256
## alternative hypothesis: true autocorrelation is greater than 0

## is multicollinearity an issue? no, our model only has one variable, but
## it is with our third model (VIF > 10)
vif(ff.fit3)
## Receiving.Rec Receiving.Yds  Receiving.TD           Age  DraftOverall 
##         7.093        11.279         2.911         1.064         2.996

Multicollinearity doesn't violate the assumptions of regression, but it can inflate standard errors and lead to some misinterpretation.

Beware!

A few caveats:

Conclusion

We now have a model that seems to work well and helps me answer my question: “Can I predict performance in one year, given performance from the year prior?” The answer seems to be, “Yes, but it's not great, use DraftOrder when in doubt, but there's probably more work to be done–especially on lower drafted players”

Resources and references

Textbooks I have used:

Websites:

Other resources:

R Markdown code available here: https://gist.github.com/cmiller01/5954938