This is a simple overview of regression using R. It is intended as a quick intro to regression in R or a refresher, and it is not a comprehensive review. The intended audience should have some basic understanding of R. See the end of this document for more resources.
Also, I am not a trained statistician, so I rely heavily on people smarter than myself. See the end of the document for a list of some of these smart people and sources.
Finally, since this is a simple overview of regression I be dealing with ordinary least squares regression, which is the most common form of regression (although not always the best) and not with time series data (which can be much more complicated).
A method to fit a set of outcomes to a set of observations.
The outcome is commonly known as the y variable, dependent variable, response or regressand, while the observations are known as the x variables, independent variables, explanatory variable or regressors. Regression establishes a relationship to fit these outcomes to the observations.
(from The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman)
(from Wikipedia on Linear Regression)

Every advantage above can be a disadvantage:

This section will walk you through a simple example of running a regression in R. This example will focus on a topic of great intrigue, thought and importance: fantasy football.
Before setting about running a regression in R, it is important to identify the question at hand, the data available and what your final outcome will be.
Can we predict a player's performance, as measured in fantasy points, given some information about their performance in the prior season?
We will use a very basic data set, but a common one, that includes player performance metrics on receptions, yards, touch downs and games played as well as information on where that player is most frequently being drafted for that season.
Ideally we will have a model, which will highlight which performance variables are of most importance in predicting next season's performance. We do not need a model that accurately predicts a single player's performance, but are more concerned with aggregate performance
Load your packages: most of R's functionality is due to its packages, and there are many for regression. The basic linear model is implemented in the base stats package but you might want to load a few more:
library(lmtest) ## diagnostic checking in regressions
library(car) ## hypothesis testing
library(sandwich) ## for robust regressions
library(AER) ## loads lots of data sets and some functions
library(dynlm) ## nice package for some time series regressions
library(ggplot2) ## for better graphics
library(plyr) ## for running lots of regressions
Now load some data! You might want to use your own data or use some of the excellent data sets in those packages you loaded. Take a look at the data too.
The data we are using is from http://www.pro-football-reference.com/ with the performance data from 2011 (our independent variables) and the fantasy points from 2012 (our dependent variable, or outcome) as well as Average Draft Position in 2012 from http://fantasyfootballcalculator.com/adp.php?year=2012. This data is compiled and available for educational purposes only here: https://www.dropbox.com/s/pbv9h3ld71c3ft1/fantasy_football.csv
Finally, we will restrict our model to data on wide receivers (WR) and those who started at least 4 games (25% of season)
ff <- read.csv(file = "fantasy_football.csv", header = TRUE)
str(ff)
## 'data.frame': 589 obs. of 24 variables:
## $ Fpoints.2012 : int 344 346 340 222 276 220 143 323 145 62 ...
## $ Name : Factor w/ 587 levels "A.J. Feeley",..: 5 194 557 472 410 87 487 88 374 411 ...
## $ ProBowl : int 1 1 1 1 0 1 1 1 1 1 ...
## $ Team : Factor w/ 34 levels "2TM","3TM","ARI",..: 14 21 22 5 13 13 22 7 26 17 ...
## $ Traded : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : int 28 32 34 24 23 26 22 22 23 26 ...
## $ GamesPlayed : int 15 16 16 16 16 16 16 16 15 16 ...
## $ GamesStarted : int 15 16 16 16 16 16 16 16 15 16 ...
## $ Passing.Cmp : int 343 468 401 1 421 0 0 310 0 0 ...
## $ Passing.Att : int 502 657 611 1 663 0 0 517 0 0 ...
## $ Passing.Yds : int 4643 5476 5235 1 5038 0 0 4051 0 0 ...
## $ Passing.TD : int 45 46 39 1 41 0 0 21 0 0 ...
## $ Passing.Int : int 6 14 12 0 16 0 0 17 0 0 ...
## $ Rushing.Att : int 60 21 43 291 22 1 1 126 273 343 ...
## $ Rushing.Yds : int 257 86 109 1364 78 11 2 706 1309 1606 ...
## $ Rushing.Y.A : num 4.28 4.1 2.53 4.69 3.55 11 2 5.6 4.79 4.68 ...
## $ Rushing.TD : int 3 1 3 12 0 0 1 14 17 8 ...
## $ Receiving.Rec: int 0 0 0 76 0 96 90 1 48 43 ...
## $ Receiving.Yds: int 0 0 0 704 0 1681 1327 27 315 374 ...
## $ Receiving.Y.R: num NA NA NA 9.26 NA ...
## $ Receiving.TD : int 0 0 0 3 0 16 17 0 3 3 ...
## $ Fantasy.Pos : Factor w/ 4 levels "QB","RB","TE",..: 1 1 1 2 1 4 3 1 2 2 ...
## $ DraftOverall : num 2.1 7.3 4.6 2.3 12 5.9 19.7 16.5 5.3 14.9 ...
## $ NotDrafted : int 0 0 0 0 0 0 0 0 0 0 ...
summary(ff)
## Fpoints.2012 Name ProBowl Team
## Min. : -2.0 Alex Smith : 2 Min. :0.0000 2TM : 24
## 1st Qu.: 0.0 Zach Miller : 2 1st Qu.:0.0000 WAS : 21
## Median : 12.0 A.J. Feeley : 1 Median :0.0000 ARI : 20
## Mean : 45.3 A.J. Green : 1 Mean :0.0645 NYG : 20
## 3rd Qu.: 69.0 Aaron Brown : 1 3rd Qu.:0.0000 OAK : 20
## Max. :346.0 Aaron Hernandez: 1 Max. :1.0000 STL : 20
## (Other) :581 (Other):464
## Traded Age GamesPlayed GamesStarted
## Min. :0.000 Min. :21.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:24.0 1st Qu.: 7.0 1st Qu.: 0.00
## Median :0.000 Median :26.0 Median :13.0 Median : 3.00
## Mean :0.465 Mean :26.4 Mean :11.3 Mean : 5.24
## 3rd Qu.:1.000 3rd Qu.:28.0 3rd Qu.:16.0 3rd Qu.:10.00
## Max. :1.000 Max. :41.0 Max. :16.0 Max. :16.00
##
## Passing.Cmp Passing.Att Passing.Yds Passing.TD
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.00
## Median : 0.0 Median : 0.0 Median : 0 Median : 0.00
## Mean : 17.8 Mean : 29.5 Mean : 213 Mean : 1.26
## 3rd Qu.: 0.0 3rd Qu.: 0.0 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :468.0 Max. :663.0 Max. :5476 Max. :46.00
##
## Passing.Int Rushing.Att Rushing.Yds Rushing.Y.A
## Min. : 0.000 Min. : 0.0 Min. : -8 Min. :-8.00
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 2.60
## Median : 0.000 Median : 1.0 Median : 0 Median : 3.90
## Mean : 0.859 Mean : 23.7 Mean : 102 Mean : 4.13
## 3rd Qu.: 0.000 3rd Qu.: 14.0 3rd Qu.: 56 3rd Qu.: 5.05
## Max. :23.000 Max. :343.0 Max. :1606 Max. :25.00
## NA's :282
## Rushing.TD Receiving.Rec Receiving.Yds Receiving.Y.R
## Min. : 0.000 Min. : 0.0 Min. : -8 Min. :-8.00
## 1st Qu.: 0.000 1st Qu.: 1.0 1st Qu.: 0 1st Qu.: 7.46
## Median : 0.000 Median : 7.0 Median : 62 Median :10.27
## Mean : 0.676 Mean : 17.7 Mean : 212 Mean :10.62
## 3rd Qu.: 0.000 3rd Qu.: 27.0 3rd Qu.: 291 3rd Qu.:13.22
## Max. :17.000 Max. :122.0 Max. :1681 Max. :52.50
## NA's :147
## Receiving.TD Fantasy.Pos DraftOverall NotDrafted
## Min. : 0.00 QB: 78 Min. : 2 Min. :0.000
## 1st Qu.: 0.00 RB:170 1st Qu.:200 1st Qu.:1.000
## Median : 0.00 TE:123 Median :200 Median :1.000
## Mean : 1.26 WR:218 Mean :172 Mean :0.795
## 3rd Qu.: 1.00 3rd Qu.:200 3rd Qu.:1.000
## Max. :17.00 Max. :200 Max. :1.000
##
ff[1:4, ]
## Fpoints.2012 Name ProBowl Team Traded Age GamesPlayed
## 1 344 Aaron Rodgers 1 GNB 0 28 15
## 2 346 Drew Brees 1 NOR 0 32 16
## 3 340 Tom Brady 1 NWE 0 34 16
## 4 222 Ray Rice 1 BAL 0 24 16
## GamesStarted Passing.Cmp Passing.Att Passing.Yds Passing.TD Passing.Int
## 1 15 343 502 4643 45 6
## 2 16 468 657 5476 46 14
## 3 16 401 611 5235 39 12
## 4 16 1 1 1 1 0
## Rushing.Att Rushing.Yds Rushing.Y.A Rushing.TD Receiving.Rec
## 1 60 257 4.28 3 0
## 2 21 86 4.10 1 0
## 3 43 109 2.53 3 0
## 4 291 1364 4.69 12 76
## Receiving.Yds Receiving.Y.R Receiving.TD Fantasy.Pos DraftOverall
## 1 0 NA 0 QB 2.1
## 2 0 NA 0 QB 7.3
## 3 0 NA 0 QB 4.6
## 4 704 9.26 3 RB 2.3
## NotDrafted
## 1 0
## 2 0
## 3 0
## 4 0
# restrict to only WR and GamesStarted at least 4
ff <- subset(ff, Fantasy.Pos == "WR" & GamesStarted >= 4)
For many problems you might want to split your data into a training and test test, but we will not for this example since we are looking simply for aggregate relationships.
It is important to partition your data into a training and test set if you want to evaluate the performance of your model in comparison to other models. This also helps reduce the risk of over fitting.
We won't run this, but it might be helpful to use in other applications
### not run
set.seed(1)
ff.length <- length(ff[, 1])
sample.rows <- sample(ff.length, 0.8 * ff.length, replace = FALSE)
ff.train <- ff[sample.rows, ]
ff.test <- ff[-sample.rows, ]
Which variable will be your outcome (dependent variable) and what will be your independent variables? How will you want to structure your model?
For this example, we'll pick Fpoints.2012 as the main dependent variable and see how a few simple performance variables like Receiving.Rec, Receiving.Yds, Receiving.TD and Age relate to Fpoints.2012
## call basic linear model with lm function and store to an object
ff.fit1 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD +
Age, data = ff)
You've gone through a lot of steps and now you have a model stored in object ff.fit1. This is a linear model object and the object is stored as a list.
You can now access the results of this model in a variety of ways. The model is stored as a list and has many different stored values including the original observations, the fitted values and the residuals. We can access these later.
One of the most common ways to inspect the model is to run a summary
summary(ff.fit1)
##
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD +
## Age, data = ff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.52 -34.78 -3.11 33.39 143.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.9973 44.5212 2.38 0.0194 *
## Receiving.Rec -0.3333 0.6074 -0.55 0.5846
## Receiving.Yds 0.1245 0.0451 2.76 0.0069 **
## Receiving.TD 1.8856 2.6044 0.72 0.4709
## Age -3.9261 1.6596 -2.37 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.8 on 90 degrees of freedom
## Multiple R-squared: 0.445, Adjusted R-squared: 0.421
## F-statistic: 18.1 on 4 and 90 DF, p-value: 6.33e-11
If you have never encountered this type of output before you should be sure to check out some of the references below, but here's the 101:
Call: This section describes what model you are summarizingResiduals: Residuals are the difference between the fitted outcome (prediction of FPoints.2012) and the actual value (FPoints.2012). This dialog tells you how far your model missed in the worst case (Min and Max), but also in the heart of the data (1Q, Median and 3Q).
Coefficients: this is heart of the summary. We will discuss each column from left to right:
(Intercept) is added by default, this is essentially a starting point for the model and is usually important to have, but doesn't usually offer much explanatory power.Estimate is the coefficient of your independent variable. This is the value that describes the average relationship between independent and dependent variable. For instance, the coefficient of 106.99725 for (Intercept) indicates that the model each player starts with a base of roughly 107 points. This is more interesting for variables that you selected. For instance, for Age the value of -3.92610 indicates that as a player moves from 23 to 24, their performance will on average fall about 3.9 fantasy points the following season.Std. Error is an estimate of the standard deviation of the coefficient value. More plainly, it's how much variation there is in the coefficient. More variation indicates that the variable you selected has less explanatory power overall.t value is simply the Estimate divided by the Std. Error, this matters because it starts to get into the “significance” of our variables. Look for absolute values around 2 or greater. These are good signs that the variable you selected is significant.Pr(>|t|) or p-value is the column that does all the dirty work for us (again, see the refs). Look here for values less than 0.05 or lower, preferably less than 0.01. This indicates the “significance” of the coefficient on our independent variable. Significance is the probability of a Type I error, which is a rejection of the null hypothesis, when it is true. More colloquially it is “the probability that the value would occur by chance”Multiple R-squared and Adjusted R-squared or the correlation coefficients are useful but frequently misused values of the “fit” of the model, where a 0 indicates that the model explains none of the variance and 1 indicates the model explains 100% of the variance. Although it is easy to explain and interpret, this can be very misleading figure and it should not be relied upon. Instead, use AIC or BIC when choosing between models. The following models all have the same R2, but they do not all tell the same story.F-statistic is a measure of the joint significance of your regression. The p-value here should usually be less than 0.01This model indicates that Age and Yards are important explanatory variables–but there's too much variability in Rec and TD to make any meaningful conclusions. This model would help me in choosing a player for the following year: I would prefer younger players with more yards and I would not be very concerned with the number of TDs they scored or how many catches they had.
One of the advantages of regressions in R is that they are easy to change. Perhaps I would want to throw in another variable like, “ProBowl” to indicate if they made the Pro Bowl in 2011.
ff.fit2 <- update(ff.fit1, . ~ . + ProBowl)
summary(ff.fit2)
##
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD +
## Age + ProBowl, data = ff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.88 -34.38 -3.96 30.16 145.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.3801 45.2566 2.62 0.010 *
## Receiving.Rec -0.2427 0.6084 -0.40 0.691
## Receiving.Yds 0.1084 0.0464 2.34 0.022 *
## Receiving.TD 1.7996 2.5933 0.69 0.490
## Age -4.2371 1.6681 -2.54 0.013 *
## ProBowl 24.8501 18.3950 1.35 0.180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.6 on 89 degrees of freedom
## Multiple R-squared: 0.457, Adjusted R-squared: 0.426
## F-statistic: 15 on 5 and 89 DF, p-value: 1.22e-10
No big change here, although my my R2 improved on this model, I should check the AIC or BIC of both my models and pick the one with the lowest score. The AIC/BIC are information measures that offer a balance of model fit with model simplicity.
AIC(ff.fit1, ff.fit2)
## df AIC
## ff.fit1 6 1015
## ff.fit2 7 1015
BIC(ff.fit1, ff.fit2)
## df BIC
## ff.fit1 6 1031
## ff.fit2 7 1033
It appears that my second model worse overall as measured by the BIC, so I should prefer the first model.
I've got one final model to test, this one incorporates the average draft position in mock drafts for 2012. This data is readily available before the fantasy football season. Perhaps that is an important variable as well.
ff.fit3 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD +
Age + DraftOverall, data = ff)
summary(ff.fit3)
##
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD +
## Age + DraftOverall, data = ff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.82 -29.84 -0.49 28.24 107.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.2861 40.7866 5.23 1.1e-06 ***
## Receiving.Rec -0.0646 0.5083 -0.13 0.90
## Receiving.Yds 0.0309 0.0404 0.77 0.45
## Receiving.TD -1.8798 2.2513 -0.83 0.41
## Age -2.0419 1.4155 -1.44 0.15
## DraftOverall -0.6522 0.1026 -6.36 8.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.7 on 89 degrees of freedom
## Multiple R-squared: 0.619, Adjusted R-squared: 0.597
## F-statistic: 28.9 on 5 and 89 DF, p-value: <2e-16
AIC(ff.fit1, ff.fit3)
## df AIC
## ff.fit1 6 1015.2
## ff.fit3 7 981.6
Wow, this model makes a big difference. My R2 has improved, as has my BIC, but also now all of my independent variables, except DraftOverall are now insignificant. This changes a lot for me, but is very helpful. It indicates that I should pay attention to the mock drafts before the season since seem to correspond very well to performance in 2012. The interpretation is, for every point lower draft value, the expected performance in 2012 falls by 0.65 points. This is a pretty big number.
Let's try one more model, much simpler this time that only uses DraftOverall
ff.fit4 <- lm(Fpoints.2012 ~ DraftOverall, data = ff)
summary(ff.fit4)
##
## Call:
## lm(formula = Fpoints.2012 ~ DraftOverall, data = ff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.40 -31.92 -3.42 27.55 119.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.255 9.152 19.3 <2e-16 ***
## DraftOverall -0.704 0.059 -11.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.5 on 93 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.6
## F-statistic: 142 on 1 and 93 DF, p-value: <2e-16
AIC(ff.fit3, ff.fit4)
## df AIC
## ff.fit3 7 981.6
## ff.fit4 3 977.0
I have a simple model that I'm pretty happy with, however I should do some inspection before I consider this problem solved.
Actual vs Fitted
What does my model “look like”? It's important to evaluate how the model fits the predicted values to the actuals. If your model doesn't pass the basic smell test at this level you might need to reconsider your model construction, the data and even if regression is the right tool for your questions.
# model fit
ggplot(data = ff, aes(DraftOverall, Fpoints.2012)) + geom_point() + geom_smooth(method = lm) +
ggtitle("DraftOverall & Fantasy Points 2012")
# actual vs fitted values
qplot(ff.fit4$fitted.values, ff.fit4$model$Fpoints.2012, main = "Fitted vs Actual") +
geom_smooth(method = lm, se = FALSE)
All of this looks 'okay' to me. Not perfect because there's a lot of values where DraftOverall = 200 and my model isn't helping with that. But this is acceptable for my purposes. Let's move forward:
Residuals and Regression Diagnostics
Regressions have a number of important assumptions and my residuals are good indicators if I have violated any of these assumptions. Below is code for some of the areas to evaluate.
This part of model evaluation can be tricky, but it's worth spending some time learning these functions.
## standard views to evaluate residuals
par(mfrow = c(2, 2))
plot(ff.fit4)
## what are some of my biggest errors from above?
ff[row.names(ff) %in% c("41", "134", "197"), 1:2]
## Fpoints.2012 Name
## 41 61 Greg Jennings
## 134 155 Mike Williams
## 197 1 Austin Collie
# Jennings and Collie were injured in 2012 Mike Williams highly
# outperformed his draft position, 2011 was his second year and he
# regressed from his 2010 rookie season, so people probably weren't
# expecing much from him in 2012, but he performed very well
## do my residuals display heteroskedasticity? Not much.
qplot(ff.fit4$fitted.values, ff.fit4$residuals)
# p values below 0.05 are worth investigating
bptest(ff.fit4)
##
## studentized Breusch-Pagan test
##
## data: ff.fit4
## BP = 0.3745, df = 1, p-value = 0.5406
## autocorrelation isn't an issue in this setup, but if so we could test
## with dwtest values that depart significantly from 2 are worth
## investigating
dwtest(ff.fit4)
##
## Durbin-Watson test
##
## data: ff.fit4
## DW = 2.08, p-value = 0.6256
## alternative hypothesis: true autocorrelation is greater than 0
## is multicollinearity an issue? no, our model only has one variable, but
## it is with our third model (VIF > 10)
vif(ff.fit3)
## Receiving.Rec Receiving.Yds Receiving.TD Age DraftOverall
## 7.093 11.279 2.911 1.064 2.996
Multicollinearity doesn't violate the assumptions of regression, but it can inflate standard errors and lead to some misinterpretation.
Beware!
A few caveats:
We now have a model that seems to work well and helps me answer my question: “Can I predict performance in one year, given performance from the year prior?” The answer seems to be, “Yes, but it's not great, use DraftOrder when in doubt, but there's probably more work to be done–especially on lower drafted players”
Textbooks I have used:
Websites:
Other resources:
lmtest, stats and knitrR Markdown code available here: https://gist.github.com/cmiller01/5954938