Regression in R

This is a simple overview of regression using R. It is intended as a quick intro to regression in R or a refresher, and it is not a comprehensive review. The intended audience should have some basic understanding of R. See the end of this document for more resources.

Also, I am not a trained statistician, so I rely heavily on people smarter than myself. See the end of the document for a list of some of these smart people and sources.

Finally, since this is a simple overview of regression I be dealing with ordinary least squares regression, which is the most common form of regression (although not always the best) and not with time series data (which can be much more complicated).

What is regression?

Simple version:

A method to fit a set of outcomes to a set of observations.

The outcome is commonly known as the y variable, dependent variable, response or regressand, while the observations are known as the x variables, independent variables, explanatory variable or regressors. Regression establishes a relationship to fit these outcomes to the observations.

Visual version

(from The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman)

hyperplane regression

Math version

(from Wikipedia on Linear Regression)

$regression formula$

Advantages and disadvantages of linear regression:

Advantages

A well structured model creates understandable and communicable results
It is common in academics and business and has a big body of knowledge to draw upon
It usually works well

Disadvantages

Every advantage above can be a disadvantage:

A poorly structured model will give results that are easy to misinterpret and may be misleading, also regression requires specifying a structure (in contrast to some other approaches)
The large body of research can be confusing and there can be different ways to approach the same problem
Sometimes people can be misled by thinking regression works too well and there are times when regression is not the right approach

And of course:

correlation

Running a regression in R

This section will walk you through a simple example of running a regression in R. This example will focus on a topic of great intrigue, thought and importance: fantasy football.

Step 0: Identify the question, the data and the desired outcome

Before setting about running a regression in R, it is important to identify the question at hand, the data available and what your final outcome will be.

The question:

Can we predict a player's performance, as measured in fantasy points, given some information about their performance in the prior season?

The data:

We will use a very basic data set, but a common one, that includes player performance metrics on receptions, yards, touch downs and games played as well as information on where that player is most frequently being drafted for that season.

The outcome:

Ideally we will have a model, which will highlight which performance variables are of most importance in predicting next season's performance. We do not need a model that accurately predicts a single player's performance, but are more concerned with aggregate performance

Step 1: packages

Load your packages: most of R's functionality is due to its packages, and there are many for regression. The basic linear model is implemented in the base stats package but you might want to load a few more:

library(lmtest)  ## diagnostic checking in regressions
library(car)  ## hypothesis testing
library(sandwich)  ## for robust regressions
library(AER)  ## loads lots of data sets and some functions
library(dynlm)  ## nice package for some time series regressions
library(ggplot2)  ## for better graphics
library(plyr)  ## for running lots of regressions

Step 2: load the data, look at it, and transform if necessary

Now load some data! You might want to use your own data or use some of the excellent data sets in those packages you loaded. Take a look at the data too.

The data we are using is from http://www.pro-football-reference.com/ with the performance data from 2011 (our independent variables) and the fantasy points from 2012 (our dependent variable, or outcome) as well as Average Draft Position in 2012 from http://fantasyfootballcalculator.com/adp.php?year=2012. This data is compiled and available for educational purposes only here: https://www.dropbox.com/s/pbv9h3ld71c3ft1/fantasy_football.csv

Finally, we will restrict our model to data on wide receivers (WR) and those who started at least 4 games (25% of season)

ff <- read.csv(file = "fantasy_football.csv", header = TRUE)
str(ff)

## 'data.frame':    589 obs. of  24 variables:
##  $ Fpoints.2012 : int  344 346 340 222 276 220 143 323 145 62 ...
##  $ Name         : Factor w/ 587 levels "A.J. Feeley",..: 5 194 557 472 410 87 487 88 374 411 ...
##  $ ProBowl      : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ Team         : Factor w/ 34 levels "2TM","3TM","ARI",..: 14 21 22 5 13 13 22 7 26 17 ...
##  $ Traded       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age          : int  28 32 34 24 23 26 22 22 23 26 ...
##  $ GamesPlayed  : int  15 16 16 16 16 16 16 16 15 16 ...
##  $ GamesStarted : int  15 16 16 16 16 16 16 16 15 16 ...
##  $ Passing.Cmp  : int  343 468 401 1 421 0 0 310 0 0 ...
##  $ Passing.Att  : int  502 657 611 1 663 0 0 517 0 0 ...
##  $ Passing.Yds  : int  4643 5476 5235 1 5038 0 0 4051 0 0 ...
##  $ Passing.TD   : int  45 46 39 1 41 0 0 21 0 0 ...
##  $ Passing.Int  : int  6 14 12 0 16 0 0 17 0 0 ...
##  $ Rushing.Att  : int  60 21 43 291 22 1 1 126 273 343 ...
##  $ Rushing.Yds  : int  257 86 109 1364 78 11 2 706 1309 1606 ...
##  $ Rushing.Y.A  : num  4.28 4.1 2.53 4.69 3.55 11 2 5.6 4.79 4.68 ...
##  $ Rushing.TD   : int  3 1 3 12 0 0 1 14 17 8 ...
##  $ Receiving.Rec: int  0 0 0 76 0 96 90 1 48 43 ...
##  $ Receiving.Yds: int  0 0 0 704 0 1681 1327 27 315 374 ...
##  $ Receiving.Y.R: num  NA NA NA 9.26 NA ...
##  $ Receiving.TD : int  0 0 0 3 0 16 17 0 3 3 ...
##  $ Fantasy.Pos  : Factor w/ 4 levels "QB","RB","TE",..: 1 1 1 2 1 4 3 1 2 2 ...
##  $ DraftOverall : num  2.1 7.3 4.6 2.3 12 5.9 19.7 16.5 5.3 14.9 ...
##  $ NotDrafted   : int  0 0 0 0 0 0 0 0 0 0 ...

summary(ff)

##   Fpoints.2012                Name        ProBowl            Team    
##  Min.   : -2.0   Alex Smith     :  2   Min.   :0.0000   2TM    : 24  
##  1st Qu.:  0.0   Zach Miller    :  2   1st Qu.:0.0000   WAS    : 21  
##  Median : 12.0   A.J. Feeley    :  1   Median :0.0000   ARI    : 20  
##  Mean   : 45.3   A.J. Green     :  1   Mean   :0.0645   NYG    : 20  
##  3rd Qu.: 69.0   Aaron Brown    :  1   3rd Qu.:0.0000   OAK    : 20  
##  Max.   :346.0   Aaron Hernandez:  1   Max.   :1.0000   STL    : 20  
##                  (Other)        :581                    (Other):464  
##      Traded           Age        GamesPlayed    GamesStarted  
##  Min.   :0.000   Min.   :21.0   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.:0.000   1st Qu.:24.0   1st Qu.: 7.0   1st Qu.: 0.00  
##  Median :0.000   Median :26.0   Median :13.0   Median : 3.00  
##  Mean   :0.465   Mean   :26.4   Mean   :11.3   Mean   : 5.24  
##  3rd Qu.:1.000   3rd Qu.:28.0   3rd Qu.:16.0   3rd Qu.:10.00  
##  Max.   :1.000   Max.   :41.0   Max.   :16.0   Max.   :16.00  
##                                                               
##   Passing.Cmp     Passing.Att     Passing.Yds     Passing.TD   
##  Min.   :  0.0   Min.   :  0.0   Min.   :   0   Min.   : 0.00  
##  1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:   0   1st Qu.: 0.00  
##  Median :  0.0   Median :  0.0   Median :   0   Median : 0.00  
##  Mean   : 17.8   Mean   : 29.5   Mean   : 213   Mean   : 1.26  
##  3rd Qu.:  0.0   3rd Qu.:  0.0   3rd Qu.:   0   3rd Qu.: 0.00  
##  Max.   :468.0   Max.   :663.0   Max.   :5476   Max.   :46.00  
##                                                                
##   Passing.Int      Rushing.Att     Rushing.Yds    Rushing.Y.A   
##  Min.   : 0.000   Min.   :  0.0   Min.   :  -8   Min.   :-8.00  
##  1st Qu.: 0.000   1st Qu.:  0.0   1st Qu.:   0   1st Qu.: 2.60  
##  Median : 0.000   Median :  1.0   Median :   0   Median : 3.90  
##  Mean   : 0.859   Mean   : 23.7   Mean   : 102   Mean   : 4.13  
##  3rd Qu.: 0.000   3rd Qu.: 14.0   3rd Qu.:  56   3rd Qu.: 5.05  
##  Max.   :23.000   Max.   :343.0   Max.   :1606   Max.   :25.00  
##                                                  NA's   :282    
##    Rushing.TD     Receiving.Rec   Receiving.Yds  Receiving.Y.R  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  -8   Min.   :-8.00  
##  1st Qu.: 0.000   1st Qu.:  1.0   1st Qu.:   0   1st Qu.: 7.46  
##  Median : 0.000   Median :  7.0   Median :  62   Median :10.27  
##  Mean   : 0.676   Mean   : 17.7   Mean   : 212   Mean   :10.62  
##  3rd Qu.: 0.000   3rd Qu.: 27.0   3rd Qu.: 291   3rd Qu.:13.22  
##  Max.   :17.000   Max.   :122.0   Max.   :1681   Max.   :52.50  
##                                                  NA's   :147    
##   Receiving.TD   Fantasy.Pos  DraftOverall   NotDrafted   
##  Min.   : 0.00   QB: 78      Min.   :  2   Min.   :0.000  
##  1st Qu.: 0.00   RB:170      1st Qu.:200   1st Qu.:1.000  
##  Median : 0.00   TE:123      Median :200   Median :1.000  
##  Mean   : 1.26   WR:218      Mean   :172   Mean   :0.795  
##  3rd Qu.: 1.00               3rd Qu.:200   3rd Qu.:1.000  
##  Max.   :17.00               Max.   :200   Max.   :1.000  
##

ff[1:4, ]

##   Fpoints.2012          Name ProBowl Team Traded Age GamesPlayed
## 1          344 Aaron Rodgers       1  GNB      0  28          15
## 2          346    Drew Brees       1  NOR      0  32          16
## 3          340     Tom Brady       1  NWE      0  34          16
## 4          222      Ray Rice       1  BAL      0  24          16
##   GamesStarted Passing.Cmp Passing.Att Passing.Yds Passing.TD Passing.Int
## 1           15         343         502        4643         45           6
## 2           16         468         657        5476         46          14
## 3           16         401         611        5235         39          12
## 4           16           1           1           1          1           0
##   Rushing.Att Rushing.Yds Rushing.Y.A Rushing.TD Receiving.Rec
## 1          60         257        4.28          3             0
## 2          21          86        4.10          1             0
## 3          43         109        2.53          3             0
## 4         291        1364        4.69         12            76
##   Receiving.Yds Receiving.Y.R Receiving.TD Fantasy.Pos DraftOverall
## 1             0            NA            0          QB          2.1
## 2             0            NA            0          QB          7.3
## 3             0            NA            0          QB          4.6
## 4           704          9.26            3          RB          2.3
##   NotDrafted
## 1          0
## 2          0
## 3          0
## 4          0

# restrict to only WR and GamesStarted at least 4
ff <- subset(ff, Fantasy.Pos == "WR" & GamesStarted >= 4)

Step 3: training and test set

For many problems you might want to split your data into a training and test test, but we will not for this example since we are looking simply for aggregate relationships.

It is important to partition your data into a training and test set if you want to evaluate the performance of your model in comparison to other models. This also helps reduce the risk of over fitting.

We won't run this, but it might be helpful to use in other applications

### not run
set.seed(1)
ff.length <- length(ff[, 1])
sample.rows <- sample(ff.length, 0.8 * ff.length, replace = FALSE)
ff.train <- ff[sample.rows, ]
ff.test <- ff[-sample.rows, ]

Step 4: setup a model–and run!

Which variable will be your outcome (dependent variable) and what will be your independent variables? How will you want to structure your model?

For this example, we'll pick Fpoints.2012 as the main dependent variable and see how a few simple performance variables like Receiving.Rec, Receiving.Yds, Receiving.TD and Age relate to Fpoints.2012

## call basic linear model with lm function and store to an object
ff.fit1 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
    Age, data = ff)

Step 5: now the work begins!

Interpret

You've gone through a lot of steps and now you have a model stored in object ff.fit1. This is a linear model object and the object is stored as a list.

You can now access the results of this model in a variety of ways. The model is stored as a list and has many different stored values including the original observations, the fitted values and the residuals. We can access these later.

One of the most common ways to inspect the model is to run a summary

summary(ff.fit1)

## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.52 -34.78  -3.11  33.39 143.74 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   105.9973    44.5212    2.38   0.0194 * 
## Receiving.Rec  -0.3333     0.6074   -0.55   0.5846   
## Receiving.Yds   0.1245     0.0451    2.76   0.0069 **
## Receiving.TD    1.8856     2.6044    0.72   0.4709   
## Age            -3.9261     1.6596   -2.37   0.0201 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.8 on 90 degrees of freedom
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.421 
## F-statistic: 18.1 on 4 and 90 DF,  p-value: 6.33e-11

If you have never encountered this type of output before you should be sure to check out some of the references below, but here's the 101:

Call: This section describes what model you are summarizing
Residuals: Residuals are the difference between the fitted outcome (prediction of FPoints.2012) and the actual value (FPoints.2012). This dialog tells you how far your model missed in the worst case (Min and Max), but also in the heart of the data (1Q, Median and 3Q).
- In our case, we didn't do great, some of the residuals are around 100 points and the total Max value in the data is only 220. However, 50% of the residuals are within 50 points of 0, so there's some hope that this is a viable model
Coefficients: this is heart of the summary. We will discuss each column from left to right:
- Column 1: This is a list of the independent variables used in the model. The (Intercept) is added by default, this is essentially a starting point for the model and is usually important to have, but doesn't usually offer much explanatory power.
- Column 2: Estimate is the coefficient of your independent variable. This is the value that describes the average relationship between independent and dependent variable. For instance, the coefficient of 106.99725 for (Intercept) indicates that the model each player starts with a base of roughly 107 points. This is more interesting for variables that you selected. For instance, for Age the value of -3.92610 indicates that as a player moves from 23 to 24, their performance will on average fall about 3.9 fantasy points the following season.
- Column 3: Std. Error is an estimate of the standard deviation of the coefficient value. More plainly, it's how much variation there is in the coefficient. More variation indicates that the variable you selected has less explanatory power overall.
- Column 4: t value is simply the Estimate divided by the Std. Error, this matters because it starts to get into the “significance” of our variables. Look for absolute values around 2 or greater. These are good signs that the variable you selected is significant.
- Column 5: Pr(>|t|) or p-value is the column that does all the dirty work for us (again, see the refs). Look here for values less than 0.05 or lower, preferably less than 0.01. This indicates the “significance” of the coefficient on our independent variable. Significance is the probability of a Type I error, which is a rejection of the null hypothesis, when it is true. More colloquially it is “the probability that the value would occur by chance”
Multiple R-squared and Adjusted R-squared or the correlation coefficients are useful but frequently misused values of the “fit” of the model, where a 0 indicates that the model explains none of the variance and 1 indicates the model explains 100% of the variance. Although it is easy to explain and interpret, this can be very misleading figure and it should not be relied upon. Instead, use AIC or BIC when choosing between models. The following models all have the same R^2, but they do not all tell the same story.

anscome_quartet

F-statistic is a measure of the joint significance of your regression. The p-value here should usually be less than 0.01

Analyze

This model indicates that Age and Yards are important explanatory variables–but there's too much variability in Rec and TD to make any meaningful conclusions. This model would help me in choosing a player for the following year: I would prefer younger players with more yards and I would not be very concerned with the number of TDs they scored or how many catches they had.

Improve (or try to improve)

One of the advantages of regressions in R is that they are easy to change. Perhaps I would want to throw in another variable like, “ProBowl” to indicate if they made the Pro Bowl in 2011.

ff.fit2 <- update(ff.fit1, . ~ . + ProBowl)
summary(ff.fit2)

## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age + ProBowl, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.88 -34.38  -3.96  30.16 145.81 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   118.3801    45.2566    2.62    0.010 *
## Receiving.Rec  -0.2427     0.6084   -0.40    0.691  
## Receiving.Yds   0.1084     0.0464    2.34    0.022 *
## Receiving.TD    1.7996     2.5933    0.69    0.490  
## Age            -4.2371     1.6681   -2.54    0.013 *
## ProBowl        24.8501    18.3950    1.35    0.180  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.6 on 89 degrees of freedom
## Multiple R-squared:  0.457,  Adjusted R-squared:  0.426 
## F-statistic:   15 on 5 and 89 DF,  p-value: 1.22e-10

No big change here, although my my R² improved on this model, I should check the AIC or BIC of both my models and pick the one with the lowest score. The AIC/BIC are information measures that offer a balance of model fit with model simplicity.

AIC(ff.fit1, ff.fit2)

##         df  AIC
## ff.fit1  6 1015
## ff.fit2  7 1015

BIC(ff.fit1, ff.fit2)

##         df  BIC
## ff.fit1  6 1031
## ff.fit2  7 1033

It appears that my second model worse overall as measured by the BIC, so I should prefer the first model.

I've got one final model to test, this one incorporates the average draft position in mock drafts for 2012. This data is readily available before the fantasy football season. Perhaps that is an important variable as well.

ff.fit3 <- lm(Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
    Age + DraftOverall, data = ff)
summary(ff.fit3)

## 
## Call:
## lm(formula = Fpoints.2012 ~ Receiving.Rec + Receiving.Yds + Receiving.TD + 
##     Age + DraftOverall, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84.82 -29.84  -0.49  28.24 107.15 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.2861    40.7866    5.23  1.1e-06 ***
## Receiving.Rec  -0.0646     0.5083   -0.13     0.90    
## Receiving.Yds   0.0309     0.0404    0.77     0.45    
## Receiving.TD   -1.8798     2.2513   -0.83     0.41    
## Age            -2.0419     1.4155   -1.44     0.15    
## DraftOverall   -0.6522     0.1026   -6.36  8.6e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.7 on 89 degrees of freedom
## Multiple R-squared:  0.619,  Adjusted R-squared:  0.597 
## F-statistic: 28.9 on 5 and 89 DF,  p-value: <2e-16

AIC(ff.fit1, ff.fit3)

##         df    AIC
## ff.fit1  6 1015.2
## ff.fit3  7  981.6

Wow, this model makes a big difference. My R² has improved, as has my BIC, but also now all of my independent variables, except DraftOverall are now insignificant. This changes a lot for me, but is very helpful. It indicates that I should pay attention to the mock drafts before the season since seem to correspond very well to performance in 2012. The interpretation is, for every point lower draft value, the expected performance in 2012 falls by 0.65 points. This is a pretty big number.

Let's try one more model, much simpler this time that only uses DraftOverall

ff.fit4 <- lm(Fpoints.2012 ~ DraftOverall, data = ff)
summary(ff.fit4)

## 
## Call:
## lm(formula = Fpoints.2012 ~ DraftOverall, data = ff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -95.40 -31.92  -3.42  27.55 119.58 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   176.255      9.152    19.3   <2e-16 ***
## DraftOverall   -0.704      0.059   -11.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.5 on 93 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:   0.6 
## F-statistic:  142 on 1 and 93 DF,  p-value: <2e-16

AIC(ff.fit3, ff.fit4)

##         df   AIC
## ff.fit3  7 981.6
## ff.fit4  3 977.0

Inspect & evaluate

I have a simple model that I'm pretty happy with, however I should do some inspection before I consider this problem solved.

Actual vs Fitted

What does my model “look like”? It's important to evaluate how the model fits the predicted values to the actuals. If your model doesn't pass the basic smell test at this level you might need to reconsider your model construction, the data and even if regression is the right tool for your questions.

# model fit
ggplot(data = ff, aes(DraftOverall, Fpoints.2012)) + geom_point() + geom_smooth(method = lm) + 
    ggtitle("DraftOverall & Fantasy Points 2012")

plot of chunk actual_fitted


# actual vs fitted values
qplot(ff.fit4$fitted.values, ff.fit4$model$Fpoints.2012, main = "Fitted vs Actual") + 
    geom_smooth(method = lm, se = FALSE)

plot of chunk actual_fitted

All of this looks 'okay' to me. Not perfect because there's a lot of values where DraftOverall = 200 and my model isn't helping with that. But this is acceptable for my purposes. Let's move forward:

Residuals and Regression Diagnostics

Regressions have a number of important assumptions and my residuals are good indicators if I have violated any of these assumptions. Below is code for some of the areas to evaluate.

This part of model evaluation can be tricky, but it's worth spending some time learning these functions.

## standard views to evaluate residuals
par(mfrow = c(2, 2))
plot(ff.fit4)

plot of chunk evaluate_residuals


## what are some of my biggest errors from above?
ff[row.names(ff) %in% c("41", "134", "197"), 1:2]

##     Fpoints.2012          Name
## 41            61 Greg Jennings
## 134          155 Mike Williams
## 197            1 Austin Collie

# Jennings and Collie were injured in 2012 Mike Williams highly
# outperformed his draft position, 2011 was his second year and he
# regressed from his 2010 rookie season, so people probably weren't
# expecing much from him in 2012, but he performed very well

## do my residuals display heteroskedasticity? Not much.
qplot(ff.fit4$fitted.values, ff.fit4$residuals)

plot of chunk evaluate_residuals

# p values below 0.05 are worth investigating
bptest(ff.fit4)

## 
##  studentized Breusch-Pagan test
## 
## data:  ff.fit4
## BP = 0.3745, df = 1, p-value = 0.5406


## autocorrelation isn't an issue in this setup, but if so we could test
## with dwtest values that depart significantly from 2 are worth
## investigating
dwtest(ff.fit4)

## 
##  Durbin-Watson test
## 
## data:  ff.fit4
## DW = 2.08, p-value = 0.6256
## alternative hypothesis: true autocorrelation is greater than 0


## is multicollinearity an issue? no, our model only has one variable, but
## it is with our third model (VIF > 10)
vif(ff.fit3)

## Receiving.Rec Receiving.Yds  Receiving.TD           Age  DraftOverall 
##         7.093        11.279         2.911         1.064         2.996

Multicollinearity doesn't violate the assumptions of regression, but it can inflate standard errors and lead to some misinterpretation.

Beware!

A few caveats:

If your model exhibits heteroskedasticity, autocorrelation or multicollinearity you need to do more work! (see refs)
Beware 'spurious regressions' where two numbers seem to be related, but there is no real relationship: e.g. Australian wine sales & Number of passengers on flights from SFO to NYC (both go up–but this isn't a relationship)
In selecting your model do not run stepwise through the variables and find one regression that works–this is bad science (thanks again to XKCD) http://xkcd.com/882/

Conclusion

We now have a model that seems to work well and helps me answer my question: “Can I predict performance in one year, given performance from the year prior?” The answer seems to be, “Yes, but it's not great, use DraftOrder when in doubt, but there's probably more work to be done–especially on lower drafted players”

Resources and references

Textbooks I have used:

Hastie, T., Tibshirani, R. and Friedman, J. Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer Series in Statistics. 2009. Available here: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Kleiber, C., Zeileis, A. Applied Econometrics with R. Springer Science+Business Media, LLC. 2008.
Ramanathan, R. Introductory Econometircs with Applications. Harcourt Brace Jovanovich, Inc. 1989.
Wooldridge, J. Introductory Econometrics, Fourth Edition. South-Western. 2009.

Websites:

UCLA Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/r/
CRAN Task View: Computational Econometrics: http://cran.r-project.org/web/views/Econometrics.html
Wikipedia on linear regression: http://en.wikipedia.org/wiki/Linear_regression

Other resources:

man pages from lmtest, stats and knitr
Faraway, J. Practical Regression and Anova using R. 2002. http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
RStudio
Cookbook for R: http://cookbook-r.com
http://www.pro-football-reference.com/ & http://fantasyfootballcalculator.com/adp.php?year=2012
XKCD

R Markdown code available here: https://gist.github.com/cmiller01/5954938