Data Project: Predicting A Run or a Pass on NFL Third Downs

Introduction

Being the defensive coordinator of a team in the National Football League is an incredibly difficult job. Before every game, a defensive coordinator has to create a general game plan for the defensive unit of a football team, and during the game, on every play, they are responsible for deciding what the team runs to counter the opposing offense’s attempts to gain yards and score points. Defensive play calling becomes especially crucial on third downs, where a successful defense can stop the offense from getting a first down and force them to either punt or kick a field goal, both of which are non-ideal outcomes for an offensive drive. Obviously, in the heat of the game-time situation, the defensive coordinator has no idea what play the offense will call, but they do have to make an educated guess on what that play may be in order to be able to counter it. The goal of this study is to help a defensive coordinator make third down play calls by using the given circumstances in a logistic regression model to predict whether or not the offensive team will run or pass the ball in this situation.

The data set used to create this model is a set of every single play run in the National Football League in the 2015-2016 season. It has been filtered to include only plays run on third down, and it has also been filtered to remove all variables that are recorded as a result of the play (yards gained, tackler, pass location, run location, etc.). The variable that the model will predict is called PlayType, and all plays except for Run and Pass plays have been filtered (penalties, field goals that may occur at the end of halves, etc.), so this variable has become binary. The data has been split into a training and a testing data set, where 80% of the data is used to train a model and 20% is used to test the model.

Exploratory Analysis & Plots/Visualizations

Some of the potential explanatory variables include:

Variable	Type	Description
TimeSecs	int	the amount of time remaining in the game in seconds
qtr	int	the quarter of the game
Drive	int	the amount of possessions the offensive team has had until that point
yrdline100	int	how far the offensive team is from the goal line
SideofField	Factor	which team’s side of the field the ball is on
ydstogo	int	how far the offensive team is from the first down line
GoalToGo	int	binary variable, 1 if first down line is the goal line, 0 otherwise
ScoreDiff	int	difference in score between the offensive and defensive teams

One noticable thing about this data is that multiple variables may provide the same effect on a potential model. For example, the TimeSecs, qtr, and Drive variables all represent how far along a game may be, and that could help determine what kind of play is run in certain situations. Thus, using all of these in a model is useless, and we should only use the most specific of these variables (which is TimeSecs). The same is also true with the yrdline100 and GoalToGo variables (both display how close to the goal line the team is), and the yrdline100 is clearly the most specific of the two. We confirm these suspicions by looking at a correlation matrix of our explanatory variables and denoting which variables have high correlation, which would imply multicollinearity (this occurs when some of the predictors are highly correlated).

We will then create some plots to see whether any of the above variables has a clear and significant impact on play type. First, let us plot the ydstogo variable against the PlayType variable to see whether the distance from the first down marker has any implications on the play type.

We can see based on this plot that although it is more common to pass the ball at farther distances, there is no clear distinction between when the offense will run and when it will pass the ball based on this variable alone. Let us now look at a histogram of the ydstogo variable to see why this may be the case.

Ignoring the spike at 10 yards (which logically makes sense because after every first down is acheived, there are 10 yards until the next first down), we can see that there is an approximately logistic distribution of this variable. Thus, in our analyses, we will examine the log_ydstogo variable, which is just the logarithm of the ydstogo variable, instead of the ydstogo variable itself to account for the shape of the distribution. Plotting any of the other variables will not yield very telling results, as there is too much data, and no one variable displays a clear connection with PlayType.

Model Selection

Now, we can start to create our initial model with the three most impactful variables we have so far: TimeSecs, yrdline100, and log_ydstogo.

Let us create a pairs plot for these three variables and determine whether there are any potential interactions between variables to look at and add to the model.

We see that yrdline100 and ydstogo may have an interaction, as both of these terms inclue a yardage distance (one from goal and the other from the down marker), so it makes sense that there could potentially be a relationship between those log_ydstogo and yrdline100 to explore. Let us see if accounting for this our model increases our ability to predict PlayType.

## Analysis of Deviance Table
## 
## Model 1: PlayType ~ TimeSecs + yrdline100 + log_ydstogo
## Model 2: PlayType ~ TimeSecs + log_ydstogo + yrdline100 + log_ydstogo * 
##     yrdline100
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      5044     4530.7                          
## 2      5043     4512.0  1   18.702 1.528e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is apparent, based on the low p-value given by this analysis (which signifies the low probability there is of these terms having no additional predictive effect on PlayType), that this model does greatly improve our ability to predict the outcome, so we will continue to include this term.

A common phenomenon that occurs in games is running or passing being dependent on whether a team is ahead or behind late in the game. For example, a team that is losing with little time left in the game is more likely to pass the ball, as throwing the ball allows teams to gain larger amounts of yardage while taking little time off the clock. On the flip side, a team that is winning when the game is about to end is more likely to run the ball, as it constantly keeps time running off the clock and is less likely to result in turnovers. Thus, we can explore a potential interaction term between TimeSecs and ScoreDiff in our model.

## Analysis of Deviance Table
## 
## Model 1: PlayType ~ TimeSecs + log_ydstogo + yrdline100 + log_ydstogo * 
##     yrdline100
## Model 2: PlayType ~ TimeSecs + log_ydstogo + yrdline100 + ScoreDiff + 
##     log_ydstogo * yrdline100 + ScoreDiff * TimeSecs
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      5043     4512.0                          
## 2      5041     4405.4  2    106.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This addition will significantly improve our ability to predict the outcome, so we should clearly include it in our model. Now, we have to account for the personel that different teams possess. For example, the New England Patriots have Tom Brady, who is one of the game’s all time greatest players, at the quarterback position, and they do not have a significantly productive runningback. On the other hand, the Dallas Cowboys have an incredible offensive line, so their runningback, Ezekiel Elliot, is more likely to gain larger chunks of yardage, as he has clearer running lanes. Therefore, the Patriots may be more likely to throw the ball and the Cowboys may be more likely to run it, purely based on the players on their respective rosters. This logic also works on the defensive side of the ball, as some teams may be more equipped to protect against the run and others may be more equpped to protecct against the pass.

In light of these facts, we can create four new potentially explanatory variables representing a team’s abilities in the following four departments: running, passing, run defense, and pass defense. To do this, we determine whether each team was in the upper or lower halves of the league for each of these categories, and each of these categories has their own binary categorical variable to represent whether or not a team was “good” or “bad” in these departments over the 2015-16 NFL season. To determine whether a team was “good” or “bad”, we used statistics called Offensive and Defensive DVOA (Defense-adjusted Value Over Average). This is a comprehensive statistic that takes into account all factors of every single play a team makes and gives a percentage score that is positive if the offense has an advantage, and negative if the defense has one. Thus, the best offenses have positive percentage DVOAs, and the best defenses have negative percentage ones. For each of the four new variables, the teams ranked 1 through 16 for the DVOA statistic for that facet of the game were denoted to be “good”, and those ranked 17 through 32 were said to be “bad”. Thus, we can add these variables to our model and run statistical tests to see which improves our predictive ability.

## Analysis of Deviance Table
## 
## Model 1: PlayType ~ TimeSecs + log_ydstogo + yrdline100 + ScoreDiff + 
##     log_ydstogo * yrdline100 + ScoreDiff * TimeSecs
## Model 2: PlayType ~ TimeSecs + log_ydstogo + yrdline100 + ScoreDiff + 
##     log_ydstogo * yrdline100 + ScoreDiff * TimeSecs + runORank
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1      5041     4405.4                        
## 2      5040     4394.7  1   10.635  0.00111 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After testing each of these variables, we see that only the rank of the rushing offense significantly augments our ability to predict whether a team will run or pass on third down. Finally, before we proceed with this model, let us analyze the variance inflation factors (VIF), which will show how much explanatory variables inflate each other’s variances (higher values are bad), of each term in our model. If we have VIF values that are larger than 10, we want to remove those terms from our model, as it means that too much of the variance in the term is dictated by the variance in other explanatory variables, so it confounds the effect of that term on the predictor. Let’s look at the variance inflation factors now.

##               TimeSecs            log_ydstogo             yrdline100 
##               1.019964               5.761692               3.432212 
##              ScoreDiff               runORank log_ydstogo:yrdline100 
##               3.114611               1.036244               9.224781 
##     TimeSecs:ScoreDiff 
##               3.061404

Since all of our terms have passed the VIF bar that we set, let us take a look at the summary output of our model (to ensure that all variables remain significant), and then we can proceed to analyze our testing data on this model.

## 
## Call:
## glm(formula = PlayType ~ TimeSecs + log_ydstogo + yrdline100 + 
##     ScoreDiff + log_ydstogo * yrdline100 + ScoreDiff * TimeSecs + 
##     runORank, family = "binomial", data = plays.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0358  -0.6434  -0.4611  -0.2914   3.0565  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.091e+00  1.690e-01   6.458 1.06e-10 ***
## TimeSecs               -1.950e-04  3.604e-05  -5.411 6.25e-08 ***
## log_ydstogo            -1.564e+00  1.078e-01 -14.508  < 2e-16 ***
## yrdline100             -1.047e-02  2.808e-03  -3.728 0.000193 ***
## ScoreDiff               5.307e-02  6.272e-03   8.461  < 2e-16 ***
## runORankgood            2.516e-01  7.722e-02   3.257 0.001124 ** 
## log_ydstogo:yrdline100  7.977e-03  1.830e-03   4.359 1.30e-05 ***
## TimeSecs:ScoreDiff     -2.121e-05  4.563e-06  -4.649 3.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5272.3  on 5047  degrees of freedom
## Residual deviance: 4394.7  on 5040  degrees of freedom
## AIC: 4410.7
## 
## Number of Fisher Scoring iterations: 5

Model Verification w/ Test Data

The first thing we must do is run the model with our testing data set to see if each variable remains significant, as in order to proceed with our model verification, this must be the case.

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8808  -0.6350  -0.4534  -0.2654   2.9296  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.539e+00  3.339e-01   4.609 4.05e-06 ***
## TimeSecs                 -2.134e-04  7.422e-05  -2.875  0.00404 ** 
## log_ydstogo              -2.000e+00  2.210e-01  -9.050  < 2e-16 ***
## yrdline100               -1.359e-02  5.572e-03  -2.439  0.01474 *  
## ScoreDiff                 4.051e-02  1.321e-02   3.065  0.00217 ** 
## runORankgood              1.897e-02  1.556e-01   0.122  0.90296    
## `log_ydstogo:yrdline100`  1.429e-02  3.652e-03   3.914 9.10e-05 ***
## `TimeSecs:ScoreDiff`     -1.200e-05  9.338e-06  -1.285  0.19869    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1314.5  on 1259  degrees of freedom
## Residual deviance: 1070.4  on 1252  degrees of freedom
## AIC: 1086.4
## 
## Number of Fisher Scoring iterations: 5

We see here that in our testing data, the runORank term and TimeSecs and ScoreDiff interaction term become insignificant (as their p-values are not small). Thus, we will run a new model that removes those two terms, and we will analyze both of these models to see which one yields better results. Now, with our two models in question, let us run a confusion matrix for each one. A confusion matrix shows the number of correctly and incorrectly predicted outcomes by a model based on the testing data, and it also shows what percentage of the data the model will be accurate for (and a confidence interval for this value). Let us run this for our original model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Pass Run
##       Pass  956 167
##       Run    32 105
##                                           
##                Accuracy : 0.8421          
##                  95% CI : (0.8207, 0.8618)
##     No Information Rate : 0.7841          
##     P-Value [Acc > NIR] : 1.362e-07       
##                                           
##                   Kappa : 0.4312          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9676          
##             Specificity : 0.3860          
##          Pos Pred Value : 0.8513          
##          Neg Pred Value : 0.7664          
##              Prevalence : 0.7841          
##          Detection Rate : 0.7587          
##    Detection Prevalence : 0.8913          
##       Balanced Accuracy : 0.6768          
##                                           
##        'Positive' Class : Pass            
##

We see here that we can use this model to predict whether or not a the play is a run or a pass 84.21% of the time. The high sensitivity indicates that there are few false negatives, but the low specificity indicates that there are many false positives (where negative means that it is a run play and positive means that it is a pass play). We can use a chi-square test to see if the confusion matrix indicates that this model has statistically singificant classification power.

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  confuse.original$table
## X-squared = 275.25, df = NA, p-value = 0.0004998

Due to the miniscule p-value of the above output, we know that the classification power of this model is very high. Let us now see if we can improve upon these numbers with our updated model (removing the two terms that were insignificant when run on the testing data).

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Pass Run
##       Pass  956 168
##       Run    32 104
##                                          
##                Accuracy : 0.8413         
##                  95% CI : (0.8199, 0.861)
##     No Information Rate : 0.7841         
##     P-Value [Acc > NIR] : 2.013e-07      
##                                          
##                   Kappa : 0.4274         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9676         
##             Specificity : 0.3824         
##          Pos Pred Value : 0.8505         
##          Neg Pred Value : 0.7647         
##              Prevalence : 0.7841         
##          Detection Rate : 0.7587         
##    Detection Prevalence : 0.8921         
##       Balanced Accuracy : 0.6750         
##                                          
##        'Positive' Class : Pass           
##

We see here that there is almost no change in model accuracy, sensitivity, or specificity (.8413 compared to .8421, .9676 compared to .9676, and .3824 compared to .3860) between this updated model and our original model. Let us see if the chi-square test on the confusion matrix yields any interesting results.

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  confuse.updated$table
## X-squared = 271.29, df = NA, p-value = 0.0004998

Since the p-values here are identical, we cannot ascertain which model is superior based on that output. To look further for differences, we can look at the Receiver Operating Characteristics (ROC) curve of each model, which maps the model’s true positive rate against its false positive rate. Let us first examine this for our original model.

We see here that the true positive rate is always larger than the false positive rate, which is a very good sign. Let us compare this to the ROC curve for our updated model to determine which is more useful in this way.

There is almost no difference in the two ROC curves, so we again see that the verification these two models displays essentially equal accuracy. WE earlier logically added those two terms to our model, and despite their losing of significance in the testing data set, they do not confound any variables or detract from the predictive power of the model; thus, I am inclined to tout our original model as the most holistic and apt predictive model.

Inferences/Conclusions

Now, we can discuss what these numbers all mean and what we can truly draw from our logistic regression model. If we look back at the initial summary output for our model, we see a column called estimates. What this column of data tells us is that the odds of predicting a pass play are estimated to increase by a factor of that estimate for a one unit increase in that variable. For example, the estimate for our log_ydstogo variable is -1.564. Thus, the odds of predicting a pass play are estimated to increase by a factor of -1.564 (or the odds of predicting a run play are estimated to increase by a factor of 1.564) for a one unit increase in the logarithm of yards remaining before the first down line.

The scope of inference for this study is quite wide, as we are examining data across the entire 2015 - 2016 season. However, there may be need to account for changes in the skill levels of teams in four created fields (passing offense, running offense, passing defense, running defense) based on improvements or decline for this season and future seasons to come. This cannot be done until the data exists for that, so the model may be better suited to predicting plays from teams that remained at around an equal skill level than the previous year. This uncertainty extends to the coaching staff as well, since if there is a first year coaching staff in place, they will obviously have different tendencies than the coaching staff from previous years, which could be detrimental to the results of the model. In spite of this, we do have the ability to predict at slightly over 84% accuracy, and that number is most likely between 82% and 86%. This high predictive power shows that this model can have a certain level of success in circumstances, and it would be a very nice complimentary piece to the gameplan of a defensive coordinator scheming against an opposing team’s third down offense.