As cliché as it sounds, American football is a game of inches. Given the fact that the NFL is a multi-billion dollar industry, these inches and the decisions that precede them have an immense monetary impact. In an attempt to glean additional insight into the impact of decision-making within the sport, our group set out to determine which in-game factors significantly impact a team’s WPA (“Win Probability Added”) for a given play. Specifically, we wanted to look at fourth down plays exclusively, as these are typically the most “high-leverage” decisions within a game.

Dataset

For our dataset, we were able to scrape data for every NFL regular season play from 2009 to 2020. We then cleaned the data to solely look at fourth down plays within that time frame, which totaled around 43,950 observations. After dealing with a few outliers, we ended up with 43,905 fourth down plays over the course of 12 NFL seasons. Included below is a table explaining each variable:
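The filtering step described above can be sketched in a few lines. This is an illustrative Python sketch, not the code used for the analysis; the `down` field and the sample rows are assumptions made for the example, while the four play types match the ones tabulated later in this report.

```python
# Hypothetical play-by-play rows; the `down` field is an assumed column name.
plays = [
    {"down": 1, "play_type": "run"},
    {"down": 4, "play_type": "punt"},
    {"down": 4, "play_type": "field_goal"},
    {"down": 3, "play_type": "pass"},
    {"down": 4, "play_type": "pass"},
]

# The four fourth-down actions that appear in the play type table.
KEPT_PLAY_TYPES = {"punt", "field_goal", "run", "pass"}


def fourth_down_plays(rows):
    """Keep only fourth-down rows whose play type is one of the four actions."""
    return [
        row for row in rows
        if row["down"] == 4 and row["play_type"] in KEPT_PLAY_TYPES
    ]


fourth_downs = fourth_down_plays(plays)
print(len(fourth_downs), "fourth-down plays kept")
```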

Variable Description
yardline_100 How far the offensive team is from the other team’s endzone. For example, if the offense is at the 75 ‘yardline_100’, they are 75 yards from the opponent’s endzone (or technically on their own 25 yard line)
game_date Calendar date on which the play occurred
game_seconds_remaining How many seconds are remaining in the game
game_half Which half of the game the play happens within (football has two halves, so values are ‘Half1’ and ‘Half2’)
ydstogo How many yards the offensive team needs for a first down on this particular fourth down play
play_type What action the offense took during this situation
field_goal_result If the offense attempted a field goal, this variable gives insight into the result of the attempt
score_differential The difference in score between the team in possession versus the defensive team. For example, if the defensive team is winning by 14 points before the fourth-down play is run, the ‘score_differential’ value is -14
ep Expected Points. The value of the current distance, field position, and down situation in terms of future expected net point advantage. Essentially, it is the net point value a team can expect given a particular combination of down, distance, and field position
epa Expected Points Added. The difference between the Expected Points (EP) at the start of a play and the EP at the end of the play. EPA is a measure of a play’s impact on the score of the game
wpa Win Probability Added. The difference between a team’s Win Probability (WP) at the start of a play and the WP at the end of the play. WPA is another measure of a play’s impact on the outcome of a game. Measured in percentages (e.g. a value of 6 is a 6 percent increase in WP for a play)
fourth_down_converted Whether a team that ‘goes for it’ in a fourth down situation (doesn’t punt or kick a field goal) achieves a first down. Coded as 1 for a conversion, 0 for failure (or not applicable).
fourth_down_failed Whether a team that ‘goes for it’ in a fourth down situation (doesn’t punt or kick a field goal) fails to get a first down. Coded as 1 for failure, 0 for success (or not applicable)
punt Whether a team decides to punt in a fourth down situation. Coded as 1 for punt, 0 for no punt
field_goal_missed Whether a team that kicks a field goal in a fourth down situation misses the kick. Coded as 1 for a missed field goal, 0 for a made field goal (or not applicable)
field_goal_good Whether a team that kicks a field goal in a fourth down situation makes the kick. Coded as 1 for a made field goal, 0 for a missed field goal (or not applicable)
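To make the wpa definition in the table concrete, here is a minimal Python sketch of the calculation. The function name and the win probabilities are hypothetical; the only thing taken from the table is that WPA is the change in win probability, expressed in percentage points.

```python
def wpa_percent(wp_before, wp_after):
    """Win Probability Added in percentage points.

    Both inputs are win probabilities between 0 and 1 for the offense.
    """
    return (wp_after - wp_before) * 100.0


# Hypothetical play: the offense moves from a 52% to a 58% chance of winning.
example_wpa = wpa_percent(0.52, 0.58)
print(round(example_wpa, 2))
```

A negative value means the play lowered the offense's chance of winning.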

Exploration of Variable Distributions

The following section contains analyses of variable distributions within our dataset:

Categorical Variables

Play Type & Game Half

Distribution Table:
Play Type
Var1 Freq
field_goal 10630
pass 2923
punt 28296
run 2056
Game Half
Var1 Freq
Half1 23111
Half2 20794
Box and bar plots:

The majority of our data comes from punts and field goals. We have less data for runs and passes, which may skew the results of any future models. We also see that our data is relatively evenly spread across the two halves. Outliers, as shown by the boxplot, could have a disproportionate impact on the model.

Numeric Variables

Yardline_100 summary and faceted by play type
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   31.00   55.00   51.08   72.00   99.00

As one would expect, most of the punt decisions come farther away from an opponent’s endzone (far left), while the other decisions are concentrated closer to the opponent’s endzone. As shown by the summary, the variable is roughly symmetric overall, with the mean (51.08) close to the median (55).

Yards to Go Summary & Distribution
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   7.000   7.782  11.000  48.000

The distribution of our yards to go data shows that for 75% of our plays, teams were within 11 yards of a 1st down. This data is positively skewed (the mean of 7.78 exceeds the median of 7, with a long tail out to 48 yards), meaning that the majority of our data comes from when teams were close to reaching a first down/touchdown.

Score Differential Summary & Distribution
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -59.000  -7.000   0.000  -1.076   4.000  59.000

Looking at the distribution of score differential, it is evident that the majority of the plays occurred when teams were within 7 points of one another: the middle 50% of plays fall between the offense being 7 points down and 4 points up.

Game Seconds Remaining Summary and Distribution
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     121     931    1834    1803    2659    3588

Looking at the summary and histogram of the “game_seconds_remaining” variable, there seems to be a relatively even distribution for the observations.

WPA Summary & Distribution
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -49.9524  -1.4213   0.0357   0.2273   2.1444  47.5490

Exploration of Variable Correlations

20 random observations are selected for the scatter plot; when compared to the ellipse matrix, the significant correlations are still visible.

yardline_100 game_seconds_remaining ydstogo score_differential ep epa wpa fourth_down_converted fourth_down_failed punt field_goal_missed field_goal_good
yardline_100 1.0000 0.0593 0.2410 -0.0340 -0.9651 0.0142 0.0439 -0.1606 -0.1562 0.7971 -0.1885 -0.6751
game_seconds_remaining 0.0593 1.0000 -0.0206 -0.0075 -0.0819 0.0329 0.0453 -0.0869 -0.1461 0.1189 0.0023 -0.0098
ydstogo 0.2410 -0.0206 1.0000 -0.0537 -0.3027 0.0244 0.0847 -0.2218 -0.1190 0.2167 0.0152 -0.0670
score_differential -0.0340 -0.0075 -0.0537 1.0000 0.0367 -0.0020 -0.0241 -0.1114 -0.1653 0.0566 0.0277 0.0775
ep -0.9651 -0.0819 -0.3027 0.0367 1.0000 -0.0295 -0.0353 0.2050 0.1788 -0.8283 0.1933 0.6712
epa 0.0142 0.0329 0.0244 -0.0020 -0.0295 1.0000 0.7036 0.4601 -0.4569 -0.0288 -0.4253 0.2147
wpa 0.0439 0.0453 0.0847 -0.0241 -0.0353 0.7036 1.0000 0.3285 -0.2672 -0.0280 -0.3365 0.1446
fourth_down_converted -0.1606 -0.0869 -0.2218 -0.1114 0.2050 0.4601 0.3285 1.0000 -0.0600 -0.3397 -0.0488 -0.1285
fourth_down_failed -0.1562 -0.1461 -0.1190 -0.1653 0.1788 -0.4569 -0.2672 -0.0600 1.0000 -0.3198 -0.0460 -0.1211
punt 0.7971 0.1189 0.2167 0.0566 -0.8283 -0.0288 -0.0280 -0.3397 -0.3198 1.0000 -0.2606 -0.6858
field_goal_missed -0.1885 0.0023 0.0152 0.0277 0.1933 -0.4253 -0.3365 -0.0488 -0.0460 -0.2606 1.0000 -0.0986
field_goal_good -0.6751 -0.0098 -0.0670 0.0775 0.6712 0.2147 0.1446 -0.1285 -0.1211 -0.6858 -0.0986 1.0000

Punting and field position are strongly positively correlated (0.80), as are WPA and EPA (0.70), which makes sense given that they are relatively similar measures. We can also see that field position and EP are strongly negatively correlated (-0.97), as are punts and EP (-0.83).
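Each entry in the matrix above is a Pearson correlation coefficient. The Python sketch below shows how one such entry is computed; the five data points are made up for illustration and do not come from our dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical toy data: teams farther from the endzone punt more often.
toy_yardline_100 = [20, 35, 50, 65, 80]
toy_punt = [0, 0, 1, 1, 1]

toy_r = pearson_r(toy_yardline_100, toy_punt)
print(round(toy_r, 3))
```

Note that a 0/1 indicator such as punt can be correlated with a numeric variable in exactly the same way as two numeric variables.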

Linear Models

To better suit our research question we split our data set into three separate data frames and models, one for each potential “action” that a team could take on a fourth down play. For each action, we fit a linear model to attempt to determine what situations and actions dictate a fourth down play’s WPA (Win Probability Added) for the offensive team. As you will see below, those actions are as follows: punt, field goal, and going for it.

Punt Model

## 
## Call:
## lm(formula = wpa ~ yardline_100 + ydstogo + score_differential + 
##     game_seconds_remaining, data = punt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.591  -1.425  -0.042   1.439  36.097 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.731e+00  1.000e-01 -27.305   <2e-16 ***
## yardline_100            2.374e-02  1.355e-03  17.514   <2e-16 ***
## ydstogo                 7.234e-02  3.343e-03  21.636   <2e-16 ***
## score_differential      2.515e-03  1.872e-03   1.344    0.179    
## game_seconds_remaining  3.576e-04  1.974e-05  18.116   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.263 on 28291 degrees of freedom
## Multiple R-squared:  0.04014,    Adjusted R-squared:  0.04001 
## F-statistic: 295.8 on 4 and 28291 DF,  p-value: < 2.2e-16

Using the p-values, all variables appear significant other than “score_differential” (p = 0.179). In the next model, we remove that variable.

## 
## Call:
## lm(formula = wpa ~ yardline_100 + ydstogo + game_seconds_remaining, 
##     data = punt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.569  -1.432  -0.044   1.441  36.090 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.716e+00  9.938e-02  -27.33   <2e-16 ***
## yardline_100            2.361e-02  1.352e-03   17.46   <2e-16 ***
## ydstogo                 7.195e-02  3.331e-03   21.60   <2e-16 ***
## game_seconds_remaining  3.552e-04  1.966e-05   18.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.263 on 28292 degrees of freedom
## Multiple R-squared:  0.04008,    Adjusted R-squared:  0.03998 
## F-statistic: 393.8 on 3 and 28292 DF,  p-value: < 2.2e-16

Removing score_differential from the model increased our F-statistic from 295.8 to 393.8, while R-squared was essentially unchanged (0.04014 versus 0.04008). With an R-squared of 0.04, only 4% of the variance in WPA is accounted for within our “Punt” model. With the ydstogo coefficient (0.07195) as an example, each additional ~13.9 yards to go corresponds to an expected increase of 1 in WPA for a punt, holding the other predictors constant.
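The "yards per 1 WPA" figure is simply the reciprocal of the slope. A short Python sketch of that arithmetic follows; the helper name is hypothetical, and the coefficient is copied from the reduced punt model output above.

```python
def units_per_one_wpa(coefficient):
    """How far a predictor must move for the fitted WPA to change by 1."""
    return 1.0 / coefficient


# ydstogo slope from the reduced punt model (7.195e-02).
punt_ydstogo_coef = 0.07195

punt_yards_per_wpa = units_per_one_wpa(punt_ydstogo_coef)
print(round(punt_yards_per_wpa, 1))
```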

Residual and Linear Assumption Testing

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 62.68649, Df = 1, p = 2.4237e-15

With a very low p-value (2.4e-15), the test rejects constant variance, so it is evident that this model violates the linear regression assumption regarding consistency of variance.

## [1]  1500 21843
## Error in shapiro.test(linear$residuals): sample size must be between 3 and 5000

We see numerous outliers within the qqplot, and our dataset is too large to perform the Shapiro-Wilk test. The maximum number of rows for this test is 5000.
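One possible workaround for the 5000-row limit, which we did not apply in this report, is to test a random subsample of the residuals. The Python sketch below only shows the subsampling step; the residuals are simulated placeholders and the function name is an assumption.

```python
import random

SHAPIRO_MAX_N = 5000  # upper limit quoted in the error message above


def subsample_for_shapiro(residuals, max_n=SHAPIRO_MAX_N, seed=42):
    """Return at most max_n residuals, drawn without replacement."""
    if len(residuals) <= max_n:
        return list(residuals)
    rng = random.Random(seed)
    return rng.sample(residuals, max_n)


# Simulated stand-in for the 28,296 punt-model residuals (not real data).
placeholder_rng = random.Random(0)
placeholder_residuals = [placeholder_rng.gauss(0.0, 3.263) for _ in range(28296)]

residual_sample = subsample_for_shapiro(placeholder_residuals)
print(len(residual_sample))
```

A test run on a subsample speaks only to that subsample, so the Q-Q plot remains the main diagnostic for the full set of residuals.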

Field Goal Model

## 
## Call:
## lm(formula = wpa ~ ydstogo + yardline_100 + epa + game_seconds_remaining, 
##     data = fd_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.347  -1.602  -0.025   1.532  46.709 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.513e+00  1.047e-01 -14.449   <2e-16 ***
## ydstogo                 1.414e-01  7.651e-03  18.479   <2e-16 ***
## yardline_100            3.815e-02  3.617e-03  10.549   <2e-16 ***
## epa                     2.265e+00  2.228e-02 101.676   <2e-16 ***
## game_seconds_remaining -8.486e-05  3.666e-05  -2.315   0.0207 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.981 on 7436 degrees of freedom
## Multiple R-squared:  0.5984, Adjusted R-squared:  0.5982 
## F-statistic:  2770 on 4 and 7436 DF,  p-value: < 2.2e-16
##   RMSE_model RMSE_test
## 1   2.979561  2.931066

All coefficients are appropriately positive or negative, and the RMSEs for the fitted model (RMSE_model) and the test data (RMSE_test) track each other relatively well.
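For reference, RMSE is the square root of the mean squared difference between actual and predicted WPA. The Python sketch below shows the calculation on four made-up plays; none of these numbers come from our data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted WPA values for four plays.
toy_actual_wpa = [1.5, -2.0, 0.3, 4.1]
toy_predicted_wpa = [1.0, -1.0, 0.0, 3.0]

toy_rmse = rmse(toy_actual_wpa, toy_predicted_wpa)
print(round(toy_rmse, 3))
```

Computing this once on the data used for fitting and once on held-out plays gives the two numbers compared above; similar values suggest the model is not badly overfit.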

59.84% of the variance in wpa is explained by the predictor variables in this model. With football there are an infinite number of human and environmental factors that can change the outcome of a play, which leaves wide room for error. This makes it essentially impossible to perfectly predict WPA within a model context.

The F-statistic of 2770 is the largest of the three untransformed action models and is strong evidence against the null hypothesis that all slope coefficients are zero. The p-values for ydstogo, yardline_100, and epa are all highly significant. The p-value for game_seconds_remaining (0.0207) is still below 0.05, so it is significant, but not as significant as the other variables. The model worked less well when we removed it, so it still provides valuable information. With field position as an example, being about 26 yards farther from the opponent’s endzone (1 / 0.03815) would correspond to an expected increase of 1 WPA if the team decides to kick a field goal on fourth down, holding the other predictors constant.
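To show how the fitted coefficients turn into a prediction, the Python sketch below plugs a made-up play into the field goal model. The coefficients are copied from the output above; the play itself and the function name are hypothetical.

```python
# Coefficients copied from the field goal model summary above.
FIELD_GOAL_COEFS = {
    "intercept": -1.513,
    "ydstogo": 0.1414,
    "yardline_100": 0.03815,
    "epa": 2.265,
    "game_seconds_remaining": -8.486e-05,
}


def predict_wpa(play, coefs=FIELD_GOAL_COEFS):
    """Linear prediction: intercept plus each coefficient times its value."""
    return coefs["intercept"] + sum(
        coefs[name] * value for name, value in play.items()
    )


# Hypothetical field goal attempt (values invented for illustration).
hypothetical_play = {
    "ydstogo": 5,
    "yardline_100": 20,
    "epa": 0.5,
    "game_seconds_remaining": 1800,
}

fg_prediction = predict_wpa(hypothetical_play)
print(round(fg_prediction, 3))
```

Because epa is only known after the play has been run, a prediction like this describes a completed play and is not a forecast made before the snap.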

Residual and Linear Assumption Testing
##                ydstogo           yardline_100                    epa 
##               1.113177               1.113888               1.001212 
## game_seconds_remaining 
##               1.000545

All VIF and GVIF values are below 10, which indicates no problematic multicollinearity at any iteration.
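A VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The Python sketch below inverts that formula to show how little of each predictor is explained by the rest; the VIF values are copied from the output above and the function names are hypothetical.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor with the given R-squared."""
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    return 1.0 - 1.0 / vif


# VIF values reported for the field goal model.
reported_vifs = {
    "ydstogo": 1.113177,
    "yardline_100": 1.113888,
    "epa": 1.001212,
    "game_seconds_remaining": 1.000545,
}

implied_r_squared = {
    name: r_squared_from_vif(vif) for name, vif in reported_vifs.items()
}
for name, value in implied_r_squared.items():
    print(f"{name}: {value:.4f}")
```

Even the largest value implies that only about 10% of a predictor's variance is shared with the others, far from the 90% that a VIF of 10 would imply.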

There are a few outliers here, and the relationship appears to be non-linear.

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 444.1782, Df = 1, p = < 2.22e-16

Our current model here has a Chisquare of 444.2 with a very low p-value. This was another signal that this data set might not be appropriate for linear regression analysis.

## [1]  159 4262

There are certainly some outliers outside the bounds of this plot, as shown by the black dots outside the blue lines.

## Error in shapiro.test(fd_fit$residuals): sample size must be between 3 and 5000

As shown by the error, the data set is too large for the Shapiro test, so we’re unable to test for normality in this manner.

After the first test (and several iterations) we decided to re-assess all variables including ep and epa. Note: after a few more iterations, we discovered that epa was a key predictor variable within the “field goal” model.

“Go For It” Model

## 
## Call:
## lm(formula = wpa ~ yardline_100 + ydstogo + score_differential + 
##     game_seconds_remaining, data = goforit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.165  -3.514  -0.101   4.312  44.257 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.1315508  0.2848813  -0.462    0.644    
## yardline_100            0.0240009  0.0049738   4.825 1.44e-06 ***
## ydstogo                -0.1109493  0.0262418  -4.228 2.40e-05 ***
## score_differential     -0.0028924  0.0084565  -0.342    0.732    
## game_seconds_remaining  0.0004442  0.0001094   4.062 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 4974 degrees of freedom
## Multiple R-squared:  0.01194,    Adjusted R-squared:  0.01115 
## F-statistic: 15.03 on 4 and 4974 DF,  p-value: 3.252e-12

Our linear model for “go for it” shows that score_differential is not a statistically significant predictor of a team’s WPA (p = 0.732). However, the signs of the coefficients for our predictor variables make logical sense: we would expect a negative relationship between the number of yards a team has to go before a 1st down and WPA, because teams would have further to go before having a chance of scoring.

Because score_differential is not significant, we are going to create a new linear model without it.

## 
## Call:
## lm(formula = wpa ~ yardline_100 + ydstogo + game_seconds_remaining, 
##     data = goforit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.184  -3.512  -0.086   4.313  44.234 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.1045708  0.2737182  -0.382    0.702    
## yardline_100            0.0240404  0.0049720   4.835 1.37e-06 ***
## ydstogo                -0.1096648  0.0259694  -4.223 2.46e-05 ***
## game_seconds_remaining  0.0004347  0.0001058   4.110 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 4975 degrees of freedom
## Multiple R-squared:  0.01192,    Adjusted R-squared:  0.01132 
## F-statistic:    20 on 3 and 4975 DF,  p-value: 6.973e-13

In our new linear model all remaining predictor variables are statistically significant, but our adjusted R^2 is very small (0.01132), indicating that only 1.13% of the variation in WPA is explained by game seconds remaining, a team’s yard line position, and how many yards a team has to go before a 1st down. With yardline_100 as an example, each additional 41.6 yards from the opponent’s endzone corresponds to an expected increase of 1 WPA when going for it on fourth down, holding the other predictors constant.
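The adjusted R^2 values in the summaries can be reproduced from the multiple R^2, the number of plays, and the number of predictors. The Python sketch below uses the standard adjustment formula; the function name is hypothetical, the R^2 values are the rounded ones printed above, and the sample sizes are the residual degrees of freedom plus the number of estimated coefficients.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Reduced "go for it" model: 4975 residual df + 4 coefficients = 4979 plays.
goforit_adj = adjusted_r_squared(0.01192, n_obs=4979, n_predictors=3)

# Reduced punt model: 28292 residual df + 4 coefficients = 28296 plays.
punt_adj = adjusted_r_squared(0.04008, n_obs=28296, n_predictors=3)

print(round(goforit_adj, 5), round(punt_adj, 5))
```

With thousands of plays and only three predictors, the adjustment barely moves R^2, which is why the two columns in each summary are nearly identical.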

Residual and Linear Assumption Testing

The above plot shows that the residuals from this model are widely spread. Even toward the middle of the residual line, the points cover a wide range of residual values, which is an indicator of a problem with our model.

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.5249524, Df = 1, p = 0.46874

The large p-value (0.469) means that the ncvTest fails to reject the null hypothesis of constant variance, so there is no evidence of heteroscedasticity in the residuals of this model (i.e., the spread of our residuals does not follow a pattern).

## [1]  278 3942

The problem with this model is also seen in the qqPlot. Whereas we would hope that our residuals would fall within the parameters of the red lines, there appear to be a lot of outliers and other residual values that fall outside of the red bounds (i.e., the residuals from our model are all over the graph).

## 
##  Shapiro-Wilk normality test
## 
## data:  linear_gfi$residuals
## W = 0.97738, p-value < 2.2e-16

As shown by the minuscule p-value within the Shapiro-Wilk test, the residuals of our model seem to violate the normality assumption.

From the statistical tests run so far on the model created from the “goforit” data, it is evident that a linear model does not fit the data for “go for it” fourth down plays. Given that all three of our attempted linear models violate assumptions, we thought to transform the “ydstogo” variable, as this variable is the only one that we found to be significantly skewed.
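The motivation for a log transformation is that it pulls in a long right tail. The Python sketch below compares the skewness of a made-up, right-skewed set of yards-to-go values before and after taking logs; the values are invented to resemble the shape of our summary (minimum 1, maximum 48) and are not drawn from the dataset.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5


# Hypothetical right-skewed yards-to-go values (all at least 1, so log is defined).
toy_ydstogo = [1, 1, 2, 3, 5, 7, 10, 11, 15, 25, 48]
toy_log_ydstogo = [math.log(v) for v in toy_ydstogo]

raw_skew = skewness(toy_ydstogo)
log_skew = skewness(toy_log_ydstogo)
print(round(raw_skew, 2), round(log_skew, 2))
```

Reducing the skew of a predictor does not by itself repair non-constant variance or non-normal residuals, so the diagnostics still have to be re-run on the transformed models.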

## 
## Call:
## lm(formula = wpa ~ yardline_100 + log(ydstogo) + score_differential + 
##     game_seconds_remaining, data = goforit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.368  -3.434  -0.060   4.246  44.443 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.0512686  0.2874628  -0.178 0.858457    
## yardline_100            0.0245044  0.0049742   4.926 8.65e-07 ***
## log(ydstogo)           -0.5723206  0.1219946  -4.691 2.79e-06 ***
## score_differential     -0.0051723  0.0085133  -0.608 0.543506    
## game_seconds_remaining  0.0004163  0.0001102   3.779 0.000159 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.163 on 4974 degrees of freedom
## Multiple R-squared:  0.01276,    Adjusted R-squared:  0.01196 
## F-statistic: 16.07 on 4 and 4974 DF,  p-value: 4.426e-13
## 
## Call:
## lm(formula = wpa ~ yardline_100 + log(ydstogo) + score_differential + 
##     game_seconds_remaining, data = punt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.739  -1.414  -0.059   1.437  36.198 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.272e+00  1.048e-01 -31.234   <2e-16 ***
## yardline_100            2.403e-02  1.346e-03  17.845   <2e-16 ***
## log(ydstogo)            6.099e-01  2.387e-02  25.549   <2e-16 ***
## score_differential      2.848e-03  1.865e-03   1.527    0.127    
## game_seconds_remaining  3.580e-04  1.966e-05  18.204   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.253 on 28291 degrees of freedom
## Multiple R-squared:  0.04627,    Adjusted R-squared:  0.04613 
## F-statistic: 343.1 on 4 and 28291 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = wpa ~ yardline_100 + log(ydstogo) + epa + game_seconds_remaining, 
##     data = fd_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.478  -1.612  -0.047   1.523  46.610 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.222e+00  1.170e-01  -18.99  < 2e-16 ***
## yardline_100            3.672e-02  3.571e-03   10.28  < 2e-16 ***
## log(ydstogo)            1.022e+00  4.833e-02   21.14  < 2e-16 ***
## epa                     2.267e+00  2.213e-02  102.42  < 2e-16 ***
## game_seconds_remaining -9.391e-05  3.641e-05   -2.58  0.00991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.961 on 7436 degrees of freedom
## Multiple R-squared:  0.6038, Adjusted R-squared:  0.6036 
## F-statistic:  2833 on 4 and 7436 DF,  p-value: < 2.2e-16
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.123907, Df = 1, p = 0.042281
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 49.02377, Df = 1, p = 2.5288e-12
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 490.8003, Df = 1, p = < 2.22e-16

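As shown above, a logarithmic transformation of our “ydstogo” variable doesn’t have much of an impact (if any): R-squared rises only slightly in each model (for example, from 0.0119 to 0.0128 for “goforit”). In fact, it causes the “goforit” model to violate the constant-variance assumption (ncvTest p = 0.042), which the untransformed model satisfied (p = 0.469).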
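The decision rule behind each non-constant variance test can be written out explicitly. In the Python sketch below, the p-values are copied from the three test outputs above, and assigning them to the go-for-it, punt, and field goal models in that order is an assumption based on the order in which the models were printed. The 0.05 threshold and the function name are also assumptions.

```python
ALPHA = 0.05  # assumed significance threshold


def rejects_constant_variance(p_value, alpha=ALPHA):
    """True when the test gives evidence of non-constant residual variance."""
    return p_value < alpha


# p-values from the three log-transformed models (order assumed, see above).
# The field goal value is reported only as "< 2.22e-16", so the bound is used.
log_model_p_values = {
    "goforit": 0.042281,
    "punt": 2.5288e-12,
    "field_goal": 2.22e-16,
}

log_model_violations = {
    model: rejects_constant_variance(p)
    for model, p in log_model_p_values.items()
}
print(log_model_violations)

# For comparison, the untransformed go-for-it model had p = 0.46874.
print(rejects_constant_variance(0.46874))
```

A small p-value is evidence against constant variance, while a large one means the test found no such evidence.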

Due to all three of our “action” models still violating multiple linear assumptions, we find that none of them are necessarily equipped to predict WPA. This was somewhat surprising, especially since we had already split up the linear models into the different actions. A next step in research would be to apply a highly predictive model and combine it with situational action probabilities to create a “suggested” course of action for an offensive team for any fourth down scenario. In the long run, teams would potentially try to maximize WPA for each of their play calls, and this next step would go a long way towards satisfying that. However, given that our linear models aren’t a good fit, we thought to incorporate our question (predicting WPA) into a Random Forest machine-learning framework:

Random Forest Model

Feature selection

##                         meanImp  decision
## yardline_100           33.33682 Confirmed
## ydstogo                29.13630 Confirmed
## score_differential     66.52185 Confirmed
## game_seconds_remaining 30.69461 Confirmed
## result                 82.84026 Confirmed

Each independent variable is confirmed as important in our modeling of WPA.

Model

Validation

## [1] 2.528733

Our RMSE for the train model is 2.53, which means that a given prediction of WPA is typically off by about 2.53 WPA in either direction. We are very happy with these results, as this works well with the spread in WPA in our data.

Application to all data

Final Validation

## [1] 2.55113

Our RMSE for the final model is 2.55, which means that a given prediction of WPA is typically off by about 2.55 WPA in either direction. As this tracks well with the train model, we are very happy with these results, which also work well with the spread in WPA in our data.

As mentioned before, there are a lot of factors unaccounted for in our models. There are environmental factors, personnel factors, and in-game offensive and defensive strategy factors to consider when trying to determine the best course of action on a given fourth down play. However, our research does a great job of opening a new door into the analytics-based decision making that goes into each play call in the football world. As mentioned previously, next steps could be to create a model that “suggests” the action with the highest predicted WPA given a certain offensive fourth down situation.
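The "suggestion" step proposed above amounts to picking the action with the largest predicted WPA. The Python sketch below shows only that final selection; the predicted values are hypothetical placeholders for the output of a future model, and the function name is an assumption.

```python
def suggest_action(predicted_wpa_by_action):
    """Return the action whose predicted WPA is highest."""
    if not predicted_wpa_by_action:
        raise ValueError("at least one action is required")
    return max(predicted_wpa_by_action, key=predicted_wpa_by_action.get)


# Hypothetical predictions for a single fourth-down situation.
hypothetical_predictions = {
    "punt": 0.4,
    "field_goal": -0.2,
    "go_for_it": 1.1,
}

suggested = suggest_action(hypothetical_predictions)
print(suggested)
```

In practice the predictions for each action would also need to account for how likely each outcome is (for example, the chance of converting or of making the kick), as noted in the discussion of the linear models.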