Data Dive Week 10 - Logistic Regression

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  library(boot)
  library(broom)
  library(car)
  library(GGally)
  library(ggrepel)
  library(lindia)
  library(performance)
  source("aptheme.R") #Code that helps format graphs
  })

Import the data and create additional variables to model.

data <- import("plays.csv") %>%
  mutate(converted = yardsGained >= yardsToGo, 
         pass = !(is.na(passLength)),
         zone = pff_manZone == "Zone")

Potentially the most important thing in football is whether the offense converts the existing downs into a fresh set and can continue the drive. Here we model the probability of that conversion using a binary variable, pass, that is TRUE when the offense passes the ball and FALSE when it runs. We control for the down (since there is more pressure to convert on third down than on first down), the number of yards needed for a fresh set of downs, and whether or not the play is a dropback pass.
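Written out, the model described above is a logistic regression on the log-odds scale. The symbols p and the betas are notation introduced here for illustration, not taken from the original write-up; p stands for the probability that converted is TRUE.

$$
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1\,\text{pass} + \beta_2\,\text{down} + \beta_3\,\text{yardsToGo} + \beta_4\,\text{isDropback}
$$

Each coefficient is therefore a change in log-odds, which is why the estimates are exponentiated before they are read as odds ratios.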

model <- glm(converted ~ pass + down + yardsToGo + isDropback , data = data,  family = "binomial")
summary(model)
## 
## Call:
## glm(formula = converted ~ pass + down + yardsToGo + isDropback, 
##     family = "binomial", data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.105853   0.081965   1.291 0.196552    
## passTRUE        1.033678   0.089935  11.494  < 2e-16 ***
## down            0.095414   0.026271   3.632 0.000281 ***
## yardsToGo      -0.191887   0.006212 -30.889  < 2e-16 ***
## isDropbackTRUE -0.375009   0.093303  -4.019 5.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19482  on 16123  degrees of freedom
## Residual deviance: 17543  on 16119  degrees of freedom
## AIC: 17553
## 
## Number of Fisher Scoring iterations: 4
with(summary(model), 1 - deviance / null.deviance)
## [1] 0.09952782
vif(model)
##       pass       down  yardsToGo isDropback 
##   5.715213   1.414640   1.415619   5.871764

Looking here, we can see a multicollinearity issue between the binary pass variable and the binary dropback variable: both have a VIF above 5, while down and yardsToGo sit near 1.4. We will therefore drop the dropback variable from the model and try a binary variable for whether the defense is running zone or man.
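One rough way to read those VIF values is through the ordinary linear-model definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. This is only a sketch: car::vif() on a glm works from the estimated coefficient covariance, so the implied R² values below are approximate readings, and the name implied_r2 is made up for this example.

```r
# VIF values copied from the vif(model) output above
vif_values <- c(pass = 5.715213, down = 1.414640,
                yardsToGo = 1.415619, isDropback = 5.871764)

# Invert VIF = 1 / (1 - R^2) to get the share of each predictor's
# variance that the other predictors would explain
implied_r2 <- 1 - 1 / vif_values
round(implied_r2, 3)
```

Under that reading, a little over 80% of the variation in pass and in isDropback is shared with the other predictors, which fits the idea that the two flags carry nearly the same information.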

model <- glm(converted ~ pass + zone + down + yardsToGo , data = data,  family = "binomial")
summary(model)
## 
## Call:
## glm(formula = converted ~ pass + zone + down + yardsToGo, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.088948   0.085778   1.037  0.29975    
## passTRUE     0.696330   0.039226  17.752  < 2e-16 ***
## zoneTRUE     0.083884   0.040854   2.053  0.04005 *  
## down         0.077449   0.025964   2.983  0.00285 ** 
## yardsToGo   -0.196454   0.006304 -31.164  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19343  on 15931  degrees of freedom
## Residual deviance: 17467  on 15927  degrees of freedom
##   (192 observations deleted due to missingness)
## AIC: 17477
## 
## Number of Fisher Scoring iterations: 4
with(summary(model), 1 - deviance / null.deviance)
## [1] 0.09700333
vif(model)
##      pass      zone      down yardsToGo 
##  1.083830  1.089128  1.381107  1.448844

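Looking here, the zone variable does not improve the pseudo R squared of the model; it actually falls from 0.0995 to 0.0970. (The quantity 1 - deviance / null deviance is McFadden's pseudo R squared rather than an adjusted R squared, and the two values are not strictly comparable because the zone model drops 192 plays with missing coverage data.) There is reason to believe that the defensive scheme affects the likelihood of converting a down. However, it is equally plausible that a team calls a run or a pass based on the coverage it sees, which would entangle zone with pass. Therefore we will remove the zone variable.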
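Because the model with zone is fit on fewer plays, a cleaner comparison would refit both models on the same rows. The sketch below shows one way to do that. It runs on simulated data, not on plays.csv: the data frame toy, its values, and the model names are all invented for illustration.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the plays data; every value here is made up
toy <- data.frame(
  pass      = rbinom(n, 1, 0.6) == 1,
  down      = sample(1:4, n, replace = TRUE),
  yardsToGo = sample(1:15, n, replace = TRUE)
)
toy$zone <- rbinom(n, 1, 0.7) == 1
toy$zone[sample(n, 25)] <- NA  # mimic plays with no coverage label
toy$converted <- rbinom(n, 1, plogis(0.1 + 0.7 * toy$pass - 0.2 * toy$yardsToGo)) == 1

# Keep only rows where every model variable is observed
complete <- toy[complete.cases(toy), ]

with_zone    <- glm(converted ~ pass + zone + down + yardsToGo,
                    data = complete, family = "binomial")
without_zone <- glm(converted ~ pass + down + yardsToGo,
                    data = complete, family = "binomial")

# Both fits now use identical rows, so deviance and AIC are comparable
anova(without_zone, with_zone, test = "Chisq")
AIC(without_zone, with_zone)
```

On the real data the same pattern would apply after filtering to plays where pff_manZone is not missing.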

model <- glm(converted ~ pass  + down + yardsToGo , data = data,  family = "binomial")
summary(model)
## 
## Call:
## glm(formula = converted ~ pass + down + yardsToGo, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.119348   0.082060   1.454  0.14583    
## passTRUE     0.712664   0.039074  18.239  < 2e-16 ***
## down         0.074573   0.025747   2.896  0.00377 ** 
## yardsToGo   -0.195172   0.006179 -31.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19482  on 16123  degrees of freedom
## Residual deviance: 17560  on 16120  degrees of freedom
## AIC: 17568
## 
## Number of Fisher Scoring iterations: 4
with(summary(model), 1 - deviance / null.deviance)
## [1] 0.09865687
vif(model)
##      pass      down yardsToGo 
##  1.081001  1.363870  1.397190

Interpretation

Next, we can exponentiate the coefficients to help interpret the model as odds ratios.

exp(model$coefficients)
## (Intercept)    passTRUE        down   yardsToGo 
##   1.1267623   2.0394177   1.0774243   0.8226934

This tells us that throwing a pass has the largest per-unit effect on the odds of converting the down: a pass roughly doubles the odds of a fresh set of downs (an increase of about 104%). This makes sense, considering pass plays generally pick up the most yards and so have the best chance of converting. Not every play can be a pass, since a team needs to establish the run to open up the passing game. However, all else being equal, a pass play is the better call on a must-have third or fourth down.
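Doubling the odds is not the same as doubling the probability, so it can help to turn the coefficients into predicted probabilities for a concrete situation. The sketch below uses the final model's estimates and a hypothetical third-and-5 play; the scenario and the variable names are chosen here for illustration.

```r
# Coefficients copied from the final model summary above
b <- c(intercept = 0.119348, pass = 0.712664,
       down = 0.074573, yardsToGo = -0.195172)

# Hypothetical scenario: third down with 5 yards to go
down_number <- 3
yards_to_go <- 5

# Linear predictor (log-odds) for a run, then add the pass coefficient
log_odds_run  <- b[["intercept"]] + b[["down"]] * down_number +
  b[["yardsToGo"]] * yards_to_go
log_odds_pass <- log_odds_run + b[["pass"]]

# plogis() is the inverse logit: it maps log-odds to probabilities
p_run  <- plogis(log_odds_run)
p_pass <- plogis(log_odds_pass)
round(c(run = p_run, pass = p_pass), 3)
```

For this scenario the model gives roughly a 35% chance of converting on a run and roughly 52% on a pass: the odds double, but the probability rises by about 17 percentage points.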

The zone and down variables have the weakest effects on the odds of converting a down. Zone is not in the final model, but in the earlier model that included it, the defense running zone increased the odds of a converted down by about 9% (exp(0.0839) ≈ 1.09). It’s not a huge effect, and it is only marginally significant (p = 0.04), but it suggests a defensive coordinator may want to call man coverage when there is a must-have stop on third or fourth down.

Each additional down increases the odds of a converted down by about 8%. Down is included in the model as a control variable, and its effect follows conventional wisdom. Teams will take a shot downfield on first or second down, but it becomes far more important to pick up the yardage needed on third and fourth down.

Finally, for every additional yard needed, the odds of converting the down decrease by about 18%. This also follows conventional wisdom: the more yards a team needs to cover, the less likely it is to convert the down.
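The percentages quoted in this section come from subtracting 1 from each odds ratio. A minimal sketch of that arithmetic, using the final model's estimates (the name pct_change is invented for this example):

```r
# Log-odds coefficients copied from the final model summary above
coefs <- c(passTRUE = 0.712664, down = 0.074573, yardsToGo = -0.195172)

# Odds ratio minus 1, expressed as a percent change in the odds
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

A positive value is a percent increase in the odds per one-unit change in the predictor, and a negative value is a percent decrease.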

exp(confint.default(model))
##                 2.5 %   97.5 %
## (Intercept) 0.9593625 1.323372
## passTRUE    1.8890625 2.201740
## down        1.0244038 1.133189
## yardsToGo   0.8127903 0.832717

Here we can see the confidence intervals, on the odds-ratio scale, for each of our explanatory variables. There is no strong reason to prefer any particular value of alpha, so we use conventional two-tailed 95% intervals, which show which variables have the greatest range of plausible effects. While the passing variable has the largest effect on the odds of converting a down, it also has the widest confidence interval among the explanatory variables (about 1.89 to 2.20), so there may be some fuzz to the size of that effect. Even so, the whole interval sits well above 1, so the direction of the effect is not in doubt.
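confint.default() builds Wald intervals: the estimate plus or minus a normal quantile times the standard error, computed on the log-odds scale and then exponentiated. As a check on the passTRUE row, the same interval can be sketched by hand from the numbers in the final model summary (the variable names are chosen here for illustration):

```r
# Estimate and standard error for passTRUE, copied from the final summary
estimate  <- 0.712664
std_error <- 0.039074

# Two-tailed 95% interval on the log-odds scale, then exponentiate
z <- qnorm(0.975)
wald_ci <- exp(estimate + c(lower = -1, upper = 1) * z * std_error)
round(wald_ci, 3)
```

The result should agree with the passTRUE row above up to rounding of the copied inputs.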