The goal of this particular project was to evaluate the likelihood on any individual play of a conversion into the endzone (i.e. a touchdown or a two point conversion). We elected to examine all individual yardlines between the 20 and the 1, inclusive, for the offense and generate an expected success rate. The data set was pulled from advanced football analytics.com (archive.advancedfootballanalytics.com/2010/04/play-by-play-data.html). Some general scrubbing took place of the data in excel to get it into a workable format. Since I am planning to do some graphical analysis, I have loaded the ggplot2 library.

library(ggplot2)


mydata <- read.csv("c:/users/daniel/desktop/2012_nfl_pbp_data.1.csv")
afc <- subset(mydata, conference == "afc")
nfc <- subset(mydata, conference == "nfc")
ari <- subset(nfc, off == "ARI")
sea <- subset(nfc, off == "SEA")
stl <- subset(nfc, off == "STL")
sf <- subset(nfc, off == "SF")
atl <- subset(nfc, off == "ATL")
no <- subset(nfc, off == "NO")
tb <- subset(nfc, off == "TB")
car <- subset(nfc, off == "CAR")
gb <- subset(nfc, off == "GB")
chi <- subset(nfc, off == "CHI")
min <- subset(nfc, off == "MIN")
det <- subset(nfc, off == "DET")
phi <- subset(nfc, off == "PHI")
dal <- subset(nfc, off == "DAL")
was <- subset(nfc, off == "WAS")
nyg <- subset(nfc, off == "NYG")
nyj <- subset(afc, off == "NYJ")
ne <- subset(afc, off == "NE")
buf <- subset(afc, off == "BUF")
mia <- subset(afc, off == "MIA")
hou <- subset(afc, off == "HOU")
ind <- subset(afc, off == "IND")
jac <- subset(afc, off == "JAC")
ten <- subset(afc, off == "TEN")
cle <- subset(afc, off == "CLE")
bal <- subset(afc, off == "BAL")
cin <- subset(afc, off == "CIN")
pit <- subset(afc, off == "PIT")
den <- subset(afc, off == "DEN")
oak <- subset(afc, off == "OAK")
kc <- subset(afc, off == "KC")
sd <- subset(afc, off == "SD")

The goal of that first section was to subdivide the data into multiple data sets, both by conference and team. While no work was done on teams specifically in this project, it is helpful to have it subdivided already in the event that I need to do some calculations or different models.

afc.goal <- subset(afc, ydline<21 & (play_type == "1" | play_type == "2"))

afc.goal$conversion = factor(afc.goal$conversion)
afc.goal$ydline = factor(afc.goal$ydline)
afc.logit.conversion = glm(conversion ~ ydline + rating, data = afc.goal, family = "binomial")
summary(afc.logit.conversion)
## 
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial", 
##     data = afc.goal)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27408  -0.56318  -0.29210  -0.00014   2.67666  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.288086   0.754713  -0.382 0.702672    
## ydline2      -0.563587   0.321591  -1.752 0.079688 .  
## ydline3      -1.187504   0.403881  -2.940 0.003280 ** 
## ydline4      -1.472035   0.410826  -3.583 0.000340 ***
## ydline5      -1.459842   0.393559  -3.709 0.000208 ***
## ydline6      -1.496497   0.410278  -3.648 0.000265 ***
## ydline7      -1.816123   0.444684  -4.084 4.43e-05 ***
## ydline8      -2.360079   0.553048  -4.267 1.98e-05 ***
## ydline9      -3.741476   1.027439  -3.642 0.000271 ***
## ydline10     -3.141381   0.743678  -4.224 2.40e-05 ***
## ydline11     -2.745314   0.544744  -5.040 4.66e-07 ***
## ydline12     -3.513069   0.738597  -4.756 1.97e-06 ***
## ydline13     -1.948538   0.441568  -4.413 1.02e-05 ***
## ydline14     -2.689461   0.545705  -4.928 8.29e-07 ***
## ydline15     -2.630885   0.546783  -4.812 1.50e-06 ***
## ydline16     -3.467460   0.739148  -4.691 2.72e-06 ***
## ydline17     -3.357467   0.740504  -4.534 5.79e-06 ***
## ydline18     -3.521817   0.738459  -4.769 1.85e-06 ***
## ydline19    -18.685440 887.410336  -0.021 0.983201    
## ydline20    -18.691579 808.815356  -0.023 0.981563    
## rating        0.004867   0.008609   0.565 0.571860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 971.78  on 1112  degrees of freedom
## Residual deviance: 740.31  on 1092  degrees of freedom
## AIC: 782.31
## 
## Number of Fisher Scoring iterations: 17
newdata <- with(afc.goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata$rankP = predict(afc.logit.conversion, newdata = newdata, type = "response")
newdata
##      rating ydline        rankP
## 1  84.29039      1 5.305024e-01
## 2  84.29039      2 3.914013e-01
## 3  84.29039      3 2.562897e-01
## 4  84.29039      4 2.058910e-01
## 5  84.29039      5 2.078917e-01
## 6  84.29039      6 2.019201e-01
## 7  84.29039      7 1.552555e-01
## 8  84.29039      8 9.639677e-02
## 9  84.29039      9 2.610148e-02
## 10 84.29039     10 4.656509e-02
## 11 84.29039     11 6.766313e-02
## 12 84.29039     12 3.258082e-02
## 13 84.29039     13 1.386704e-01
## 14 84.29039     14 7.127280e-02
## 15 84.29039     15 7.524885e-02
## 16 84.29039     16 3.404944e-02
## 17 84.29039     17 3.785849e-02
## 18 84.29039     18 3.230623e-02
## 19 84.29039     19 8.671028e-09
## 20 84.29039     20 8.617953e-09

This block created a goal line data set for the afc and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.

nfc.goal <- subset(nfc, ydline<21 & (play_type == "1" | play_type == "2"))
nfc.goal$conversion = factor(nfc.goal$conversion)
nfc.goal$ydline = factor(nfc.goal$ydline)
nfc.logit.conversion = glm(conversion ~ ydline + rating, data = nfc.goal, family = "binomial")
summary(nfc.logit.conversion)
## 
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial", 
##     data = nfc.goal)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6144  -0.5389  -0.3490  -0.1852   2.7511  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.298344   0.790837  -1.642 0.100645    
## ydline2      -1.463068   0.351656  -4.161 3.18e-05 ***
## ydline3      -1.345209   0.353795  -3.802 0.000143 ***
## ydline4      -1.554167   0.405764  -3.830 0.000128 ***
## ydline5      -1.863095   0.361249  -5.157 2.50e-07 ***
## ydline6      -2.724171   0.477999  -5.699 1.20e-08 ***
## ydline7      -1.665375   0.390954  -4.260 2.05e-05 ***
## ydline8      -2.850902   0.563682  -5.058 4.24e-07 ***
## ydline9      -2.736659   0.477559  -5.731 1.00e-08 ***
## ydline10     -2.600545   0.452048  -5.753 8.78e-09 ***
## ydline11     -3.315719   0.633830  -5.231 1.68e-07 ***
## ydline12     -3.293192   0.630128  -5.226 1.73e-07 ***
## ydline13     -2.283332   0.434579  -5.254 1.49e-07 ***
## ydline14     -3.205377   0.557178  -5.753 8.77e-09 ***
## ydline15     -4.042986   0.748099  -5.404 6.50e-08 ***
## ydline16     -4.597774   1.028448  -4.471 7.80e-06 ***
## ydline17     -3.623489   0.624360  -5.804 6.49e-09 ***
## ydline18     -4.230102   0.743765  -5.687 1.29e-08 ***
## ydline19    -18.177357 540.910547  -0.034 0.973192    
## ydline20     -3.398443   0.556518  -6.107 1.02e-09 ***
## rating        0.021094   0.008762   2.407 0.016064 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1054.03  on 1151  degrees of freedom
## Residual deviance:  810.01  on 1131  degrees of freedom
## AIC: 852.01
## 
## Number of Fisher Scoring iterations: 16
newdata1 <- with(nfc.goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata1$rankP = predict(nfc.logit.conversion, newdata = newdata, type = "response")
newdata1
##      rating ydline        rankP
## 1  90.17917      1 6.176666e-01
## 2  90.17917      2 2.722153e-01
## 3  90.17917      3 2.961805e-01
## 4  90.17917      4 2.545461e-01
## 5  90.17917      5 2.004572e-01
## 6  90.17917      6 9.582364e-02
## 7  90.17917      7 2.340254e-01
## 8  90.17917      8 8.539175e-02
## 9  90.17917      9 9.474712e-02
## 10 90.17917     10 1.070831e-01
## 11 90.17917     11 5.540625e-02
## 12 90.17917     12 5.659706e-02
## 13 90.17917     13 1.414042e-01
## 14 90.17917     14 6.147257e-02
## 15 90.17917     15 2.756302e-02
## 16 90.17917     16 1.601448e-02
## 17 90.17917     17 4.133496e-02
## 18 90.17917     18 2.296735e-02
## 19 90.17917     19 2.060564e-08
## 20 90.17917     20 5.123264e-02

This block created a goal line data set for the nfc and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.

goal <- subset(mydata, ydline<21 & (play_type == "1" | play_type == "2"))
goal$conversion = factor(goal$conversion)
goal$ydline = factor(goal$ydline)
logit.conversion = glm(conversion ~ ydline + rating, data = goal, family = "binomial")
summary(logit.conversion)
## 
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial", 
##     data = goal)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4476  -0.5528  -0.3366  -0.2245   2.7479  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.879433   0.523955  -1.678   0.0933 .  
## ydline2      -0.975063   0.235758  -4.136 3.54e-05 ***
## ydline3      -1.212412   0.261923  -4.629 3.68e-06 ***
## ydline4      -1.470259   0.286369  -5.134 2.83e-07 ***
## ydline5      -1.628700   0.263424  -6.183 6.30e-10 ***
## ydline6      -2.052082   0.308258  -6.657 2.79e-11 ***
## ydline7      -1.668872   0.288057  -5.794 6.89e-09 ***
## ydline8      -2.577247   0.394113  -6.539 6.18e-11 ***
## ydline9      -2.863259   0.413862  -6.918 4.57e-12 ***
## ydline10     -2.652898   0.372740  -7.117 1.10e-12 ***
## ydline11     -2.971288   0.412374  -7.205 5.79e-13 ***
## ydline12     -3.356310   0.475785  -7.054 1.74e-12 ***
## ydline13     -2.063779   0.307676  -6.708 1.98e-11 ***
## ydline14     -2.906284   0.388833  -7.474 7.76e-14 ***
## ydline15     -3.239936   0.439197  -7.377 1.62e-13 ***
## ydline16     -3.934607   0.598826  -6.571 5.01e-11 ***
## ydline17     -3.437374   0.474740  -7.241 4.47e-13 ***
## ydline18     -3.847120   0.523256  -7.352 1.95e-13 ***
## ydline19    -17.890229 381.655922  -0.047   0.9626    
## ydline20     -3.804100   0.524359  -7.255 4.02e-13 ***
## rating        0.013807   0.005884   2.347   0.0189 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2026.5  on 2264  degrees of freedom
## Residual deviance: 1571.2  on 2244  degrees of freedom
## AIC: 1613.2
## 
## Number of Fisher Scoring iterations: 16
newdata2 <- with(goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata2$rankP = predict(logit.conversion, newdata = newdata, type = "response")
newdata2
##      rating ydline        rankP
## 1  87.28547      1 5.706224e-01
## 2  87.28547      2 3.338836e-01
## 3  87.28547      3 2.833264e-01
## 4  87.28547      4 2.339988e-01
## 5  87.28547      5 2.068024e-01
## 6  87.28547      6 1.458297e-01
## 7  87.28547      7 2.002903e-01
## 8  87.28547      8 9.171640e-02
## 9  87.28547      9 7.051086e-02
## 10 87.28547     10 8.560591e-02
## 11 87.28547     11 6.375095e-02
## 12 87.28547     12 4.428054e-02
## 13 87.28547     13 1.443787e-01
## 14 87.28547     14 6.774265e-02
## 15 87.28547     15 4.947510e-02
## 16 87.28547     16 2.532738e-02
## 17 87.28547     17 4.097389e-02
## 18 87.28547     18 2.757914e-02
## 19 87.28547     19 2.258821e-08
## 20 87.28547     20 2.875661e-02

This block created a goal line data set for the nfl and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.

newdata21 = with(goal, data.frame(rating = rep(seq(from = 63, to = 108, length.out = 32), 20), ydline = factor(rep(1:20, each = 32))))
newdata211 = cbind(newdata21, predict(logit.conversion, newdata = newdata21, type = "link", se =TRUE))
newdata211 = within(newdata211, {PredictedProb = plogis(fit)
                                  LL = plogis(fit -(1.96*se.fit))
                                  UL = plogis(fit -(1.96*se.fit))})
ggplot(newdata211, aes(x = rating, y = PredictedProb)) + geom_ribbon(aes(ymin= LL, ymax = UL), alpha = 0.2) + geom_line(aes(colour = ydline), size = 1)

This block created a graph which can be used to evaluate success rates by quarterback rating at each yardline. Future plans include evaluating play call by quarterback rating and determining whether passing or running is more successful from any individual yardline. It is intuitive that the farther the offense is from the endzone, the more likely that a pass will work vs. a run. But it would be good to know exactly where the inflection point was.