The goal of this particular project was to evaluate the likelihood on any individual play of a conversion into the endzone (i.e. a touchdown or a two point conversion). We elected to examine all individual yardlines between the 20 and the 1, inclusive, for the offense and generate an expected success rate. The data set was pulled from advanced football analytics.com (archive.advancedfootballanalytics.com/2010/04/play-by-play-data.html). Some general scrubbing took place of the data in excel to get it into a workable format. Since I am planning to do some graphical analysis, I have loaded the ggplot2 library.
library(ggplot2)
mydata <- read.csv("c:/users/daniel/desktop/2012_nfl_pbp_data.1.csv")
afc <- subset(mydata, conference == "afc")
nfc <- subset(mydata, conference == "nfc")
ari <- subset(nfc, off == "ARI")
sea <- subset(nfc, off == "SEA")
stl <- subset(nfc, off == "STL")
sf <- subset(nfc, off == "SF")
atl <- subset(nfc, off == "ATL")
no <- subset(nfc, off == "NO")
tb <- subset(nfc, off == "TB")
car <- subset(nfc, off == "CAR")
gb <- subset(nfc, off == "GB")
chi <- subset(nfc, off == "CHI")
min <- subset(nfc, off == "MIN")
det <- subset(nfc, off == "DET")
phi <- subset(nfc, off == "PHI")
dal <- subset(nfc, off == "DAL")
was <- subset(nfc, off == "WAS")
nyg <- subset(nfc, off == "NYG")
nyj <- subset(afc, off == "NYJ")
ne <- subset(afc, off == "NE")
buf <- subset(afc, off == "BUF")
mia <- subset(afc, off == "MIA")
hou <- subset(afc, off == "HOU")
ind <- subset(afc, off == "IND")
jac <- subset(afc, off == "JAC")
ten <- subset(afc, off == "TEN")
cle <- subset(afc, off == "CLE")
bal <- subset(afc, off == "BAL")
cin <- subset(afc, off == "CIN")
pit <- subset(afc, off == "PIT")
den <- subset(afc, off == "DEN")
oak <- subset(afc, off == "OAK")
kc <- subset(afc, off == "KC")
sd <- subset(afc, off == "SD")
The goal of that first section was to subdivide the data into multiple data sets, both by conference and team. While no work was done on teams specifically in this project, it is helpful to have it subdivided already in the event that I need to do some calculations or different models.
afc.goal <- subset(afc, ydline<21 & (play_type == "1" | play_type == "2"))
afc.goal$conversion = factor(afc.goal$conversion)
afc.goal$ydline = factor(afc.goal$ydline)
afc.logit.conversion = glm(conversion ~ ydline + rating, data = afc.goal, family = "binomial")
summary(afc.logit.conversion)
##
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial",
## data = afc.goal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.27408 -0.56318 -0.29210 -0.00014 2.67666
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.288086 0.754713 -0.382 0.702672
## ydline2 -0.563587 0.321591 -1.752 0.079688 .
## ydline3 -1.187504 0.403881 -2.940 0.003280 **
## ydline4 -1.472035 0.410826 -3.583 0.000340 ***
## ydline5 -1.459842 0.393559 -3.709 0.000208 ***
## ydline6 -1.496497 0.410278 -3.648 0.000265 ***
## ydline7 -1.816123 0.444684 -4.084 4.43e-05 ***
## ydline8 -2.360079 0.553048 -4.267 1.98e-05 ***
## ydline9 -3.741476 1.027439 -3.642 0.000271 ***
## ydline10 -3.141381 0.743678 -4.224 2.40e-05 ***
## ydline11 -2.745314 0.544744 -5.040 4.66e-07 ***
## ydline12 -3.513069 0.738597 -4.756 1.97e-06 ***
## ydline13 -1.948538 0.441568 -4.413 1.02e-05 ***
## ydline14 -2.689461 0.545705 -4.928 8.29e-07 ***
## ydline15 -2.630885 0.546783 -4.812 1.50e-06 ***
## ydline16 -3.467460 0.739148 -4.691 2.72e-06 ***
## ydline17 -3.357467 0.740504 -4.534 5.79e-06 ***
## ydline18 -3.521817 0.738459 -4.769 1.85e-06 ***
## ydline19 -18.685440 887.410336 -0.021 0.983201
## ydline20 -18.691579 808.815356 -0.023 0.981563
## rating 0.004867 0.008609 0.565 0.571860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 971.78 on 1112 degrees of freedom
## Residual deviance: 740.31 on 1092 degrees of freedom
## AIC: 782.31
##
## Number of Fisher Scoring iterations: 17
newdata <- with(afc.goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata$rankP = predict(afc.logit.conversion, newdata = newdata, type = "response")
newdata
## rating ydline rankP
## 1 84.29039 1 5.305024e-01
## 2 84.29039 2 3.914013e-01
## 3 84.29039 3 2.562897e-01
## 4 84.29039 4 2.058910e-01
## 5 84.29039 5 2.078917e-01
## 6 84.29039 6 2.019201e-01
## 7 84.29039 7 1.552555e-01
## 8 84.29039 8 9.639677e-02
## 9 84.29039 9 2.610148e-02
## 10 84.29039 10 4.656509e-02
## 11 84.29039 11 6.766313e-02
## 12 84.29039 12 3.258082e-02
## 13 84.29039 13 1.386704e-01
## 14 84.29039 14 7.127280e-02
## 15 84.29039 15 7.524885e-02
## 16 84.29039 16 3.404944e-02
## 17 84.29039 17 3.785849e-02
## 18 84.29039 18 3.230623e-02
## 19 84.29039 19 8.671028e-09
## 20 84.29039 20 8.617953e-09
This block created a goal line data set for the afc and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.
nfc.goal <- subset(nfc, ydline<21 & (play_type == "1" | play_type == "2"))
nfc.goal$conversion = factor(nfc.goal$conversion)
nfc.goal$ydline = factor(nfc.goal$ydline)
nfc.logit.conversion = glm(conversion ~ ydline + rating, data = nfc.goal, family = "binomial")
summary(nfc.logit.conversion)
##
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial",
## data = nfc.goal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6144 -0.5389 -0.3490 -0.1852 2.7511
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.298344 0.790837 -1.642 0.100645
## ydline2 -1.463068 0.351656 -4.161 3.18e-05 ***
## ydline3 -1.345209 0.353795 -3.802 0.000143 ***
## ydline4 -1.554167 0.405764 -3.830 0.000128 ***
## ydline5 -1.863095 0.361249 -5.157 2.50e-07 ***
## ydline6 -2.724171 0.477999 -5.699 1.20e-08 ***
## ydline7 -1.665375 0.390954 -4.260 2.05e-05 ***
## ydline8 -2.850902 0.563682 -5.058 4.24e-07 ***
## ydline9 -2.736659 0.477559 -5.731 1.00e-08 ***
## ydline10 -2.600545 0.452048 -5.753 8.78e-09 ***
## ydline11 -3.315719 0.633830 -5.231 1.68e-07 ***
## ydline12 -3.293192 0.630128 -5.226 1.73e-07 ***
## ydline13 -2.283332 0.434579 -5.254 1.49e-07 ***
## ydline14 -3.205377 0.557178 -5.753 8.77e-09 ***
## ydline15 -4.042986 0.748099 -5.404 6.50e-08 ***
## ydline16 -4.597774 1.028448 -4.471 7.80e-06 ***
## ydline17 -3.623489 0.624360 -5.804 6.49e-09 ***
## ydline18 -4.230102 0.743765 -5.687 1.29e-08 ***
## ydline19 -18.177357 540.910547 -0.034 0.973192
## ydline20 -3.398443 0.556518 -6.107 1.02e-09 ***
## rating 0.021094 0.008762 2.407 0.016064 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1054.03 on 1151 degrees of freedom
## Residual deviance: 810.01 on 1131 degrees of freedom
## AIC: 852.01
##
## Number of Fisher Scoring iterations: 16
newdata1 <- with(nfc.goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata1$rankP = predict(nfc.logit.conversion, newdata = newdata, type = "response")
newdata1
## rating ydline rankP
## 1 90.17917 1 6.176666e-01
## 2 90.17917 2 2.722153e-01
## 3 90.17917 3 2.961805e-01
## 4 90.17917 4 2.545461e-01
## 5 90.17917 5 2.004572e-01
## 6 90.17917 6 9.582364e-02
## 7 90.17917 7 2.340254e-01
## 8 90.17917 8 8.539175e-02
## 9 90.17917 9 9.474712e-02
## 10 90.17917 10 1.070831e-01
## 11 90.17917 11 5.540625e-02
## 12 90.17917 12 5.659706e-02
## 13 90.17917 13 1.414042e-01
## 14 90.17917 14 6.147257e-02
## 15 90.17917 15 2.756302e-02
## 16 90.17917 16 1.601448e-02
## 17 90.17917 17 4.133496e-02
## 18 90.17917 18 2.296735e-02
## 19 90.17917 19 2.060564e-08
## 20 90.17917 20 5.123264e-02
This block created a goal line data set for the nfc and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.
goal <- subset(mydata, ydline<21 & (play_type == "1" | play_type == "2"))
goal$conversion = factor(goal$conversion)
goal$ydline = factor(goal$ydline)
logit.conversion = glm(conversion ~ ydline + rating, data = goal, family = "binomial")
summary(logit.conversion)
##
## Call:
## glm(formula = conversion ~ ydline + rating, family = "binomial",
## data = goal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4476 -0.5528 -0.3366 -0.2245 2.7479
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.879433 0.523955 -1.678 0.0933 .
## ydline2 -0.975063 0.235758 -4.136 3.54e-05 ***
## ydline3 -1.212412 0.261923 -4.629 3.68e-06 ***
## ydline4 -1.470259 0.286369 -5.134 2.83e-07 ***
## ydline5 -1.628700 0.263424 -6.183 6.30e-10 ***
## ydline6 -2.052082 0.308258 -6.657 2.79e-11 ***
## ydline7 -1.668872 0.288057 -5.794 6.89e-09 ***
## ydline8 -2.577247 0.394113 -6.539 6.18e-11 ***
## ydline9 -2.863259 0.413862 -6.918 4.57e-12 ***
## ydline10 -2.652898 0.372740 -7.117 1.10e-12 ***
## ydline11 -2.971288 0.412374 -7.205 5.79e-13 ***
## ydline12 -3.356310 0.475785 -7.054 1.74e-12 ***
## ydline13 -2.063779 0.307676 -6.708 1.98e-11 ***
## ydline14 -2.906284 0.388833 -7.474 7.76e-14 ***
## ydline15 -3.239936 0.439197 -7.377 1.62e-13 ***
## ydline16 -3.934607 0.598826 -6.571 5.01e-11 ***
## ydline17 -3.437374 0.474740 -7.241 4.47e-13 ***
## ydline18 -3.847120 0.523256 -7.352 1.95e-13 ***
## ydline19 -17.890229 381.655922 -0.047 0.9626
## ydline20 -3.804100 0.524359 -7.255 4.02e-13 ***
## rating 0.013807 0.005884 2.347 0.0189 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2026.5 on 2264 degrees of freedom
## Residual deviance: 1571.2 on 2244 degrees of freedom
## AIC: 1613.2
##
## Number of Fisher Scoring iterations: 16
newdata2 <- with(goal, data.frame(rating = mean(rating), ydline = factor(1:20)))
newdata2$rankP = predict(logit.conversion, newdata = newdata, type = "response")
newdata2
## rating ydline rankP
## 1 87.28547 1 5.706224e-01
## 2 87.28547 2 3.338836e-01
## 3 87.28547 3 2.833264e-01
## 4 87.28547 4 2.339988e-01
## 5 87.28547 5 2.068024e-01
## 6 87.28547 6 1.458297e-01
## 7 87.28547 7 2.002903e-01
## 8 87.28547 8 9.171640e-02
## 9 87.28547 9 7.051086e-02
## 10 87.28547 10 8.560591e-02
## 11 87.28547 11 6.375095e-02
## 12 87.28547 12 4.428054e-02
## 13 87.28547 13 1.443787e-01
## 14 87.28547 14 6.774265e-02
## 15 87.28547 15 4.947510e-02
## 16 87.28547 16 2.532738e-02
## 17 87.28547 17 4.097389e-02
## 18 87.28547 18 2.757914e-02
## 19 87.28547 19 2.258821e-08
## 20 87.28547 20 2.875661e-02
This block created a goal line data set for the nfl and then predicted touchdown probability at each yardline from the 1 to the 20. To get expected points, it is only necessary to multiply the percentage generated by 7 as extrapoints are largely automatic. Similarly, this number would be compared with an expected points of 3 as field goals that are snapped from the 20 in are also largely automatic. The quarterback rating used to predict success is based on a league average rating. Success rate by rating and yardline will be graphed later.
newdata21 = with(goal, data.frame(rating = rep(seq(from = 63, to = 108, length.out = 32), 20), ydline = factor(rep(1:20, each = 32))))
newdata211 = cbind(newdata21, predict(logit.conversion, newdata = newdata21, type = "link", se =TRUE))
newdata211 = within(newdata211, {PredictedProb = plogis(fit)
LL = plogis(fit -(1.96*se.fit))
UL = plogis(fit -(1.96*se.fit))})
ggplot(newdata211, aes(x = rating, y = PredictedProb)) + geom_ribbon(aes(ymin= LL, ymax = UL), alpha = 0.2) + geom_line(aes(colour = ydline), size = 1)
This block created a graph which can be used to evaluate success rates by quarterback rating at each yardline. Future plans include evaluating play call by quarterback rating and determining whether passing or running is more successful from any individual yardline. It is intuitive that the farther the offense is from the endzone, the more likely that a pass will work vs. a run. But it would be good to know exactly where the inflection point was.