Using the NFL FGA Data from 2008 in the Lecture 7 folder create a model that can “accurately” (just do your best) classify missed versus good (made) field goals using logistic regression. In doing so also answer the following questions:
## 'data.frame': 1039 obs. of 23 variables:
## $ GameDate: int 20081130 20081005 20081228 20081012 20080907 20081116 20081123 20081207 20081130 20090118 ...
## $ AwayTeam: chr "IND" "IND" "TEN" "BAL" ...
## $ HomeTeam: chr "CLE" "HOU" "IND" "IND" ...
## $ qtr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ min : int 47 54 45 45 50 50 46 52 46 49 ...
## $ sec : int 2 47 20 42 56 43 45 34 12 46 ...
## $ kickteam: chr "IND" "IND" "IND" "IND" ...
## $ def : chr "CLE" "HOU" "TEN" "BAL" ...
## $ down : int 4 4 4 4 4 4 4 4 4 4 ...
## $ togo : int 11 3 3 1 21 7 5 7 7 9 ...
## $ kicker : int 15 15 15 15 15 15 15 18 18 29 ...
## $ ydline : int 12 28 10 19 21 22 5 8 20 27 ...
## $ name : chr "A.Vinatieri" "A.Vinatieri" "A.Vinatieri" "A.Vinatieri" ...
## $ distance: int 30 46 28 37 39 40 23 26 38 45 ...
## $ homekick: int 0 0 1 1 1 1 0 0 0 0 ...
## $ kickdiff: int -3 0 7 14 0 -3 0 0 -3 -7 ...
## $ timerem : int 2822 3287 2720 2742 3056 3043 2805 3154 2772 2986 ...
## $ offscore: int 0 0 7 14 0 0 0 0 0 0 ...
## $ defscore: int 3 0 0 0 0 3 0 0 3 7 ...
## $ season : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ GOOD : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Missed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Blocked : int 0 0 0 0 0 0 0 0 0 0 ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 1039 obs. of 9 variables:
## $ qtr : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ down : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ togo : int 11 3 3 1 21 7 5 7 7 9 ...
## $ ydline : int 12 28 10 19 21 22 5 8 20 27 ...
## $ distance: int 30 46 28 37 39 40 23 26 38 45 ...
## $ homekick: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
## $ kickdiff: int -3 0 7 14 0 -3 0 0 -3 -7 ...
## $ offscore: int 0 0 7 14 0 0 0 0 0 0 ...
## $ GOOD : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##
## Call:
## glm(formula = GOOD ~ ., family = binomial(link = "logit"), data = nflDataFrame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8657 0.1951 0.3343 0.5523 1.4834
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.46186 5.04754 2.469 0.0136 *
## qtr2 -0.26606 0.31755 -0.838 0.4021
## qtr3 -0.60208 0.39584 -1.521 0.1283
## qtr4 -0.35761 0.46291 -0.773 0.4398
## qtr5 0.43947 1.18511 0.371 0.7108
## down2 -0.90693 0.69873 -1.298 0.1943
## down3 -0.33733 0.68786 -0.490 0.6238
## down4 0.04738 0.48751 0.097 0.9226
## togo 0.01079 0.02385 0.452 0.6510
## ydline 0.31175 0.27489 1.134 0.2568
## distance -0.43379 0.27619 -1.571 0.1163
## homekick1 -0.22789 0.20365 -1.119 0.2631
## kickdiff -0.01500 0.01604 -0.935 0.3499
## offscore 0.02944 0.02207 1.334 0.1822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 809.65 on 1036 degrees of freedom
## Residual deviance: 676.28 on 1023 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 704.28
##
## Number of Fisher Scoring iterations: 6
If we try to apply backward elimination with a p-value of 0.05 like our textbook recommends it wouldn’t work with the above values.
So what if I adjusted logistic model to just the following 3 coefficients:
##
## Call:
## glm(formula = GOOD ~ distance + qtr + down, family = binomial(link = "logit"),
## data = nflDataFrame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9278 0.1999 0.3437 0.5608 1.4498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.95464 0.79120 8.790 <2e-16 ***
## distance -0.11886 0.01259 -9.438 <2e-16 ***
## qtr2 -0.16386 0.29392 -0.558 0.577
## qtr3 -0.32916 0.32269 -1.020 0.308
## qtr4 0.06100 0.31978 0.191 0.849
## qtr5 0.77574 1.11946 0.693 0.488
## down2 -0.87429 0.69347 -1.261 0.207
## down3 -0.49106 0.67782 -0.724 0.469
## down4 -0.14141 0.47347 -0.299 0.765
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 809.65 on 1036 degrees of freedom
## Residual deviance: 681.37 on 1028 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 699.37
##
## Number of Fisher Scoring iterations: 6
The AIC is only about 4 points lower; not much better; but I think distance will be a heavy factor in the model. What if I just do ydline?
##
## Call:
## glm(formula = GOOD ~ ydline, family = binomial(link = "logit"),
## data = nflDataFrame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9368 0.2077 0.3504 0.5810 1.2128
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.53574 0.32880 13.795 <2e-16 ***
## ydline -0.11843 0.01231 -9.622 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 817.72 on 1038 degrees of freedom
## Residual deviance: 687.43 on 1037 degrees of freedom
## AIC: 691.43
##
## Number of Fisher Scoring iterations: 6
## (Intercept) ydline
## 93.292507 0.888318
The AIC dropped to 691.43; that’s alittle better. What if I just do distance though?
##
## Call:
## glm(formula = GOOD ~ distance, family = binomial(link = "logit"),
## data = nflDataFrame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9526 0.2039 0.3478 0.5826 1.2309
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.76271 0.54443 12.422 <2e-16 ***
## distance -0.12084 0.01229 -9.836 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 817.72 on 1038 degrees of freedom
## Residual deviance: 686.93 on 1037 degrees of freedom
## AIC: 690.93
##
## Number of Fisher Scoring iterations: 6
## (Intercept) distance
## 864.9812621 0.8861796
That’s the best AIC model I think; 690.93. Distance is significant. But the factors of distance and ydline seem to be really close. What if we compare them?
anova(newModel2, newModel3)
## Analysis of Deviance Table
##
## Model 1: GOOD ~ ydline
## Model 2: GOOD ~ distance
## Resid. Df Resid. Dev Df Deviance
## 1 1037 687.43
## 2 1037 686.93 0 0.50303
# newModel3 is the best to use
Based on the results above between the 2 models I would say: - Distance contributes most to the model
And if I could add another ydline would be the 2nd most contributing factor/significant.
# Checking goodness of fit for the model
hoslem.test(nflDataFrame$GOOD, fitted(newModel3))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: nflDataFrame$GOOD, fitted(newModel3)
## X-squared = 1039, df = 8, p-value < 2.2e-16
# p-value is above 0.05 as expected, most likely due to other factors
model4hit <- mean(pred.model2!=as.numeric(nflDataFrame$GOOD))
percent(pnorm(model4hit))
## [1] "80.9%"
prob <- prediction(pred.model,nflDataFrame$GOOD)
prob.perform <- performance(prob, measure = "tpr", x.measure = "fpr")
AUC <- performance(prob, measure = "auc")
plot(prob.perform, main = "ROC of the Prediction Model")
AUC
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.78245
##
##
## Slot "alpha.values":
## list()
So the AUC comes out to be around 0.78245, which is fairly close to 0.8. Not ideal, but close.
goal50yds <- data.frame(GOOD=as.factor(1),distance=as.numeric(50))
lm1 <- lm(GOOD~distance, data = goal50yds)
pred50yds <- predict(lm1,goal50yds,type='response')
percent(pnorm(pred50yds))
## [1] "84.1%"
84.1% of making a 50 yard field goal if the model remains constant.