DATS 6101: Assignment 4

Problem A:

Using the NFL FGA Data from 2008 in the Lecture 7 folder create a model that can “accurately” (just do your best) classify missed versus good (made) field goals using logistic regression. In doing so also answer the following questions:

Which variables contribute most to your model
What is your “hit rate”, using 50% as the break point?
What is your ROC/AUC?
Holding all other variables in your model constant (or at an average) what are the chances of making a 50 yard field goal?
What additional information might be helpful?

## 'data.frame':    1039 obs. of  23 variables:
##  $ GameDate: int  20081130 20081005 20081228 20081012 20080907 20081116 20081123 20081207 20081130 20090118 ...
##  $ AwayTeam: chr  "IND" "IND" "TEN" "BAL" ...
##  $ HomeTeam: chr  "CLE" "HOU" "IND" "IND" ...
##  $ qtr     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ min     : int  47 54 45 45 50 50 46 52 46 49 ...
##  $ sec     : int  2 47 20 42 56 43 45 34 12 46 ...
##  $ kickteam: chr  "IND" "IND" "IND" "IND" ...
##  $ def     : chr  "CLE" "HOU" "TEN" "BAL" ...
##  $ down    : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ togo    : int  11 3 3 1 21 7 5 7 7 9 ...
##  $ kicker  : int  15 15 15 15 15 15 15 18 18 29 ...
##  $ ydline  : int  12 28 10 19 21 22 5 8 20 27 ...
##  $ name    : chr  "A.Vinatieri" "A.Vinatieri" "A.Vinatieri" "A.Vinatieri" ...
##  $ distance: int  30 46 28 37 39 40 23 26 38 45 ...
##  $ homekick: int  0 0 1 1 1 1 0 0 0 0 ...
##  $ kickdiff: int  -3 0 7 14 0 -3 0 0 -3 -7 ...
##  $ timerem : int  2822 3287 2720 2742 3056 3043 2805 3154 2772 2986 ...
##  $ offscore: int  0 0 7 14 0 0 0 0 0 0 ...
##  $ defscore: int  3 0 0 0 0 3 0 0 3 7 ...
##  $ season  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ GOOD    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Missed  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Blocked : int  0 0 0 0 0 0 0 0 0 0 ...

## Classes 'tbl_df', 'tbl' and 'data.frame':    1039 obs. of  9 variables:
##  $ qtr     : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ down    : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ togo    : int  11 3 3 1 21 7 5 7 7 9 ...
##  $ ydline  : int  12 28 10 19 21 22 5 8 20 27 ...
##  $ distance: int  30 46 28 37 39 40 23 26 38 45 ...
##  $ homekick: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
##  $ kickdiff: int  -3 0 7 14 0 -3 0 0 -3 -7 ...
##  $ offscore: int  0 0 7 14 0 0 0 0 0 0 ...
##  $ GOOD    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Full Model

## 
## Call:
## glm(formula = GOOD ~ ., family = binomial(link = "logit"), data = nflDataFrame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8657   0.1951   0.3343   0.5523   1.4834  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 12.46186    5.04754   2.469   0.0136 *
## qtr2        -0.26606    0.31755  -0.838   0.4021  
## qtr3        -0.60208    0.39584  -1.521   0.1283  
## qtr4        -0.35761    0.46291  -0.773   0.4398  
## qtr5         0.43947    1.18511   0.371   0.7108  
## down2       -0.90693    0.69873  -1.298   0.1943  
## down3       -0.33733    0.68786  -0.490   0.6238  
## down4        0.04738    0.48751   0.097   0.9226  
## togo         0.01079    0.02385   0.452   0.6510  
## ydline       0.31175    0.27489   1.134   0.2568  
## distance    -0.43379    0.27619  -1.571   0.1163  
## homekick1   -0.22789    0.20365  -1.119   0.2631  
## kickdiff    -0.01500    0.01604  -0.935   0.3499  
## offscore     0.02944    0.02207   1.334   0.1822  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 809.65  on 1036  degrees of freedom
## Residual deviance: 676.28  on 1023  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 704.28
## 
## Number of Fisher Scoring iterations: 6

If we try to apply backward elimination with a p-value of 0.05 like our textbook recommends it wouldn’t work with the above values.

Removal of some coefficients:

So what if I adjusted logistic model to just the following 3 coefficients:

## 
## Call:
## glm(formula = GOOD ~ distance + qtr + down, family = binomial(link = "logit"), 
##     data = nflDataFrame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9278   0.1999   0.3437   0.5608   1.4498  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.95464    0.79120   8.790   <2e-16 ***
## distance    -0.11886    0.01259  -9.438   <2e-16 ***
## qtr2        -0.16386    0.29392  -0.558    0.577    
## qtr3        -0.32916    0.32269  -1.020    0.308    
## qtr4         0.06100    0.31978   0.191    0.849    
## qtr5         0.77574    1.11946   0.693    0.488    
## down2       -0.87429    0.69347  -1.261    0.207    
## down3       -0.49106    0.67782  -0.724    0.469    
## down4       -0.14141    0.47347  -0.299    0.765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 809.65  on 1036  degrees of freedom
## Residual deviance: 681.37  on 1028  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 699.37
## 
## Number of Fisher Scoring iterations: 6

The AIC is only about 4 points lower; not much better; but I think distance will be a heavy factor in the model. What if I just do ydline?

## 
## Call:
## glm(formula = GOOD ~ ydline, family = binomial(link = "logit"), 
##     data = nflDataFrame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9368   0.2077   0.3504   0.5810   1.2128  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.53574    0.32880  13.795   <2e-16 ***
## ydline      -0.11843    0.01231  -9.622   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 817.72  on 1038  degrees of freedom
## Residual deviance: 687.43  on 1037  degrees of freedom
## AIC: 691.43
## 
## Number of Fisher Scoring iterations: 6

## (Intercept)      ydline 
##   93.292507    0.888318

The AIC dropped to 691.43; that’s alittle better. What if I just do distance though?

## 
## Call:
## glm(formula = GOOD ~ distance, family = binomial(link = "logit"), 
##     data = nflDataFrame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9526   0.2039   0.3478   0.5826   1.2309  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.76271    0.54443  12.422   <2e-16 ***
## distance    -0.12084    0.01229  -9.836   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 817.72  on 1038  degrees of freedom
## Residual deviance: 686.93  on 1037  degrees of freedom
## AIC: 690.93
## 
## Number of Fisher Scoring iterations: 6

## (Intercept)    distance 
## 864.9812621   0.8861796

That’s the best AIC model I think; 690.93. Distance is significant. But the factors of distance and ydline seem to be really close. What if we compare them?

anova(newModel2, newModel3)

## Analysis of Deviance Table
## 
## Model 1: GOOD ~ ydline
## Model 2: GOOD ~ distance
##   Resid. Df Resid. Dev Df Deviance
## 1      1037     687.43            
## 2      1037     686.93  0  0.50303

# newModel3 is the best to use

Based on the results above between the 2 models I would say: - Distance contributes most to the model

And if I could add another ydline would be the 2nd most contributing factor/significant.

# Checking goodness of fit for the model
hoslem.test(nflDataFrame$GOOD, fitted(newModel3))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  nflDataFrame$GOOD, fitted(newModel3)
## X-squared = 1039, df = 8, p-value < 2.2e-16

# p-value is above 0.05 as expected, most likely due to other factors

What is your “hit rate”, using 50% as the break point?

model4hit <- mean(pred.model2!=as.numeric(nflDataFrame$GOOD))
percent(pnorm(model4hit))

## [1] "80.9%"

What is your ROC/AUC?

prob <- prediction(pred.model,nflDataFrame$GOOD)
prob.perform <- performance(prob, measure = "tpr", x.measure = "fpr")
AUC <- performance(prob, measure = "auc")
plot(prob.perform, main = "ROC of the Prediction Model")

AUC

## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.78245
## 
## 
## Slot "alpha.values":
## list()

So the AUC comes out to be around 0.78245, which is fairly close to 0.8. Not ideal, but close.

If we hold all other variables in your model constant (or at an average) what are the chances of making a 50 yard field goal?

goal50yds <- data.frame(GOOD=as.factor(1),distance=as.numeric(50))
lm1 <- lm(GOOD~distance, data = goal50yds)
pred50yds <- predict(lm1,goal50yds,type='response')
percent(pnorm(pred50yds))

## [1] "84.1%"

84.1% of making a 50 yard field goal if the model remains constant.

What additional information might be helpful?

I think one of the things that would be more helpful for this type of dataset would just be more data in general. Because there are so many different categories and contributing factors that could possibly affect a good goal such as different quarters and time, it may make the data more normalized, create a stronger model, and provide a more ideal ROC curve.
Another helpful peice of information would be a more specific question. For example, what are the chances of a field goal with less than a minute left on the clock in quarter 4? Again, because the data is time-scaled, I think specifiying time for a question would help narrow down for a more precise analysis.