Hey there! Welcome to another R project of mine. This time, I’m going to be applying machine learning methods to a dataset of NFL field goals from 2000 to 2013. My goal is to predict whether a kick will be good or not based on factors like distance, wind speed, icing (calling a timeout right before the kick), what week in the season it is, etc. Why is this useful? Well, as a playcaller deciding whether to kick, punt, or run a play, it would be strategic to know what kickers in similar situations have done and what the results were.

I originally ran through this exercise in my Sports Analytics class. I wanted to expand the work to other models, however, so here I employ a logistic regression and a Naive Bayes algorithm. If you read my previous project on OK Cupid, these are the same methods I wanted to use to predict matches on dating websites.

The process is as follows:

  1. Import, clean, and split the data
  2. Apply and evaluate two logistic regression attempts
  3. Train a model based on the better logistic regression, apply that model to test data, & evaluate via confusion matrix
  4. Apply and train a Naive Bayes model and evaluate via confusion matrix

The main packages used for this project are tidyverse, caret, rsample, e1071, and C50.

And with that…

I couldn’t tell you what game this refers to, but who doesn’t love the Stooges?

  1. Import, clean, and split the data

So, here is where I’m going to import and clean the data locally: kickdata is the main set we are working with.

After importing, I want to clean the data a bit. This set has columns I don’t really need, such as “blk”, which indicates whether the kick was blocked. Blocks are very rare events, so I’m not sure how useful they are for decision making.

I included a head() call, which shows the first six rows of the dataset so you can see some of the fields.

The dataset has a lot of NA values that should really be 0, such as wind speed in enclosed stadiums. Using kickdata[is.na(kickdata)] <- 0, we fix that.

Inspecting the data, specifically whether a kick was good or not, we find out that about 82% of all kicks are good. This gives us a benchmark for accuracy that we want to beat.

In terms of conducting machine learning, it is important to split the data into train and test sets (and sometimes a validation set). The process goes as follows: train the model on the training set (about 70% of the data), then test the model with the trained parameters on the test set, i.e., data it hasn’t seen yet. We do this because a model needs to be general enough to respond well to new data; the predictive power fades when a model is overfitted to the training data.

For us, we split the data into KTrain and KTest. It is important to notice that we stratify this split based on whether the kick was good. Stratification preserves the roughly 82/18 make/miss ratio in both sets, so random chance can’t dump nearly all of the rare misses into one set and introduce sampling bias.

# Packages used throughout: readxl for import, tidyverse for wrangling,
# rsample for splitting, caret for evaluation, e1071 for Naive Bayes
library(readxl)
library(tidyverse)
library(rsample)
library(caret)
library(e1071)

kick <- read_excel("HW4_FieldGoalData2000-2013.xlsx",
                   sheet = "ReducedData")

# Which columns contain NA values?
colnames(kick)[apply(kick, 2, anyNA)]
## [1] "temp" "wspd" "surf" "pts"  "blk"  "brcv"
# Drop columns we don't need, such as the blocked-kick indicator
kickdata <- subset(kick, select = -c(brcv, blk, surf))

head(kickdata)
## # A tibble: 6 x 39
##     gid   pid  seas  week day   v     h     stad     temp humd   wspd wdir 
##   <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>   <dbl> <chr> <dbl> <chr>
## 1     1    17  2000     1 SUN   SF    ATL   Georgi~    79 "\\N"    NA "\\N"
## 2     1    34  2000     1 SUN   SF    ATL   Georgi~    79 "\\N"    NA "\\N"
## 3     1    52  2000     1 SUN   SF    ATL   Georgi~    79 "\\N"    NA "\\N"
## 4     1    64  2000     1 SUN   SF    ATL   Georgi~    79 "\\N"    NA "\\N"
## 5     1    95  2000     1 SUN   SF    ATL   Georgi~    79 "\\N"    NA "\\N"
## 6     2   241  2000     1 SUN   JAC   CLE   Clevel~    78 63        9 NE   
## # ... with 27 more variables: cond <chr>, ou <dbl>, sprv <dbl>, off <chr>,
## #   def <chr>, dseq <dbl>, len <dbl>, qtr <dbl>, min <dbl>, sec <dbl>,
## #   ptso <dbl>, ptsd <dbl>, timo <dbl>, timd <dbl>, dwn <dbl>, ytg <dbl>,
## #   yfog <dbl>, pts <dbl>, fgxp <chr>, fkicker <chr>, NumKicks <dbl>,
## #   dist <dbl>, good <chr>, ptsv <dbl>, ptsh <dbl>, iced <dbl>,
## #   detail <chr>
# Replace NAs (e.g., wind speed in enclosed stadiums) with 0
kickdata[is.na(kickdata)] <- 0

# Convert the outcome to a factor for modeling
kickdata$good <- factor(kickdata$good)

summary(kickdata$good)
##     N     Y 
##  2397 10956
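To make that benchmark concrete (a quick check I’ve added, not part of the original analysis), the make rate is just the mean of a logical vector:

mean(kickdata$good == "Y")  # ~0.82: the accuracy of always predicting a make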
set.seed(123)

# 70/30 split, stratified on the outcome
split  <- rsample::initial_split(kickdata, prop = 7/10, strata = "good")
KTrain <- training(split)
KTest  <- testing(split)
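As a sanity check on the stratification (again, my addition), the make rate should come out nearly identical in both sets:

# Both should show roughly 18% N / 82% Y, matching the full dataset
prop.table(table(KTrain$good))
prop.table(table(KTest$good))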
  2. Apply and evaluate two logistic regression attempts

Great, now to some actual fun stuff.

Logistic regression is useful for this problem since it is strong at predicting binomial (yes/no) outcomes from numeric inputs. It also handles categorical predictors (factors), such as gender, well.

For the first try we use what we imagine would be important factors: whether the kicker was iced, the distance of the kick, the season (year), and the week. When we look at the summary, the p-value (Pr(>|z|)) for iced is 0.154, well above the 0.05 threshold we ideally want, and week, while technically significant, is the weakest of the remaining predictors. That suggests we can swap these variables out, as they don’t have a huge impact on the overall result. This is counterintuitive for icing, since icing the kicker is a frequently employed strategy.

For the second try, I swapped those two variables out for temperature and wind speed. The summary reveals a much better fit: every variable is now highly significant, and the AIC drops from 7661 to 7611.

The “Estimate” column tells us the direction and magnitude of each variable’s supposed impact, on the log-odds scale. Roughly, every additional yard of distance decreases the log-odds of making the kick by about 0.103 (the dist coefficient, -1.030e-01); note that this is not a direct change in probability. Don’t get hung up on this, however, until we know the model is actually useful.

# First attempt: icing, distance, season, and week as predictors
kicklogit1 <- glm(good ~ iced + dist + seas + week,
                  data = KTrain, family = "binomial")

summary(kicklogit1)
## 
## Call:
## glm(formula = good ~ iced + dist + seas + week, family = "binomial", 
##     data = KTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7449   0.2669   0.4229   0.6645   1.5744  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.061e+02  1.450e+01  -7.319 2.50e-13 ***
## iced        -1.606e-01  1.126e-01  -1.426  0.15396    
## dist        -1.006e-01  3.372e-03 -29.829  < 2e-16 ***
## seas         5.570e-02  7.234e-03   7.699 1.37e-14 ***
## week        -1.580e-02  5.574e-03  -2.835  0.00458 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8799.1  on 9347  degrees of freedom
## Residual deviance: 7651.1  on 9343  degrees of freedom
## AIC: 7661.1
## 
## Number of Fisher Scoring iterations: 5
# Second attempt: swap iced and week for temperature and wind speed
betterlogit <- glm(good ~ temp + dist + seas + wspd,
                   data = KTrain, family = "binomial")
summary(betterlogit)
## 
## Call:
## glm(formula = good ~ temp + dist + seas + wspd, family = "binomial", 
##     data = KTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7987   0.2549   0.4189   0.6628   1.5418  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.060e+02  1.466e+01  -7.228 4.89e-13 ***
## temp         4.505e-03  1.190e-03   3.785 0.000154 ***
## dist        -1.030e-01  3.414e-03 -30.177  < 2e-16 ***
## seas         5.561e-02  7.312e-03   7.606 2.83e-14 ***
## wspd        -3.591e-02  4.712e-03  -7.620 2.54e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8799.1  on 9347  degrees of freedom
## Residual deviance: 7601.3  on 9343  degrees of freedom
## AIC: 7611.3
## 
## Number of Fisher Scoring iterations: 5
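To make those log-odds coefficients more tangible, here is a quick sketch I’ve added (not part of the original analysis) that exponentiates them into odds ratios and compares predicted probabilities at two distances; the values chosen for temp, seas, and wspd are illustrative placeholders:

# Odds ratios: each extra yard of distance multiplies the odds of a make
# by exp(-0.103), i.e. roughly 0.90
exp(coef(betterlogit))

# Predicted make probability for a 30- vs. 50-yard attempt, holding the
# other predictors at illustrative values
newkicks <- data.frame(temp = 60, dist = c(30, 50), seas = 2006, wspd = 5)
predict(betterlogit, newdata = newkicks, type = "response")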
  3. Train a model based on the better logistic regression, apply that model to test data, & evaluate via confusion matrix

Next, we actually apply our logistic regression to the test data to see how we did. The confusion matrix puts accuracy at about 82.6%, barely above the 82.0% no-information rate, and the p-value of 0.20 tells us that difference isn’t significant. In other words, we might as well just assume every kick will be made, since we’d have roughly the same failure rate. If we had meaningfully higher accuracy, we would circle back to the regression output to see which variables had the larger impacts, and strategically prioritize those factors in decision making.

# Predicted probability of a make for each test-set kick
PredictKick <- predict(betterlogit, newdata = KTest, type = "response")

# Convert probabilities to class labels using a 0.45 cutoff
PredictKick <- as.factor(ifelse(PredictKick > .45, "Y", "N"))
summary(PredictKick)
##    N    Y 
##   57 3948
confusionMatrix(PredictKick,KTest$good,positive = "Y")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N   39   18
##          Y  680 3268
##                                           
##                Accuracy : 0.8257          
##                  95% CI : (0.8136, 0.8374)
##     No Information Rate : 0.8205          
##     P-Value [Acc > NIR] : 0.1997          
##                                           
##                   Kappa : 0.0762          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99452         
##             Specificity : 0.05424         
##          Pos Pred Value : 0.82776         
##          Neg Pred Value : 0.68421         
##              Prevalence : 0.82047         
##          Detection Rate : 0.81598         
##    Detection Prevalence : 0.98577         
##       Balanced Accuracy : 0.52438         
##                                           
##        'Positive' Class : Y               
## 
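One thing worth poking at before moving on (my addition, not part of the original analysis): the 0.45 cutoff above was arbitrary, and with an 82/18 class split the model labels almost everything “Y”. A quick sweep over cutoffs shows how accuracy, sensitivity, and specificity trade off:

# Recompute raw probabilities, then evaluate several cutoffs
probs <- predict(betterlogit, newdata = KTest, type = "response")

for (cut in c(0.45, 0.60, 0.75, 0.85, 0.90)) {
  pred <- factor(ifelse(probs > cut, "Y", "N"), levels = c("N", "Y"))
  acc  <- mean(pred == KTest$good)
  sens <- mean(pred[KTest$good == "Y"] == "Y")  # true positive rate
  spec <- mean(pred[KTest$good == "N"] == "N")  # true negative rate
  cat(sprintf("cutoff %.2f: acc %.3f, sens %.3f, spec %.3f\n",
              cut, acc, sens, spec))
}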
  4. Apply and train a Naive Bayes model and evaluate via confusion matrix

Let’s see if we can do better. Another classification algorithm is Naive Bayes, which is commonly used as a spam filter in email systems. It applies Bayes’ theorem under the “naive” assumption that the features are independent of each other given the class; it is a relatively fast and simple yet powerful algorithm that is better explained through Googling.
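As a toy illustration of the idea (made-up numbers, not from this dataset): suppose 82% of kicks are good, and windy conditions occur on 20% of makes but 40% of misses. Bayes’ theorem then gives the chance that a windy kick is good:

# Hypothetical prior and likelihoods, for intuition only
p_good <- 0.82
p_windy_given_good <- 0.20
p_windy_given_miss <- 0.40

# P(good | windy) = P(windy | good) * P(good) / P(windy)
p_windy <- p_windy_given_good * p_good + p_windy_given_miss * (1 - p_good)
p_windy_given_good * p_good / p_windy  # ~0.69

Naive Bayes simply multiplies likelihoods like this across every feature at once, “naively” assuming the features are independent.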

Anyhow, we once again train our model on the training data, then test and evaluate on the testing data. Finally, we run a confusion matrix to a staggering 99% accuracy, which is bananas. While high accuracy seems good on the surface, it’s suspicious, and here the culprit looks less like overfitting and more like target leakage: the call naiveBayes(KTrain, KTrain$good, laplace = 1) passes the entire training frame as the predictor set, which still includes good itself, along with post-kick columns like detail and pts that record the result. The model is effectively being handed the answer. On the plus side, this method does let me feed in a bunch of variables, as opposed to the four or five I used in the logistic regression, but the accuracy number shouldn’t be trusted as-is.

If we fixed the leakage and still saw strong performance, we could backtrack, similarly to the logistic regression, to find which factors are the most important.

# Note: passing all of KTrain as predictors includes the outcome itself --
# see the leakage discussion above
naive_kick <- naiveBayes(KTrain, KTrain$good, laplace = 1)

naive_kick_pred <- predict(naive_kick, KTest, type = "class")

summary(naive_kick_pred)
##    N    Y 
##  722 3283
confusionMatrix(naive_kick_pred,KTest$good,positive = "Y")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N  709   13
##          Y   10 3273
##                                           
##                Accuracy : 0.9943          
##                  95% CI : (0.9914, 0.9964)
##     No Information Rate : 0.8205          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9805          
##                                           
##  Mcnemar's Test P-Value : 0.6767          
##                                           
##             Sensitivity : 0.9960          
##             Specificity : 0.9861          
##          Pos Pred Value : 0.9970          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.8205          
##          Detection Rate : 0.8172          
##    Detection Prevalence : 0.8197          
##       Balanced Accuracy : 0.9911          
##                                           
##        'Positive' Class : Y               
## 
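For reference, here is what a leakage-free version might look like (a sketch I haven’t run against this data; the dropped column names come from the head() output earlier, and other post-outcome columns may be worth auditing too):

# Drop the outcome and columns that record the kick's result
leaky_cols  <- c("good", "detail", "pts")
KTrain_pred <- KTrain[, setdiff(names(KTrain), leaky_cols)]
KTest_pred  <- KTest[, setdiff(names(KTest), leaky_cols)]

naive_kick2      <- naiveBayes(KTrain_pred, KTrain$good, laplace = 1)
naive_kick_pred2 <- predict(naive_kick2, KTest_pred, type = "class")
confusionMatrix(naive_kick_pred2, KTest$good, positive = "Y")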

So, this was my attempt at machine learning. Thanks!