The NBA shot data taken from Kaggle, taken during the 2014-2015 NBA season. From this data I want to make a model to predict the shot result from the available predictor.

1 Libraries and data

library(tidyverse)
library(dplyr)
library(janitor)
library(lubridate)
library(partykit)
library(rpart)
library(caret)
library(randomForest)
library(rpart.plot)
library(e1071)
library(pROC)

nbaa <- as.data.frame(read_csv("nbashot.csv"))
glimpse(nbaa)
## Observations: 128,069
## Variables: 21
## $ GAME_ID                    <dbl> 21400899, 21400899, 21400899, 2140089…
## $ MATCHUP                    <chr> "MAR 04, 2015 - CHA @ BKN", "MAR 04, …
## $ LOCATION                   <chr> "A", "A", "A", "A", "A", "A", "A", "A…
## $ W                          <chr> "W", "W", "W", "W", "W", "W", "W", "W…
## $ FINAL_MARGIN               <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 1…
## $ SHOT_NUMBER                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4…
## $ PERIOD                     <dbl> 1, 1, 1, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4…
## $ GAME_CLOCK                 <drtn> 01:09:00, 00:14:00, 00:00:00, 11:47:…
## $ SHOT_CLOCK                 <dbl> 10.8, 3.4, NA, 10.3, 10.9, 9.1, 14.5,…
## $ DRIBBLES                   <dbl> 2, 0, 3, 2, 2, 2, 11, 3, 0, 0, 8, 14,…
## $ TOUCH_TIME                 <dbl> 1.9, 0.8, 2.7, 1.9, 2.7, 4.4, 9.0, 2.…
## $ SHOT_DIST                  <dbl> 7.7, 28.2, 10.1, 17.2, 3.7, 18.4, 20.…
## $ PTS_TYPE                   <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2…
## $ SHOT_RESULT                <chr> "made", "missed", "missed", "missed",…
## $ CLOSEST_DEFENDER           <chr> "Anderson, Alan", "Bogdanovic, Bojan"…
## $ CLOSEST_DEFENDER_PLAYER_ID <dbl> 101187, 202711, 202711, 203900, 20115…
## $ CLOSE_DEF_DIST             <dbl> 1.3, 6.1, 0.9, 3.4, 1.1, 2.6, 6.1, 2.…
## $ FGM                        <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1…
## $ PTS                        <dbl> 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2…
## $ player_name                <chr> "brian roberts", "brian roberts", "br…
## $ player_id                  <dbl> 203148, 203148, 203148, 203148, 20314…

2 Data Pre-Process

First, I standardize the name of the variables with clean_names function from Janitor library. Then I remove unnecessary variable such as game_id, matchup, closest defender name, and player name. I also remove the fgm and pts vars, since that two variables are directly correlated with shot_result as target variable. At last, I remove any NA on the data.

nbaa <- clean_names(nbaa)

nba <- nbaa %>% 
  select(-game_id,-matchup,-closest_defender,-player_name,-fgm,-pts) %>% 
  mutate(shot_result = factor(shot_result,levels = c("missed","made")),
         game_clock = as.numeric(seconds(hms(game_clock))))

nba <- na.omit(nba)

glimpse(nba)
## Observations: 122,502
## Variables: 15
## $ location                   <chr> "A", "A", "A", "A", "A", "A", "A", "A…
## $ w                          <chr> "W", "W", "W", "W", "W", "W", "W", "W…
## $ final_margin               <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 1, 1,…
## $ shot_number                <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1…
## $ period                     <dbl> 1, 1, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4, 4…
## $ game_clock                 <dbl> 4140, 840, 42420, 38040, 29700, 36900…
## $ shot_clock                 <dbl> 10.8, 3.4, 10.3, 10.9, 9.1, 14.5, 3.4…
## $ dribbles                   <dbl> 2, 0, 2, 2, 2, 11, 3, 0, 0, 8, 14, 2,…
## $ touch_time                 <dbl> 1.9, 0.8, 1.9, 2.7, 4.4, 9.0, 2.5, 0.…
## $ shot_dist                  <dbl> 7.7, 28.2, 17.2, 3.7, 18.4, 20.7, 3.5…
## $ pts_type                   <dbl> 2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 3…
## $ shot_result                <fct> made, missed, missed, missed, missed,…
## $ closest_defender_player_id <dbl> 101187, 202711, 203900, 201152, 10111…
## $ close_def_dist             <dbl> 1.3, 6.1, 3.4, 1.1, 2.6, 6.1, 2.1, 7.…
## $ player_id                  <dbl> 203148, 203148, 203148, 203148, 20314…

3 Train Test Splitting

Splitting the train and test data with 80:20 proportion and assign to two different object for different modelling purposes

set.seed(212)
index1 <- sample(nrow(nba), nrow(nba)*0.8)

nbaTr1 <- nba[index1, ]
nbaTs1 <- nba[-index1, ]

nbaTr2 <- nba[index1, ]
nbaTs2 <- nba[-index1, ]

dim(nbaTr1)
## [1] 98001    15
dim(nbaTs1)
## [1] 24501    15

4 Models

4.1 Random Forest 1

Random Forest model with 4 predictor and without K Fold Cross Validation

set.seed(212)
ctrl <- trainControl(method="none")

# DO NOT RUN
#rfMod <- train(shot_result ~ shot_dist + game_clock + closest_defender_player_id + close_def_dist, data=nbaTr1, method="rf", trControl = ctrl)
rfMod <- readRDS("rf1.rds")

rfProb1 <- predict(rfMod, nbaTr1[,-12],type = "prob")
rfProb2 <- predict(rfMod, nbaTs1[,-12],type = "prob")

rfPred1 <- factor((ifelse(rfProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
rfPred2 <- factor((ifelse(rfProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))

rf1 <- confusionMatrix(rfPred1,nbaTr1$shot_result,positive = "made")
rf2 <- confusionMatrix(rfPred2,nbaTs1$shot_result,positive = "made")

The variable importance of this model

varImp(rfMod)
## rf variable importance
## 
##                            Overall
## shot_dist                   100.00
## game_clock                   98.47
## closest_defender_player_id   71.70
## close_def_dist                0.00

ROC and AUC for Test Data

roc1 <- plot(roc(nbaTs1$shot_result,rfProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases

auc1 <- auc(roc1)
auc1
## Area under the curve: 0.6183

Above are the ROC Curve with Test Data and the AUC = 0.6183

4.2 Random Forest 2 - k fold

Random Forest model with all predictor and with K Fold Cross Validation. 5 segments and 3 iterations.

set.seed(212)
ctrl2 <- trainControl(method="repeatedcv", number=5, repeats=3) 

# DO NOT RUN
# rfMod2<- train(shot_result ~ ., data=nbaTr1, method="rf", trControl = ctrl2)
rfMod2 <- readRDS("rf2.rds")

rfProb3 <- predict(rfMod2, nbaTr1[,-12],type = "prob")
rfProb4 <- predict(rfMod2, nbaTs1[,-12],type = "prob")

rfPred3 <- factor((ifelse(rfProb3[,1] >0.5,"missed","made")),levels = c("missed","made"))
rfPred4 <- factor((ifelse(rfProb4[,1] >0.5,"missed","made")),levels = c("missed","made"))

rf3 <- confusionMatrix(rfPred3,nbaTr1$shot_result,positive = "made")
rf4 <- confusionMatrix(rfPred4,nbaTs1$shot_result,positive = "made")

The variable importance of this model

options(scipen = 999)
varImp(rfMod2)
## rf variable importance
## 
##                            Overall
## shot_dist                  100.000
## shot_clock                  77.463
## game_clock                  76.457
## close_def_dist              74.824
## closest_defender_player_id  73.747
## player_id                   70.569
## touch_time                  63.327
## final_margin                58.588
## shot_number                 47.921
## dribbles                    23.465
## period                      20.709
## locationH                    7.339
## pts_type                     4.352
## wW                           0.000

ROC and AUC for Test Data

roc2 <- plot(roc(nbaTs1$shot_result,rfProb4[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases

auc2 <- auc(roc2)
auc2
## Area under the curve: 0.6354

Above are the ROC Curve with Test Data and the AUC = 0.6354

4.3 Decision Tree 1

Decision Tree model with 6 maxdepth

dtMod1 <- rpart(shot_result ~ ., nbaTr2,method = "class",control = rpart.control(maxdepth = 6 ,cp = 0))
rpart.plot(dtMod1,type = 3,box.palette = "GnBu",branch.lty = 3,shadow.col = "gray",nn = TRUE,cex = 0.5)

dtProb1 <- predict(dtMod1,nbaTr2[,-12])
dtProb2 <- predict(dtMod1,nbaTs2[,-12]) 

dtPred1 <- factor((ifelse(dtProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
dtPred2 <- factor((ifelse(dtProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))

dt1 <- confusionMatrix(dtPred1,nbaTr2[,12],positive = "made")
dt2 <- confusionMatrix(dtPred2,nbaTs2[,12],positive = "made")

ROC and AUC for Test Data

roc3 <- plot(roc(nbaTs2$shot_result,dtProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases

auc3 <- auc(roc3)
auc3
## Area under the curve: 0.6221

Above are the ROC Curve with Test Data and the AUC = 0.6221

4.4 Decision Tree 2

Decision Tree model with 11 maxdepth

dtMod2 <- rpart(shot_result ~ ., nbaTr2,method = "class",control = rpart.control(maxdepth = 11 ,cp = 0))
rpart.plot(dtMod2,type = 3,box.palette = "GnBu",branch.lty = 3,shadow.col = "gray",nn = TRUE)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

dtProb3 <- predict(dtMod2,nbaTr2[,-12])
dtProb4 <- predict(dtMod2,nbaTs2[,-12])

dtPred3 <- factor((ifelse(dtProb3[,1] >0.5,"missed","made")),levels = c("missed","made"))
dtPred4 <- factor((ifelse(dtProb4[,1] >0.5,"missed","made")),levels = c("missed","made"))

dt3 <- confusionMatrix(dtPred3,nbaTr2[,12],positive = "made")
dt4 <- confusionMatrix(dtPred4,nbaTs2[,12],positive = "made")

ROC and AUC for Test Data

roc4 <- plot(roc(nbaTs2$shot_result,dtProb4[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases

auc4 <- auc(roc4)
auc4
## Area under the curve: 0.6145

Above are the ROC Curve with Test Data and the AUC = 0.6145

4.5 Naive Bayes

4.5.1 Binning and Splitting

Since there are only two categorical vars (location and W), I want to add some vars with binning 4 numeric vars:
1. Shot Distance <4.8
2. Closest Def Distance <2.8
3. Shot Clock <24

4. Game Clock < 41000
I use the variable importance in Random Forest model as a reference to decide which vars to bin, then use Decision Tree to decide the threshold for each vars.

After some preprocessing, I split the train and test data with 80:20 proportion

nbaBn <- nba %>% 
  mutate(sd = ifelse(shot_dist < 4.8,"Short","Long"),
         cdd = ifelse(close_def_dist < 2.8, "Close", "Far"),
         sc = ifelse(shot_clock < 24,"Short","Long"),
         gc = ifelse(game_clock < 4100,"Short", "Long")) %>% 
  select(sd,cdd,sc,gc,shot_result,location,w)

set.seed(212)
index2 <- sample(nrow(nbaBn),nrow(nbaBn)*0.8)
nbTr <- nbaBn[index2,]
nbTs <- nbaBn[-index2,]

4.6 Modeling and Predicting

nbMod1 <- naiveBayes(shot_result ~ ., nbTr)

nbProb1 <- predict(nbMod1, nbTr,type = "raw")
nbProb2 <- predict(nbMod1, nbTs,type = "raw")

nbPred1 <- factor((ifelse(nbProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
nbPred2 <- factor((ifelse(nbProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))

nb1 <- confusionMatrix(nbPred1,nbTr$shot_result,positive = "made")
nb2 <- confusionMatrix(nbPred2,nbTs$shot_result,positive = "made")

ROC and AUC for Test Data

roc5 <- plot(roc(nbTs$shot_result,nbProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases

auc5 <- auc(roc5)
auc5
## Area under the curve: 0.5985

Above are the ROC Curve with Test Data and the AUC = 0.5985

5 Evaluate

Below is data frame consist of parameter to evaluate between each model.

mod.name <- c("RandomForest","RandomForest-KFold","DecisionTree-6level","DecisionTree-11level","NaiveBayes")
AccuracyTrain <- c(rf1$overall[1],rf3$overall[1],dt1$overall[1],dt3$overall[1],nb1$overall[1])
AccuracyTest <- c(rf2$overall[1],rf4$overall[1],dt2$overall[1],dt4$overall[1],nb2$overall[1])
SensitivityTest <- c(rf2$byClass[1],rf4$byClass[1],dt2$byClass[1],dt4$byClass[1],nb2$byClass[1])
SpecificityTest <- c(rf2$byClass[2],rf4$byClass[2],dt2$byClass[2],dt4$byClass[2],nb2$byClass[2])
AucTest <- c(auc1,auc2,auc3,auc4,auc5)

eval <- data.frame(mod.name,AccuracyTrain,AccuracyTest,SensitivityTest,SpecificityTest,AucTest)
eval[,-1] <- round(eval[,-1],3)
eval
##               mod.name AccuracyTrain AccuracyTest SensitivityTest
## 1         RandomForest         0.987        0.602           0.383
## 2   RandomForest-KFold         0.997        0.617           0.356
## 3  DecisionTree-6level         0.617        0.618           0.306
## 4 DecisionTree-11level         0.646        0.606           0.357
## 5           NaiveBayes         0.607        0.608           0.348
##   SpecificityTest AucTest
## 1           0.784   0.618
## 2           0.833   0.635
## 3           0.876   0.622
## 4           0.812   0.614
## 5           0.822   0.599
eval %>% 
  gather("parameter","value",-mod.name) %>% 
  mutate(parameter = factor(parameter,levels = c("AccuracyTrain","AccuracyTest","SensitivityTest","SpecificityTest","AucTest")),
         mod.name = factor(mod.name,levels = c("NaiveBayes","DecisionTree-11level","DecisionTree-6level","RandomForest-KFold","RandomForest"))) %>% 
  ggplot(aes(x=mod.name,y=value,col=mod.name))+
  geom_point()+
  coord_flip()+
  facet_grid(.~parameter)+
  theme_get()+
  labs(x ="Model Name",y = "Value")+
  theme(axis.text.x = element_text(size = 6.5),legend.position = "none")

From the table and plot above, we can see that the Accuracy on Test Data are nearly the same for every model(range only 1.6%). Other parameter like Sensitivity, Specificity and AUC also shows a same kind/direction of every model.

On this analysis, I prefer more Specificity than Sensitivity. Because I want the model perform more on predicting the True Positive than True Negative. So if we look at the ROC Curve, we will sacrifice much Sensitivity for some Specificity.

In both Random Forest model, we can see that the accuracy on the Train Data is 100 then contrastly change with the test data. This can indicate that the model is overfitting. But I’m not sure since one of them is only used 4 out of 15 available predictor.

My recommendations:
1. Since the Parameter Evaluation give us poor result, find another data or/and another predictor to predict the shot result of NBA
2. But if I must choose best model from the analysis, it will be Decision Tree with 6 level. Since this model gives us best Accuracy on Test Data, best Specificity, and good AUC.

Cheers.