This NBA shot dataset comes from Kaggle and covers the 2014-2015 NBA season. From this data I want to build a model that predicts the shot result from the available predictors.
library(tidyverse)
library(dplyr)
library(janitor)
library(lubridate)
library(partykit)
library(rpart)
library(caret)
library(randomForest)
library(rpart.plot)
library(e1071)
library(pROC)
nbaa <- as.data.frame(read_csv("nbashot.csv"))
glimpse(nbaa)
## Observations: 128,069
## Variables: 21
## $ GAME_ID <dbl> 21400899, 21400899, 21400899, 2140089…
## $ MATCHUP <chr> "MAR 04, 2015 - CHA @ BKN", "MAR 04, …
## $ LOCATION <chr> "A", "A", "A", "A", "A", "A", "A", "A…
## $ W <chr> "W", "W", "W", "W", "W", "W", "W", "W…
## $ FINAL_MARGIN <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 1…
## $ SHOT_NUMBER <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4…
## $ PERIOD <dbl> 1, 1, 1, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4…
## $ GAME_CLOCK <drtn> 01:09:00, 00:14:00, 00:00:00, 11:47:…
## $ SHOT_CLOCK <dbl> 10.8, 3.4, NA, 10.3, 10.9, 9.1, 14.5,…
## $ DRIBBLES <dbl> 2, 0, 3, 2, 2, 2, 11, 3, 0, 0, 8, 14,…
## $ TOUCH_TIME <dbl> 1.9, 0.8, 2.7, 1.9, 2.7, 4.4, 9.0, 2.…
## $ SHOT_DIST <dbl> 7.7, 28.2, 10.1, 17.2, 3.7, 18.4, 20.…
## $ PTS_TYPE <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2…
## $ SHOT_RESULT <chr> "made", "missed", "missed", "missed",…
## $ CLOSEST_DEFENDER <chr> "Anderson, Alan", "Bogdanovic, Bojan"…
## $ CLOSEST_DEFENDER_PLAYER_ID <dbl> 101187, 202711, 202711, 203900, 20115…
## $ CLOSE_DEF_DIST <dbl> 1.3, 6.1, 0.9, 3.4, 1.1, 2.6, 6.1, 2.…
## $ FGM <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1…
## $ PTS <dbl> 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2…
## $ player_name <chr> "brian roberts", "brian roberts", "br…
## $ player_id <dbl> 203148, 203148, 203148, 203148, 20314…
First, I standardize the variable names with the clean_names() function from the janitor package. Then I remove unnecessary variables such as game_id, matchup, closest_defender, and player_name. I also remove the fgm and pts variables, since those two are directly determined by shot_result, the target variable. Finally, I remove any rows with NA values.
nbaa <- clean_names(nbaa)
nba <- nbaa %>%
select(-game_id,-matchup,-closest_defender,-player_name,-fgm,-pts) %>%
mutate(shot_result = factor(shot_result,levels = c("missed","made")),
game_clock = as.numeric(seconds(hms(game_clock))))
nba <- na.omit(nba)
glimpse(nba)
## Observations: 122,502
## Variables: 15
## $ location <chr> "A", "A", "A", "A", "A", "A", "A", "A…
## $ w <chr> "W", "W", "W", "W", "W", "W", "W", "W…
## $ final_margin <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 1, 1,…
## $ shot_number <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1…
## $ period <dbl> 1, 1, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4, 4…
## $ game_clock <dbl> 4140, 840, 42420, 38040, 29700, 36900…
## $ shot_clock <dbl> 10.8, 3.4, 10.3, 10.9, 9.1, 14.5, 3.4…
## $ dribbles <dbl> 2, 0, 2, 2, 2, 11, 3, 0, 0, 8, 14, 2,…
## $ touch_time <dbl> 1.9, 0.8, 1.9, 2.7, 4.4, 9.0, 2.5, 0.…
## $ shot_dist <dbl> 7.7, 28.2, 17.2, 3.7, 18.4, 20.7, 3.5…
## $ pts_type <dbl> 2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 3…
## $ shot_result <fct> made, missed, missed, missed, missed,…
## $ closest_defender_player_id <dbl> 101187, 202711, 203900, 201152, 10111…
## $ close_def_dist <dbl> 1.3, 6.1, 3.4, 1.1, 2.6, 6.1, 2.1, 7.…
## $ player_id <dbl> 203148, 203148, 203148, 203148, 20314…
I split the data into train and test sets with an 80:20 proportion and assign them to two pairs of objects for different modelling purposes.
set.seed(212)
index1 <- sample(nrow(nba), nrow(nba)*0.8)
nbaTr1 <- nba[index1, ]
nbaTs1 <- nba[-index1, ]
nbaTr2 <- nba[index1, ]
nbaTs2 <- nba[-index1, ]
dim(nbaTr1)
## [1] 98001 15
dim(nbaTs1)
## [1] 24501 15
Random Forest model with 4 predictors and without K-Fold Cross-Validation
set.seed(212)
ctrl <- trainControl(method="none")
# DO NOT RUN (trained once and saved to rf1.rds; note that with
# method = "none", recent caret versions expect a single-row tuneGrid,
# e.g. tuneGrid = data.frame(mtry = 2))
#rfMod <- train(shot_result ~ shot_dist + game_clock + closest_defender_player_id + close_def_dist, data=nbaTr1, method="rf", trControl = ctrl)
rfMod <- readRDS("rf1.rds")
# column 12 is the target shot_result, so drop it from the predictors
rfProb1 <- predict(rfMod, nbaTr1[,-12],type = "prob")
rfProb2 <- predict(rfMod, nbaTs1[,-12],type = "prob")
# probability column 1 is "missed", column 2 is "made" (factor level order)
rfPred1 <- factor((ifelse(rfProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
rfPred2 <- factor((ifelse(rfProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))
rf1 <- confusionMatrix(rfPred1,nbaTr1$shot_result,positive = "made")
rf2 <- confusionMatrix(rfPred2,nbaTs1$shot_result,positive = "made")
The variable importance of this model:
varImp(rfMod)
## rf variable importance
##
## Overall
## shot_dist 100.00
## game_clock 98.47
## closest_defender_player_id 71.70
## close_def_dist 0.00
ROC and AUC for Test Data
roc1 <- plot(roc(nbaTs1$shot_result,rfProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases
auc1 <- auc(roc1)
auc1
## Area under the curve: 0.6183
Above is the ROC curve for the test data, with AUC = 0.6183.
Random Forest model with all predictors and with K-Fold Cross-Validation: 5 folds and 3 repeats.
set.seed(212)
ctrl2 <- trainControl(method="repeatedcv", number=5, repeats=3)
# DO NOT RUN
# rfMod2<- train(shot_result ~ ., data=nbaTr1, method="rf", trControl = ctrl2)
rfMod2 <- readRDS("rf2.rds")
rfProb3 <- predict(rfMod2, nbaTr1[,-12],type = "prob")
rfProb4 <- predict(rfMod2, nbaTs1[,-12],type = "prob")
rfPred3 <- factor((ifelse(rfProb3[,1] >0.5,"missed","made")),levels = c("missed","made"))
rfPred4 <- factor((ifelse(rfProb4[,1] >0.5,"missed","made")),levels = c("missed","made"))
rf3 <- confusionMatrix(rfPred3,nbaTr1$shot_result,positive = "made")
rf4 <- confusionMatrix(rfPred4,nbaTs1$shot_result,positive = "made")
The variable importance of this model:
options(scipen = 999)
varImp(rfMod2)
## rf variable importance
##
## Overall
## shot_dist 100.000
## shot_clock 77.463
## game_clock 76.457
## close_def_dist 74.824
## closest_defender_player_id 73.747
## player_id 70.569
## touch_time 63.327
## final_margin 58.588
## shot_number 47.921
## dribbles 23.465
## period 20.709
## locationH 7.339
## pts_type 4.352
## wW 0.000
ROC and AUC for Test Data
roc2 <- plot(roc(nbaTs1$shot_result,rfProb4[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases
auc2 <- auc(roc2)
auc2
## Area under the curve: 0.6354
Above is the ROC curve for the test data, with AUC = 0.6354.
Decision Tree model with maxdepth = 6
dtMod1 <- rpart(shot_result ~ ., nbaTr2,method = "class",control = rpart.control(maxdepth = 6 ,cp = 0))
rpart.plot(dtMod1,type = 3,box.palette = "GnBu",branch.lty = 3,shadow.col = "gray",nn = TRUE,cex = 0.5)
dtProb1 <- predict(dtMod1,nbaTr2[,-12])
dtProb2 <- predict(dtMod1,nbaTs2[,-12])
dtPred1 <- factor((ifelse(dtProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
dtPred2 <- factor((ifelse(dtProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))
dt1 <- confusionMatrix(dtPred1,nbaTr2$shot_result,positive = "made")
dt2 <- confusionMatrix(dtPred2,nbaTs2$shot_result,positive = "made")
ROC and AUC for Test Data
roc3 <- plot(roc(nbaTs2$shot_result,dtProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases
auc3 <- auc(roc3)
auc3
## Area under the curve: 0.6221
Above is the ROC curve for the test data, with AUC = 0.6221.
Decision Tree model with maxdepth = 11
dtMod2 <- rpart(shot_result ~ ., nbaTr2,method = "class",control = rpart.control(maxdepth = 11 ,cp = 0))
rpart.plot(dtMod2,type = 3,box.palette = "GnBu",branch.lty = 3,shadow.col = "gray",nn = TRUE)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
dtProb3 <- predict(dtMod2,nbaTr2[,-12])
dtProb4 <- predict(dtMod2,nbaTs2[,-12])
dtPred3 <- factor((ifelse(dtProb3[,1] >0.5,"missed","made")),levels = c("missed","made"))
dtPred4 <- factor((ifelse(dtProb4[,1] >0.5,"missed","made")),levels = c("missed","made"))
dt3 <- confusionMatrix(dtPred3,nbaTr2$shot_result,positive = "made")
dt4 <- confusionMatrix(dtPred4,nbaTs2$shot_result,positive = "made")
ROC and AUC for Test Data
roc4 <- plot(roc(nbaTs2$shot_result,dtProb4[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases
auc4 <- auc(roc4)
auc4
## Area under the curve: 0.6145
Above is the ROC curve for the test data, with AUC = 0.6145.
Since there are only two categorical variables (location and w), for the Naive Bayes model I add categorical variables by binning 4 numeric variables:
1. Shot Distance <4.8
2. Closest Def Distance <2.8
3. Shot Clock <24
4. Game Clock < 4100
I use the variable importance from the Random Forest model as a reference to decide which variables to bin, then use a Decision Tree to decide the threshold for each variable (a sketch of that step is shown below).
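As a minimal sketch of the threshold-finding step: each numeric variable can be fit against shot_result with a depth-1 rpart tree (a stump), and the cut point read off the fitted object. The helper get_threshold is hypothetical, not part of the original analysis.
# Hypothetical helper: fit a one-split classification tree on a single
# numeric variable and return its split threshold.
get_threshold <- function(var, data) {
  stump <- rpart(reformulate(var, "shot_result"), data = data,
                 method = "class", control = rpart.control(maxdepth = 1, cp = 0))
  # for a numeric primary split, the cut point is in the "index" column
  stump$splits[1, "index"]
}
sapply(c("shot_dist", "close_def_dist", "shot_clock", "game_clock"),
       get_threshold, data = nba)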
After this preprocessing, I split the train and test data with an 80:20 proportion.
nbaBn <- nba %>%
mutate(sd = ifelse(shot_dist < 4.8,"Short","Long"),
cdd = ifelse(close_def_dist < 2.8, "Close", "Far"),
sc = ifelse(shot_clock < 24,"Short","Long"),
gc = ifelse(game_clock < 4100,"Short", "Long")) %>%
select(sd,cdd,sc,gc,shot_result,location,w)
set.seed(212)
index2 <- sample(nrow(nbaBn),nrow(nbaBn)*0.8)
nbTr <- nbaBn[index2,]
nbTs <- nbaBn[-index2,]
nbMod1 <- naiveBayes(shot_result ~ ., nbTr)
nbProb1 <- predict(nbMod1, nbTr,type = "raw")
nbProb2 <- predict(nbMod1, nbTs,type = "raw")
nbPred1 <- factor((ifelse(nbProb1[,1] >0.5,"missed","made")),levels = c("missed","made"))
nbPred2 <- factor((ifelse(nbProb2[,1] >0.5,"missed","made")),levels = c("missed","made"))
nb1 <- confusionMatrix(nbPred1,nbTr$shot_result,positive = "made")
nb2 <- confusionMatrix(nbPred2,nbTs$shot_result,positive = "made")
ROC and AUC for Test Data
roc5 <- plot(roc(nbTs$shot_result,nbProb2[,2]))
## Setting levels: control = missed, case = made
## Setting direction: controls < cases
auc5 <- auc(roc5)
auc5
## Area under the curve: 0.5985
Above is the ROC curve for the test data, with AUC = 0.5985.
Below is a data frame of evaluation metrics for comparing the models.
mod.name <- c("RandomForest","RandomForest-KFold","DecisionTree-6level","DecisionTree-11level","NaiveBayes")
AccuracyTrain <- c(rf1$overall[1],rf3$overall[1],dt1$overall[1],dt3$overall[1],nb1$overall[1])
AccuracyTest <- c(rf2$overall[1],rf4$overall[1],dt2$overall[1],dt4$overall[1],nb2$overall[1])
SensitivityTest <- c(rf2$byClass[1],rf4$byClass[1],dt2$byClass[1],dt4$byClass[1],nb2$byClass[1])
SpecificityTest <- c(rf2$byClass[2],rf4$byClass[2],dt2$byClass[2],dt4$byClass[2],nb2$byClass[2])
AucTest <- c(auc1,auc2,auc3,auc4,auc5)
eval <- data.frame(mod.name,AccuracyTrain,AccuracyTest,SensitivityTest,SpecificityTest,AucTest)
eval[,-1] <- round(eval[,-1],3)
eval
## mod.name AccuracyTrain AccuracyTest SensitivityTest
## 1 RandomForest 0.987 0.602 0.383
## 2 RandomForest-KFold 0.997 0.617 0.356
## 3 DecisionTree-6level 0.617 0.618 0.306
## 4 DecisionTree-11level 0.646 0.606 0.357
## 5 NaiveBayes 0.607 0.608 0.348
## SpecificityTest AucTest
## 1 0.784 0.618
## 2 0.833 0.635
## 3 0.876 0.622
## 4 0.812 0.614
## 5 0.822 0.599
eval %>%
gather("parameter","value",-mod.name) %>%
mutate(parameter = factor(parameter,levels = c("AccuracyTrain","AccuracyTest","SensitivityTest","SpecificityTest","AucTest")),
mod.name = factor(mod.name,levels = c("NaiveBayes","DecisionTree-11level","DecisionTree-6level","RandomForest-KFold","RandomForest"))) %>%
ggplot(aes(x=mod.name,y=value,col=mod.name))+
geom_point()+
coord_flip()+
facet_grid(.~parameter)+
theme_get()+
labs(x ="Model Name",y = "Value")+
theme(axis.text.x = element_text(size = 6.5),legend.position = "none")
From the table and plot above, we can see that the accuracy on the test data is nearly the same for every model (a range of only 1.6%). The other metrics, sensitivity, specificity, and AUC, also show the same pattern across models.
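To put those accuracies in context, it helps to compare them against the no-information rate (always predicting the majority class), using objects already created above. A quick check, relying on caret's confusionMatrix storing this rate as AccuracyNull:
# Majority-class baseline on the test set, for comparison with ~0.6 accuracy
prop.table(table(nbaTs1$shot_result))
rf2$overall["AccuracyNull"]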
In this analysis, I prefer specificity over sensitivity, because I want the model to be better at predicting true negatives (missed shots) than true positives. Looking at the ROC curve, this means sacrificing a lot of sensitivity to gain some specificity.
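One way to make that trade explicit is to move the probability cutoff away from 0.5. Here is a sketch using pROC's coords() on the decision-tree test ROC (roc3), assuming a recent pROC where coords() returns a data frame; the 0.9 specificity target and the object dtPred2b are illustrative choices of mine:
# Find the cutoff that reaches 0.9 specificity, then re-cut the predictions
cut_pt <- coords(roc3, x = 0.9, input = "specificity",
                 ret = c("threshold", "sensitivity", "specificity"))
dtPred2b <- factor(ifelse(dtProb2[,2] > cut_pt$threshold, "made", "missed"),
                   levels = c("missed", "made"))
confusionMatrix(dtPred2b, nbaTs2$shot_result, positive = "made")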
In both Random Forest models, we can see that the accuracy on the train data is nearly 100% and then drops sharply on the test data. This can indicate that the models are overfitting, though I'm not sure, since one of them only used 4 of the 14 available predictors.
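A quick way to probe this without retraining is the out-of-bag (OOB) error that randomForest tracks internally; caret keeps the underlying fit in $finalModel, and its printed summary includes the OOB error rate, which should be far less optimistic than the resubstitution accuracy above:
# The OOB error of the underlying randomForest fit is a fairer estimate
# of generalization than the near-perfect training accuracy.
rfMod2$finalModel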
My recommendations:
1. Since the evaluation metrics show poor results, look for other data and/or other predictors to predict NBA shot results.
2. If I must choose the best model from this analysis, it would be the Decision Tree with maxdepth = 6, since it gives the best accuracy on the test data, the best specificity, and a good AUC.
Cheers.