This data has been uploaded to Kaggle after being scraped from the NBA API. I will be doing exploratory data analysis on this dataset as well as comparing model performance for predicting shot results. The goal of this notebook is to explore the data and pick up a few insights; I will not be attempting any advanced prediction techniques.
https://www.kaggle.com/dansbecker/nba-shot-logs
library(tidyverse)
library(data.table)
library(corrplot)
library(caret)
library(gbm)
library(e1071)
library(randomForest)
First, I’ll read in the data which is simply a .csv file.
df<- read.csv("shot_logs.csv")
glimpse(df)
Observations: 128,069
Variables: 21
$ GAME_ID <int> 21400899, 21400899, 21400899, 21400899, 21400899, 21400899, 21400899, 21400899, 21400899, 21400890, 2...
$ MATCHUP <fctr> MAR 04, 2015 - CHA @ BKN, MAR 04, 2015 - CHA @ BKN, MAR 04, 2015 - CHA @ BKN, MAR 04, 2015 - CHA @ B...
$ LOCATION <fctr> A, A, A, A, A, A, A, A, A, H, H, H, H, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, H, H, H...
$ W <fctr> W, W, W, W, W, W, W, W, W, W, W, W, W, W, L, L, L, L, L, W, W, W, W, W, W, W, W, L, L, L, L, L, L, L...
$ FINAL_MARGIN <int> 24, 24, 24, 24, 24, 24, 24, 24, 24, 1, 1, 1, 1, 15, -8, -8, -8, -8, -8, 12, 12, 12, 12, 12, 12, 12, 1...
$ SHOT_NUMBER <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 1, 2, 3,...
$ PERIOD <int> 1, 1, 1, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4, 4, 1, 2, 2, 4, 4, 1, 1, 2, 2, 2, 2, 4, 4, 1, 2, 2, 4, 1, 2, 4,...
$ GAME_CLOCK <fctr> 1:09, 0:14, 0:00, 11:47, 10:34, 8:15, 10:15, 8:00, 5:14, 11:32, 6:30, 11:32, 8:55, 9:10, 0:48, 10:38...
$ SHOT_CLOCK <dbl> 10.8, 3.4, NA, 10.3, 10.9, 9.1, 14.5, 3.4, 12.4, 17.4, 16.0, 12.1, 4.3, 4.4, 6.8, 6.4, 17.6, 8.7, 20....
$ DRIBBLES <int> 2, 0, 3, 2, 2, 2, 11, 3, 0, 0, 8, 14, 2, 0, 0, 3, 6, 1, 0, 2, 5, 6, 0, 16, 0, 7, 1, 0, 7, 1, 2, 0, 1,...
$ TOUCH_TIME <dbl> 1.9, 0.8, 2.7, 1.9, 2.7, 4.4, 9.0, 2.5, 0.8, 1.1, 7.5, 11.9, 2.9, 0.8, 0.5, 2.7, 5.1, 0.9, 1.2, 2.2, ...
$ SHOT_DIST <dbl> 7.7, 28.2, 10.1, 17.2, 3.7, 18.4, 20.7, 3.5, 24.6, 22.4, 24.5, 14.6, 5.9, 26.4, 22.8, 24.7, 25.0, 25....
$ PTS_TYPE <int> 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 2, 2, 3, 3, 2,...
$ SHOT_RESULT <fctr> made, missed, missed, missed, missed, missed, missed, made, missed, missed, missed, made, made, miss...
$ CLOSEST_DEFENDER <fctr> Anderson, Alan, Bogdanovic, Bojan, Bogdanovic, Bojan, Brown, Markel, Young, Thaddeus, Williams, Dero...
$ CLOSEST_DEFENDER_PLAYER_ID <int> 101187, 202711, 202711, 203900, 201152, 101114, 101127, 203486, 202721, 201961, 202391, 202391, 20194...
$ CLOSE_DEF_DIST <dbl> 1.3, 6.1, 0.9, 3.4, 1.1, 2.6, 6.1, 2.1, 7.3, 19.8, 4.7, 1.8, 5.4, 4.4, 5.3, 5.6, 5.4, 5.1, 11.1, 3.5,...
$ FGM <int> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,...
$ PTS <int> 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 3, 0, 0, 3, 0, 2, 2, 0, 3, 0, 3, 2, 0, 0, 0, 0, 0, 0, 2,...
$ player_name <fctr> brian roberts, brian roberts, brian roberts, brian roberts, brian roberts, brian roberts, brian robe...
$ player_id <int> 203148, 203148, 203148, 203148, 203148, 203148, 203148, 203148, 203148, 203148, 203148, 203148, 20314...
The first glance at the data looks good: there are no major issues with column names, and the dataset is quite small in terms of observations and columns, which will make it easy to work with.
Some quick recoding will make the data frame easier to work with. GAME_CLOCK is also showing as a factor; it should be converted to a time, or better yet to the number of seconds remaining.
df$WIN_LOSS <- df$W
df$PERIOD <- as.factor(df$PERIOD)
df$PTS_TYPE <- as.factor(df$PTS_TYPE)
df$FGM <- as.factor(df$FGM)
strp_GC <- strptime(df$GAME_CLOCK, '%M:%S')
df$GAME_CLOCK_SEC <- strp_GC$min * 60 + strp_GC$sec
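A quick spot check (a sketch, not part of the original output) confirms the conversion behaves as expected:
# Compare the original factor against the derived seconds-remaining column.
head(df[, c('GAME_CLOCK', 'GAME_CLOCK_SEC')])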
A quick check shows that there is only one variable with missing values, which represents the time on the shot clock when the field goal was attempted:
apply(apply(df,2,is.na),2,sum)
GAME_ID MATCHUP LOCATION W FINAL_MARGIN
0 0 0 0 0
SHOT_NUMBER PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES
0 0 0 5567 0
TOUCH_TIME SHOT_DIST PTS_TYPE SHOT_RESULT CLOSEST_DEFENDER
0 0 0 0 0
CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS player_name
0 0 0 0 0
player_id WIN_LOSS GAME_CLOCK_SEC
0 0 0
The SHOT_CLOCK variable can be plotted for some more insights:
df %>%
ggplot(aes(x = SHOT_CLOCK)) +
geom_area(stat = "bin", fill='blue', binwidth = 1)
There is an odd spike at 24 seconds, which is bizarre because that is the maximum value the shot clock can show. It seems odd that a shot would go up the instant a team gains possession.
df %>%
filter(SHOT_CLOCK > 23) %>%
group_by(player_name) %>%
summarise(count = n()) %>%
arrange(desc(count))
After looking at a few of these, a fan of the game would recognize that these are all big men who play close to the hoop. Domain knowledge leads to the conclusion that these are tip-ins/put-backs, where a player near the hoop collects an offensive rebound after a teammate's missed shot. In that situation, the shot clock resets when the ball hits the rim!
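As a quick check on the tip-in theory (a sketch, not in the original run), the shot distances for these attempts should be concentrated right at the rim:
# If these are put-backs, most distances should be within a few feet.
df %>%
  filter(SHOT_CLOCK > 23) %>%
  summarise(attempts = n(),
            median_dist = median(SHOT_DIST),
            pct_within_5ft = mean(SHOT_DIST < 5))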
Again, domain knowledge can help us after an exploratory plot of the missing values as well. It seems the majority occur when the game clock is below the shot clock, so the shot clock is turned off. Unfortunately, this does not account for all of the missing values, but it should help. These can be handled later.
null_sc <- df %>%
filter(is.na(SHOT_CLOCK) == T) %>%
group_by(GAME_CLOCK_SEC) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(null_sc, aes( x = GAME_CLOCK_SEC, y = count )) +
geom_area(stat = 'identity',fill='blue') +
coord_cartesian(xlim = c(0,100)) +
labs(title = 'Distribution of NULL Shot Clock Values') +
theme_bw()
The field goal attempts can be analyzed as a sanity check as well. Breaking them into two- and three-point attempts, the data can be compared with officially published stats. From an offline comparison, it seems we do not have a complete season of data.
df %>%
filter(PTS_TYPE == 2) %>%
group_by(player_name) %>%
summarize(attempts = n(), makes = sum(as.numeric(FGM)-1), fg_pct = makes/attempts) %>%
arrange(desc(attempts))
df %>%
filter(PTS_TYPE == 3) %>%
group_by(player_name) %>%
summarize(attempts = n(), makes = sum(as.numeric(FGM)-1), fg_pct = makes/attempts) %>%
arrange(desc(attempts))
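Another quick completeness check (a sketch, not part of the original run): a full 30-team NBA regular season has 1,230 games, so counting distinct game IDs shows how much of the season is covered.
# Fewer than 1230 distinct GAME_IDs confirms the dataset is partial.
n_distinct(df$GAME_ID)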
The defender data is quite interesting; it can be used for a nice graphic that shows how defenders differ from the mean in their impact on a shot. The focus will be on the defenders who had the most shots taken against them.
df %>%
group_by(CLOSEST_DEFENDER) %>%
summarize(attempts = n(), makes = sum(as.numeric(FGM)-1), fg_pct = makes/attempts) %>%
mutate(diff = fg_pct - mean(fg_pct)) %>%
mutate(Defense = ifelse(diff > 0, 'Below Average' , 'Above Average')) %>%
arrange(desc(diff)) %>%
filter(attempts > 600) %>%
ggplot(aes( x = factor(CLOSEST_DEFENDER, levels =CLOSEST_DEFENDER), y = diff)) +
geom_bar(stat = 'identity', aes(fill=Defense), width = .6) +
coord_flip() +
labs(x = 'Defender', y = 'Percentage Point Difference', title = 'Most Frequent Defender Performance') +
theme_bw()
The field goal percentage distribution has a noticeably lower mean and larger variance for three-point shots.
df %>%
group_by(player_name,PTS_TYPE) %>%
summarize(attempts = n(),
mean = mean(as.numeric(FGM)-1)) %>%
filter(attempts > 10) %>%
ggplot(aes( x = mean)) +
geom_density(fill = 'green') +
facet_wrap(~PTS_TYPE) +
theme_bw()
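The visual impression can be quantified with a quick numeric summary (a sketch following the same grouping as the plot):
# Mean and spread of player FG% by shot type, for players with >10 attempts.
df %>%
  group_by(player_name, PTS_TYPE) %>%
  summarize(attempts = n(), fg = mean(as.numeric(FGM) - 1)) %>%
  filter(attempts > 10) %>%
  group_by(PTS_TYPE) %>%
  summarise(mean_fg = mean(fg), sd_fg = sd(fg))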
Insights can be gleaned a bit quicker by applying the same plot parameters and using facet_wrap across similar variable types.
df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density(fill= 'blue') +
theme_bw()
There is a lot of great information here for understanding our variables. Digging into a perhaps uninteresting variable, player_id appears to be sequential, with the older players in the league having smaller numbers. This distribution hints that the league is quite youth-driven. These were the most senior players in this 2014-15 season.
df %>%
group_by(player_id,player_name) %>%
summarise(count = n()) %>%
arrange(player_id)
SHOT_DIST also has an interesting distribution which can be explained by the three point line.
df %>%
ggplot(aes(x=SHOT_DIST,fill=PTS_TYPE)) +
geom_density(stat='bin',alpha = .5) +
coord_cartesian(xlim=c(0,27))
TOUCH_TIME has negative values; it's not clear why, even after pulling out a few rows to look at.
df %>%
filter(TOUCH_TIME <0) %>%
select(TOUCH_TIME, GAME_CLOCK_SEC, SHOT_CLOCK, SHOT_DIST, DRIBBLES,player_name)
For now, TOUCH_TIME can be copied into a new variable where these negative values are limited to zero. Perhaps the mean could be used instead; more research would be needed to determine what is actually causing these values and whether they should be used at all.
df$touch_time_limit <- ifelse(df$TOUCH_TIME < 0 , 0, df$TOUCH_TIME)
df %>%
ggplot(aes(x=touch_time_limit)) +
geom_density(fill='dark green') +
theme_bw()
A similar analysis can be done for the categorical variables; there are no major surprises in the data.
df %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_bar(fill = 'blue') +
theme_bw()
attributes are not identical across measure variables; they will be dropped
The dataset is quite small and there are some obvious features to be made. The shot clock variable can also be repaired: when it is missing and the game clock is at or under 24 seconds, the game clock is substituted; otherwise the mean is used.
df <- df %>%
mutate(shot_clock_fix = ifelse(is.na(SHOT_CLOCK)== T & GAME_CLOCK_SEC <= 24, GAME_CLOCK_SEC,
ifelse(is.na(SHOT_CLOCK)== T & GAME_CLOCK_SEC > 24, mean(SHOT_CLOCK, na.rm = T),SHOT_CLOCK)))
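A quick check (a sketch) that the imputation removed all missing values:
# Should return 0 if every NA was filled.
sum(is.na(df$shot_clock_fix))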
The “tip-in” scenario from earlier can be turned into a feature, as this seems to be a unique type of shot.
df$tip_in <- ifelse(df$shot_clock_fix > 23 & df$SHOT_DIST < 5,1,0)
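How common the new flag is, and whether such shots convert at a higher rate, can be checked directly (a sketch, not part of the original run):
# Attempt counts and FG% for flagged tip-ins vs. everything else.
df %>%
  group_by(tip_in) %>%
  summarise(attempts = n(), fg_pct = mean(as.numeric(FGM) - 1))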
The average FG% of the offensive player, and the closest defender's FG% allowed, can also be used as features. This is done with hesitation, as future predictions may not hold up as well if behavior changes significantly at the player level.
df <- df %>%
group_by(CLOSEST_DEFENDER) %>%
summarize(def_fg_pct = mean(as.numeric(FGM)-1)) %>%
right_join(df) %>%
ungroup()
Joining, by = "CLOSEST_DEFENDER"
df <- df %>%
group_by(player_name,PTS_TYPE) %>%
summarize(off_fg_pct = mean(as.numeric(FGM)-1)) %>%
right_join(df) %>%
ungroup()
Joining, by = c("player_name", "PTS_TYPE")
A “clutch” factor can be created for high pressure moments.
# A final margin within 7 points either way counts as close, hence abs().
df$clutch <- ifelse(as.integer(df$PERIOD) >= 4 & df$GAME_CLOCK_SEC < 300 & abs(df$FINAL_MARGIN) < 7, 1, 0)
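A quick count (a sketch) of how often the flag fires:
# How many attempts qualify as "clutch"?
table(df$clutch)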
There is not an extreme amount of correlation between the numeric variables. We do see SHOT_DIST with a strong negative correlation to the player's FG%; typical three-point shooters would be expected to have lower accuracy.
df <- ungroup(df)
df %>%
select(SHOT_NUMBER, FINAL_MARGIN, SHOT_CLOCK, DRIBBLES
,SHOT_DIST, CLOSE_DEF_DIST, GAME_CLOCK_SEC, touch_time_limit,
off_fg_pct, def_fg_pct) %>%
cor(use='complete',method='pearson') %>%
corrplot(type='lower', diag = F)
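The SHOT_DIST relationship called out above can be pulled out directly (a sketch, not in the original run):
# Pearson correlation between shot distance and the shooter's FG% feature.
cor(df$SHOT_DIST, df$off_fg_pct, use = 'complete.obs')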
The data needs to be prepared if we are going to try multiple models. First, let's split into training and testing sets; a 70/30 split will be used.
set.seed(57)
train_flag <- createDataPartition(df$FGM, p = .7, list = FALSE, times = 1)
df_train <- df[train_flag,]
df_test <- df[-train_flag,]
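createDataPartition performs stratified sampling on the outcome, so the class balance should match across the two sets; a quick check (a sketch, not part of the original run):
# Proportion of makes/misses should be nearly identical in both sets.
prop.table(table(df_train$FGM))
prop.table(table(df_test$FGM))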
The dataset can also be cleaned up, as some variables were duplicated/updated or found to be unreliable. MATCHUP will be dropped as well, though with more time valuable information could be derived from it.
df_train<- subset(df_train, select = -c(player_name, CLOSEST_DEFENDER, GAME_ID,MATCHUP,GAME_CLOCK,
SHOT_CLOCK,TOUCH_TIME, CLOSEST_DEFENDER_PLAYER_ID,SHOT_RESULT,PTS,
player_id, W) )
df_test<- subset(df_test, select = -c(player_name, CLOSEST_DEFENDER, GAME_ID,MATCHUP,GAME_CLOCK,
SHOT_CLOCK,TOUCH_TIME, CLOSEST_DEFENDER_PLAYER_ID, SHOT_RESULT,PTS,
player_id, W))
Parameters can be set for the caret package's tuning. A new response variable will be made as well, since some model packages have issues with 0/1 responses. We will drop the old response.
fitControl <- trainControl( method = "repeatedcv",
number = 5,
repeats = 3,
classProbs = T,
summaryFunction = twoClassSummary)
df_train$made <- ifelse(df_train$FGM == 1, 'yes', 'no')
df_test$made <- ifelse(df_test$FGM == 1, 'yes', 'no')
# Drop the old 0/1 response by name rather than by column position.
df_train <- subset(df_train, select = -FGM)
df_test <- subset(df_test, select = -FGM)
Gradient Boosted Machine tuning with ROC metric.
gbm_fit2 <- train(made ~ ., data = df_train,
method = 'gbm',
trControl = fitControl,
verbose = F,
metric = "ROC")
Loading required package: plyr
-------------------------------------------------------------------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
-------------------------------------------------------------------------------------------------------------------------------------------
Attaching package: 'plyr'
The following object is masked from 'package:lubridate':
    here
The following objects are masked from 'package:dplyr':
    arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
The following object is masked from 'package:purrr':
    compact
Random Forest
rf_fit2 <- train(made ~ ., data = df_train,
method = 'rf',
trControl = fitControl,
verbose = F,
metric = "ROC")
Logistic Regression
lr_fit2 <- train(made ~ ., data = df_train,
method = 'glm',
trControl = fitControl,
metric = "ROC")
Neural Network
nn_fit2 <- train(made ~ ., data = df_train,
method = 'nnet',
trControl = fitControl,
verbose = F,
preProcess = c('center','scale'),
metric = "ROC")
Comparing performance
fits <- resamples(list(nnet = nn_fit2 , gbm = gbm_fit2 ,
lr = lr_fit2, rf = rf_fit2))
summary(fits)
Call:
summary.resamples(object = fits)
Models: nnet, gbm, lr, rf
Number of resamples: 15
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
nnet 0.6419749 0.6457639 0.6490521 0.6493024 0.6526983 0.6550079 0
gbm 0.6460465 0.6528676 0.6539142 0.6535505 0.6558328 0.6592357 0
lr 0.6372656 0.6390820 0.6414169 0.6421979 0.6450529 0.6492579 0
rf 0.6357384 0.6362956 0.6373488 0.6386375 0.6401108 0.6449399 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
nnet 0.8134989 0.8227629 0.8281584 0.8279141 0.8331976 0.8410872 0
gbm 0.8117683 0.8136516 0.8175710 0.8190234 0.8237809 0.8288710 0
lr 0.7204520 0.7240660 0.7281889 0.7271370 0.7299196 0.7335844 0
rf 0.8412908 0.8478062 0.8501476 0.8508806 0.8542197 0.8586990 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
nnet 0.3528251 0.3633897 0.3701739 0.3706189 0.3749846 0.3906982 0
gbm 0.3743678 0.3786234 0.3832963 0.3858818 0.3922536 0.4005181 0
lr 0.4676206 0.4718436 0.4756383 0.4755021 0.4774886 0.4880967 0
rf 0.3189836 0.3354716 0.3384729 0.3379057 0.3426668 0.3457506 0
GBM seems to be the best model at a glance.
bwplot(fits,layout = c(3,1))
GBM seems to be the most stable performer; this is the tuning parameter breakdown.
gbm_fit2
Stochastic Gradient Boosting
89649 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 71719, 71720, 71719, 71719, 71719, 71719, ...
Resampling results across tuning parameters:
interaction.depth n.trees ROC Sens Spec
1 50 0.6398782 0.8137365 0.3725185
1 100 0.6458394 0.8095626 0.3847798
1 150 0.6482706 0.7971088 0.4023535
2 50 0.6468817 0.8334928 0.3603312
2 100 0.6513804 0.8223150 0.3771236
2 150 0.6526405 0.8151549 0.3871975
3 50 0.6496963 0.8398996 0.3555697
3 100 0.6528825 0.8262106 0.3764740
3 150 0.6535505 0.8190234 0.3858818
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
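Note that the selected model sits at the edge of caret's default grid (the largest n.trees and interaction.depth tried), so a wider search might do better. A sketch of how that could be extended — gbm_grid and gbm_fit3 are illustrative names, and this was not run here:
# Widen the search space beyond caret's 3x3 default for gbm.
gbm_grid <- expand.grid(n.trees = c(150, 300, 500),
                        interaction.depth = c(3, 5, 7),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
gbm_fit3 <- train(made ~ ., data = df_train,
                  method = 'gbm',
                  trControl = fitControl,
                  tuneGrid = gbm_grid,
                  verbose = F,
                  metric = "ROC")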
And finally, a breakdown of variable importance in the best model.
varImp(gbm_fit2)
gbm variable importance
only 20 most important variables shown (out of 21)
Overall
SHOT_DIST 100.00000
CLOSE_DEF_DIST 41.91805
off_fg_pct 19.76467
touch_time_limit 16.03255
shot_clock_fix 15.34549
def_fg_pct 10.98181
FINAL_MARGIN 8.27148
DRIBBLES 1.32042
GAME_CLOCK_SEC 1.21124
SHOT_NUMBER 0.95576
WIN_LOSSW 0.51476
PERIOD5 0.19436
PERIOD3 0.16487
tip_in 0.11762
PTS_TYPE3 0.11262
clutch 0.09712
LOCATIONH 0.04888
PERIOD2 0.00000
PERIOD6 0.00000
PERIOD4 0.00000
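Before closing, the held-out test set built earlier (but not otherwise used above) could confirm the resampled estimates; a minimal sketch assuming the objects as constructed above:
# Score the tuned GBM on the untouched 30% test split.
test_pred <- predict(gbm_fit2, newdata = df_test)
confusionMatrix(test_pred, factor(df_test$made, levels = c('no', 'yes')))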
Thanks for reading the early stages of this analysis. Future steps would most likely focus on creating stronger predictive features to feed into the GBM model, depending on the problem being solved.