In this lab, we’ll create decision trees to model shooting percentages for different types of shots in the NBA.

The Data

We can use the same data on December 2015 first-quarter NBA jump shots that we used earlier in the year:

shots <- read.csv('/home/rstudioshared/shared_files/data/nba_savant_jumpshots_dec2015_q1.csv')

Let’s take a quick look at the data:

nrow(shots); colnames(shots)
head(shots)
View(shots)

In this lab we’ll be trying to predict whether a shot will be made. The shot_made_flag variable takes on a value of 0 or 1 depending on whether the shot was missed or made.
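Because the flag is coded 0/1, its average is simply the fraction of shots made, which is why the prediction for a branch can be read as a shooting percentage. A tiny sketch with made-up values (toy_flag is hypothetical, not taken from the lab file):

```r
# Hypothetical toy outcomes: 1 = made, 0 = missed
toy_flag <- c(1, 0, 0, 1, 0)

# The mean of a 0/1 vector is the proportion of 1s
made_rate <- mean(toy_flag)
made_rate
```

The same idea applies to the real data: mean(shots$shot_made_flag) should give the overall shooting percentage as a fraction.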

Decision Trees

We’ll use the rpart package to create our trees. rpart stands for recursive partitioning. Partitioning means splitting into subsets or branches such that every item is on exactly one branch. Recursion means repeating an algorithm on the results of the algorithm. In this case, we’ll take our branches and split them into smaller branches and then split those smaller branches into yet smaller branches, each time using the same algorithm to determine how we choose the branches.

library(rpart); library(rpart.plot)

A Simple Tree

Let’s try to build a simple model that predicts whether a shot was made based on the amount of time left on the shot clock:

fit <- rpart(shot_made_flag ~ shot_clock,cp=0.003,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

Our plot tells us a few things. First, the top node reveals that we’re working with 6,468 shots, of which 39% were made. Our simple model splits the data into two branches: shots taken with less than 3.8 seconds left on the clock and shots with 3.8 or more seconds left. 831 shots end up on the former branch and 5,632 on the latter. Of the shots taken with less than 3.8 seconds remaining, 31% were made, whereas 40% of all other shots were made.

Why did our model choose to split the data at 3.8 seconds? The median amount of time left on the clock was 11.9 seconds, so why not use that?

The rpart function chose 3.8 seconds because that’s the partition that minimizes the prediction error. Our model predicts that 31% of shots will be made when less than 3.8 seconds are left on the clock and 40% otherwise (these are just the averages for each branch). Any other partition of shots (that is, any other choice of branches) would have led to worse predictions, meaning predictions with a higher root mean square error (RMSE).
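To get a feel for what “best partition” means, here is a rough brute-force sketch in base R. It uses simulated shots rather than the lab data (the toy data frame and the sse_for_cut helper are made up for illustration), tries every possible cutoff, and keeps the one with the smallest sum of squared errors. rpart’s own search is more efficient and also enforces minimum branch sizes, so treat this as an illustration of the idea only.

```r
set.seed(1)

# Simulated stand-in for the lab data (an assumption, not the real file):
# shots taken with under 4 seconds on the clock go in less often
toy <- data.frame(shot_clock = round(runif(2000, 0, 24), 1))
toy$shot_made_flag <- rbinom(2000, 1, ifelse(toy$shot_clock < 4, 0.25, 0.45))

# Sum of squared errors when each side of the cutoff is predicted by its own mean
sse_for_cut <- function(cutoff, x, y) {
  left <- y[x < cutoff]
  right <- y[x >= cutoff]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Candidate cutoffs: midpoints between adjacent distinct shot clock values
vals <- sort(unique(toy$shot_clock))
cuts <- (head(vals, -1) + tail(vals, -1)) / 2
sse <- sapply(cuts, sse_for_cut, x = toy$shot_clock, y = toy$shot_made_flag)

best_cut <- cuts[which.min(sse)]
median_cut_sse <- sse_for_cut(median(toy$shot_clock), toy$shot_clock, toy$shot_made_flag)

c(best_cut = best_cut, best_sse = min(sse), median_sse = median_cut_sse)
```

Since the number of shots is fixed, minimizing the sum of squared errors is the same as minimizing RMSE. Splitting at the median is just one of the candidate partitions, so it can never do better than the best cutoff.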

A Little Recursion

Okay, here comes the recursion part. After finding and creating the best possible first branches, rpart can look at each of those branches and find the best possible place to create a sub-branch. Remember that by best possible I simply mean the branch that will minimize prediction error.
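The recursion itself can be sketched in a few lines of base R. This is a simplified illustration on simulated shots, not rpart’s actual code: the best_cut and grow helpers are hypothetical, and the sketch stops at a fixed depth, whereas rpart decides when to stop using the complexity parameter.

```r
# Best single cutoff for predicting y from x (smallest sum of squared errors)
best_cut <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(NULL)  # nothing left to split on
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  sse <- sapply(cuts, function(cutoff) {
    left <- y[x < cutoff]
    right <- y[x >= cutoff]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

# Split, then apply the same function to each side; returns one row per final branch
grow <- function(x, y, depth = 0, max_depth = 2) {
  cutoff <- if (depth < max_depth) best_cut(x, y) else NULL
  if (is.null(cutoff)) {
    return(data.frame(min_clock = min(x), max_clock = max(x),
                      n = length(y), rate = mean(y)))
  }
  left <- x < cutoff
  rbind(grow(x[left], y[left], depth + 1, max_depth),
        grow(x[!left], y[!left], depth + 1, max_depth))
}

set.seed(5)
# Simulated shots (an assumption, not the lab data)
clock <- round(runif(1000, 0, 24), 1)
made <- rbinom(1000, 1, ifelse(clock < 4, 0.25, 0.45))

leaves <- grow(clock, made)
leaves
```

Every simulated shot lands in exactly one row of leaves, which is the “partitioning” half of recursive partitioning. In the lab itself, lowering cp lets rpart take that second step: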

fit <- rpart(shot_made_flag ~ shot_clock,cp=0.001,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

Take a minute to make sure that you understand what this model says. Note that each “node” has an if statement. The left branch is the one you follow if the condition is true; the right branch is the one you follow if it is false.
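If you want to double-check your reading of a tree, predict() will follow the branches for you. The sketch below fits a tree to simulated shots (the toy data and its built-in 4-second effect are assumptions, not the lab data) and asks for predictions at two shot clock values.

```r
library(rpart)

set.seed(2)
# Simulated shots (an assumption): rushed shots go in less often
toy <- data.frame(shot_clock = runif(3000, 0, 24))
toy$shot_made_flag <- rbinom(3000, 1, ifelse(toy$shot_clock < 4, 0.2, 0.5))

toy_fit <- rpart(shot_made_flag ~ shot_clock, data = toy, cp = 0.01)

# Text version of the tree: each line shows a node's condition, size and average
print(toy_fit)

# Predicted make rate for a shot with 2 seconds left and one with 15 seconds left
new_shots <- data.frame(shot_clock = c(2, 15))
toy_pred <- predict(toy_fit, newdata = new_shots)
toy_pred
```

Each prediction is just the average of shot_made_flag in the branch where that shot ends up.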

Q1. According to this simple model, what is the % chance that a shot taken with 15 seconds left will be made?

The Complexity Parameter

We got two splits in the second model but only one in the first because we lowered “cp”, the complexity parameter. The rpart function stops adding new branches as soon as the improvement from creating the best next branch falls below a certain level. This level is called the complexity parameter. If it’s very small, the improvement in prediction error need not be large in order to create new branches. With a large value of cp, on the other hand, a new branch will only be created if it leads to a drastic improvement in prediction accuracy.

Why not just add any branch that improves the prediction accuracy by any amount, no matter how small? Well, if we did this, we’d end up with roughly 6,500 branches in our model, one for each shot taken in our data set. This incredibly complex model wouldn’t teach us a thing and, worse, it would make disastrously poor predictions “out of sample”, meaning if you tried to use it to predict shot results in the following month. One of the things that we’ll discover in our model building is that simple models often outperform complex ones.
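A small simulation can make the out-of-sample problem concrete. The sketch below is built on assumptions: the shots are simulated by a made-up make_shots function, and the “complex” tree is forced to keep splitting by setting cp = 0 and shrinking rpart’s minimum branch sizes. Both trees are fit on one simulated month and then scored on a second one.

```r
library(rpart)

set.seed(3)
# Hypothetical data generator: make rate depends only on whether the shot is rushed
make_shots <- function(n) {
  d <- data.frame(shot_clock = runif(n, 0, 24))
  d$shot_made_flag <- rbinom(n, 1, ifelse(d$shot_clock < 4, 0.25, 0.45))
  d
}
train <- make_shots(2000)  # the month we fit on
test <- make_shots(2000)   # the "following month"

rmse <- function(fit, d) sqrt(mean((d$shot_made_flag - predict(fit, d))^2))

simple_fit <- rpart(shot_made_flag ~ shot_clock, data = train, cp = 0.005)

# cp = 0 plus tiny minimum node sizes lets the tree chase individual shots
complex_fit <- rpart(shot_made_flag ~ shot_clock, data = train,
                     control = rpart.control(cp = 0, minsplit = 2,
                                             minbucket = 1, xval = 0))

results <- rbind(
  simple = c(train = rmse(simple_fit, train), test = rmse(simple_fit, test)),
  complex = c(train = rmse(complex_fit, train), test = rmse(complex_fit, test))
)
results
```

In this simulation the complex tree looks better on the data it was fit to and worse on the new data, which is the pattern described above.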

You can find out a bit more about your branches with the code:

printcp(fit)

This tells you that the first branch led to an improvement of 0.0038 in relative error. This means that it decreased the mean square error by only 0.38%. The next branch led to an improvement of only 0.14%. Our CP cutoff was 0.1% (0.001), and no additional branch led to an improvement of that size, which limited the complexity of our tree.
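The table that printcp() prints is also stored in the fit as fit$cptable, so you can check the arithmetic yourself. The sketch below uses simulated shots (an assumption, not the lab data) and turns off cross-validation with xval = 0 so that only the CP, nsplit and rel error columns appear. As far as I can tell, each CP value (except the one in the last row) matches the drop in relative error per added split between that row and the next.

```r
library(rpart)

set.seed(4)
# Simulated shots (an assumption, not the lab data)
toy <- data.frame(shot_clock = runif(3000, 0, 24))
toy$shot_made_flag <- rbinom(3000, 1, ifelse(toy$shot_clock < 4, 0.25, 0.45))

toy_fit <- rpart(shot_made_flag ~ shot_clock, data = toy, cp = 0.001, xval = 0)

tab <- toy_fit$cptable
tab

# Drop in relative error per added split, going from each row to the next
drop_per_split <- -diff(tab[, "rel error"]) / diff(tab[, "nsplit"])

# Compare with the CP column (the last row has no "next row" and is left out)
cbind(CP = tab[-nrow(tab), "CP"], drop_per_split = drop_per_split)
```

Relative error starts at 1 for the tree with no splits, so a CP of 0.0038 on the first row corresponds to a 0.38% reduction in mean square error.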

We could try adding to the complexity of our tree by decreasing the complexity parameter:

fit <- rpart(shot_made_flag ~ shot_clock,cp=0.0007,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
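To see how tree size responds to cp without redrawing the plot each time, you can count the final branches (leaves) for several values. This is a sketch on simulated shots; the toy data and the count_leaves helper are made up for illustration.

```r
library(rpart)

set.seed(6)
# Simulated shots (an assumption): make rate rises gradually with time on the clock
toy <- data.frame(shot_clock = runif(3000, 0, 24))
toy$shot_made_flag <- rbinom(3000, 1, 0.25 + 0.01 * toy$shot_clock)

# Fit a tree at the given cp and count its leaves
count_leaves <- function(cp) {
  fit <- rpart(shot_made_flag ~ shot_clock, data = toy, cp = cp, xval = 0)
  sum(fit$frame$var == "<leaf>")
}

cps <- c(0.01, 0.003, 0.001, 0.0003)
leaves <- sapply(cps, count_leaves)
data.frame(cp = cps, leaves = leaves)
```

Lowering cp never makes the tree smaller; whether it makes the tree more useful is a separate question.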

Using Action Types

Let’s try predicting whether a shot goes in using the shot type:

fit <- rpart(shot_made_flag ~ action_type,cp=0.0007,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

To understand the abbreviations in this tree and which shots fall where, we can use our old friend dplyr:

library(dplyr)

shots %>% group_by(action_type) %>%
  summarize(n = n(), FGper = mean(shot_made_flag)) %>%
  arrange(desc(n))

Q2. Which shot types fall in the highest percentage branch?

Q3. Which shot types fall in the lowest percentage branch?

Using Multiple Variables

Try out each of the following models, which combine two or three variables, and try fiddling with the complexity parameter.

fit <- rpart(shot_made_flag ~ action_type+shot_clock,cp=0.0015,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
fit <- rpart(shot_made_flag ~ action_type+shot_distance,cp=0.0015,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
fit <- rpart(shot_made_flag ~ shot_clock+shot_distance,cp=0.0015,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
fit <- rpart(shot_made_flag ~ shot_clock+shot_distance+action_type,cp=0.002,data=shots)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

Q4. Pick one of the models you created and describe the results. If possible, explain these results using your understanding of the game of basketball. Include the tree picture in the Google Doc with your answer.

Go Nuts!

Try building trees using some of the other variables such as dribbles, x, y and defender distance.