1. Introduction

This tutorial will walk you through building the Expected Points (xP) model in the NBA. The goal is to provide step-by-step instructions for creating a simple Random Forest model that predicts shot outcomes for the Boston Celtics during the 2024 NBA season.

Using the predicted shot outcomes, we can calculate the Expected Points (xP) for further analysis, such as comparing the team’s shooting performance against the expected values.

This approach is inspired by the concept of Expected Goals (xG) in football (soccer), which estimates the likelihood of a player scoring a goal in specific situations. For more details on xG, please refer to the Opta Analyst website


2. Data Import

First, load the required packages.

# Import packages
library(tidyverse) 
library(nbastatR)   
library(tidymodels) 
library(gganimate)


Use the nbastatR package to import the NBA shooting data for the Boston Celtics from the 2019-2020 to the 2023-2024 NBA seasons. (If you would like to explore other functions in nbastatR, please checkhere).

# Create a list to store the scrapped data
Boston_list <- list()

# Scrap the shots data for Boston from 2020-2024 seasons
for (season in 2020:2024) {
  
  ## Use "team_shots" function from nbastatR
  df <- teams_shots(teams = "Boston Celtics", # You can select any teams
                    seasons = season, 
                    return_message = FALSE) 
  
  ## Append each season data to the list
  Boston_list[[as.character(season)]] <- df
}

## Combine the list into a single data frame
Boston <- bind_rows(Boston_list)


3. Data Cleaning

1. Adjusting Shot Coordinates:
Before building the xP model, we will convert the x-coordinates and y-coordinates of the shot location into the appropriate format (Ex. 200 to 20 feet).

In the current dataset, the shot location at the center of the basket is represented as (x = 0, y = 0). Instead, we will adjust it to (x = 25, y = 5.25).

Note: These steps are optional

2. Converting Shot Outcomes:
To make predictions on the shot outcomes, we need to convert the shot outcome values (TRUE, FALSE) to binary values (1 = made, 0 = missed).

3. Creating Points Column:
Additionally, we will create a column that calculates the number of points scored based on the type of shot and whether it was made or missed.

Boston <- Boston %>% 
  mutate(
    # Convert x-coordinates to appropriate format (Ex. 200 to 25 feet)
    locationX = locationX / 10,
    # Adjust the range of x-coordinates (0 - 50 feet)
    locationX = case_when(locationX < 0 ~ 25 - locationX,  
                          locationX == 0 ~ 25,             
                          locationX > 0 ~ 25 - locationX), 
    
    # Convert y-coordinates to appropriate format (Ex. 450 to 45 feet)
    locationY = locationY / 10,
    # Adjust the range of y-coordinates (0 - 47 feet)
    locationY = case_when(locationY == 0 ~ 5.25,
                          locationY > 0 ~ 5.25 + locationY,
                          locationY < 0 ~ 5.25 - locationY),
    
    # Convert the shot outcomes to binary value (1 or 0)
    isShotMade = factor(case_when(isShotMade == "TRUE" ~ 1, # made = 1 
                           TRUE ~ 0)),                      # missed = 0
    
    # Creating a column for Points scored
    Points = case_when(isShotMade == 1 & typeShot == "2PT Field Goal" ~ 2, 
                       isShotMade == 1 & typeShot == "3PT Field Goal" ~ 3, 
                       TRUE ~ 0)                                           
  ) %>% 
  # Change the name of some columns
  rename(
    Season = yearSeason,
    Shot.made = isShotMade,
    Home.Team = slugTeamHome,
    Away.Team = slugTeamHome,
    X = locationX,
    Y = locationY
  )


4. Calculate the shot distance by using Euclidean distance formula: \[ Distance = \sqrt{(x - x_{Basket})^2 + (y - y_{Basket})^2} \] Note: This is an optional (You can use the shot-distance data available in the dataset)

# Define x- & y-coordinates of the basket
hoop_x <- 25
hoop_y <- 5.25

# Calculate shot distance 
Boston <- Boston %>% 
  mutate(
    Shot.distance = sqrt((X - hoop_x)^2 + (Y - hoop_y)^2))


4. Data Partitioning

4.1 - Model Building Data & Prediction Data

In this section, we prepare a dataset for modeling and predicting by splitting the original dataset based on the seasons.

  1. Model Building Data (2020-2023):
    We will create a subset of the original dataset that includes all seasons from 2020 to 2023.
    This dataset will be used to build Random Forest model.

  2. Prediction Data (2024):
    We will also create a separate subset containing only the data from the 2024 season.
    This dataset will be used to make the shot outcome predictions based on the model developed from the previous seasons.

# Split the data into 2020-2023 (building the model)
Boston_final <- subset(Boston, Season != "2024") # Not including the 2024 season

# Split the data into 2024 (making predictions)
Boston_2024 <- subset(Boston, Season == "2024")


4.2 - Training and Testing datasets

We further split the Model Building dataset into training and testing sets:

  1. Training dataset
    This dataset represents 70% of the original dataset, which will be used to train the model.

  2. Testing dataset
    This dataset represents 30% of the original dataset, which will be used to evaluate the predictive performance of the model on unseen data.

  3. Cross-Validation:
    Additionally, we will also create a resampling dataset, which performs 10-folds cross validation on the training set. This technique is more reliable compared to using the entire training set to build the model. (If you would like to learn more about the k-folds cross validation, please refer to the Analytics Vidhya website).

# Split dataframe into Training & Testing
Split <- initial_split(Boston_final)
Train <- training(Split)
Test <- testing(Split)

## Create a resample dataset 
Folds = vfold_cv(Train)


5. Random Forest Model (Without Tuning)

Once the training and testing datasets are ready, we will build Random Forest model without tuning any hyperparameters (adjustable settings in the model).

We will try to predict the shot outcomes (Made or Missed) based on type of shooting action and shooting distance. To improve the accuracy of the model, we will use the 10-folds cross-validation technique.

Finally, we collect the evaluation metrics from the trained model.

# Create a spec for Random Forest (without Tuning)
RF_spec <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

# Fit the model through 10-folds cross validation
RF1 <- fit_resamples(
  RF_spec,
  # Predict shot outcomes based on the Shooting types & Shot distance
  Shot.made ~ typeAction + Shot.distance, 
  # Use 10-folds cross validation
  Folds,                          
  control = control_resamples(save_pred = T)
)

# Collect evaluation metrics
RF1 %>% collect_metrics()
.metric .estimator mean n std_err .config
accuracy binary 0.6122130 10 0.0037153 Preprocessor1_Model1
brier_class binary 0.2305142 10 0.0011145 Preprocessor1_Model1
roc_auc binary 0.6403483 10 0.0043371 Preprocessor1_Model1

Accuracy and ROC-AUC closer to 1.0 indicate the high predictive performance of the model.

Accuracy: measures the ratio of correct predictions in relation to the total number of predictions made by the models

ROC-AUC: measures the model’s ability to distinguish difference between the made shots and the missed shots.

Brier Score (brier_class) closer to 0 indicates the high predictive performance of the model.

Brier Score: calculates the average squared error between predicted shot outcome probabilities and actual shot outcomes.


6. Random Forest Model (With Tuning)

This time, we will build Random Forest model with hyperparameter tuning.

To optimize the predictive performance of the model through the hyperparameter tuning, we will set up a grid search. This process evaluates the model using various combinations of the predefined parameters across the training folds.

In our example, we will use the grid search of 10, which means we will explore 10 different combinations of parameter values.

Finally, the valuation metrics (Accuracy, ROC-AUC, Brier-score) will be used to assess the model performance across these different configurations.

# Create a new spec for Random Forest (with Tuning)
RF_spec2 <- rand_forest(mtry = tune(),  # Tune the number of predictor variables
                        trees = tune(), # Tune the number of trees
                        min_n = tune()  # Tune the minimum number of observations in each node
                        ) %>%      
  set_mode("classification") %>% 
  set_engine("ranger", importance = "impurity")


# Tune grid
RF2 <- tune_grid(
  RF_spec2,
  Shot.made ~ typeAction + Shot.distance,
  resamples = Folds,
  # Use 10 grid search (10 combinations of parameter values)
  grid = 10,
  metrics = metric_set(accuracy, roc_auc, brier_class)
  )


1. Selecting Best Parameters:
From the grid search, we will select best combination of parameter values that shows the highest ROC-AUC score.

2. Finalizing the Model:
Next, we will evaluate the predictive performance of the model with the selected parameter values.

# Select the best parameter values
Best_ROC <- select_best(RF2, metric = "roc_auc") 

# Final model
RF_final <- finalize_model(RF_spec2, Best_ROC)

## Collect evaluation metrics
last_fit(RF_final, 
         Shot.made ~ typeAction + Shot.distance,
         Split) %>% 
  collect_metrics()
.metric .estimator .estimate .config
accuracy binary 0.6318875 Preprocessor1_Model1
roc_auc binary 0.6509722 Preprocessor1_Model1
brier_class binary 0.2270293 Preprocessor1_Model1


3. Fitting the Final Model:
Finally, we will fit the finalized Random Forest to the training dataset.

# Fit the final tuned model 
RF_fit <- RF_final %>% 
  fit(Shot.made ~ typeAction + Shot.distance,
      data = Train)


7. Shot Outcome Predictions for Boston Celtics

We will use the finalized model to make predictions on the shot outcomes for Boston Celtics in the 2024 season.

The goal is to compare the predicted shot outcomes with the actual shot outcomes, as well as evaluating the expected points based on shot distance and type.

1. Predicting Shot Outcomes in the 2024 Season:

# Predicting shot outcomes by using the 2024 season data
Pred <- augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    # Convert Predicted outcomes to Missed or Made
    .pred_class = case_when(.pred_class == 0 ~ "Missed", TRUE ~ "Made"), 
    
    # Convert Actual outcomes to Missed or Made
    Shot.made = case_when(Shot.made == 0 ~ "Missed", TRUE ~ "Made")      
  ) %>% 
  rename(
    Pred.Outcomes = .pred_class, # Change the name of column (Predicted outcome)
    Actual.Outcomes = Shot.made  # Change the name of column (Actual outcome)
  )


2. Calculating Expected Points within each shooting distance range:

# Predict the Exp points for Boston in each shot distance range 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    # Determine the expected points scored
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  # Group by each distance range
  group_by(zoneRange) %>% 
  summarise(
    T.Point = sum(Points),         # Calculate the Actual total points
    T.Exp.Points = sum(Exp.Points) # Calculate the Expected total points
  )
zoneRange T.Point T.Exp.Points
16-24 ft. 328 40
24+ ft. 4050 63
8-16 ft. 748 210
Back Court Shot 3 0
Less Than 8 ft. 3424 3608

The table illustrates that the model seems to have difficulty predicting shot outcomes for longer range shootings. We can observe a large underestimation produced by the model in the 24+ ft shooting range.


3. Calculating Expected Points within each shooting type:

# Compare the Exp points for Boston in each shooting type 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  # Group by each shooting type
  group_by(typeAction) %>% 
  summarise(
    T.Point = sum(Points),
    T.Exp.Points = sum(Exp.Points)
  )
Click to expand the table
typeAction T.Point T.Exp.Points
Alley Oop Dunk Shot 118 144
Alley Oop Layup shot 46 68
Cutting Dunk Shot 210 216
Cutting Finger Roll Layup Shot 16 20
Cutting Layup Shot 246 310
Driving Bank Hook Shot 6 6
Driving Dunk Shot 190 218
Driving Finger Roll Layup Shot 188 230
Driving Floating Bank Jump Shot 52 104
Driving Floating Jump Shot 188 4
Driving Hook Shot 30 8
Driving Layup Shot 722 564
Driving Reverse Layup Shot 60 92
Dunk Shot 56 56
Fadeaway Bank shot 6 2
Fadeaway Jump Shot 133 4
Finger Roll Layup Shot 20 14
Floating Jump shot 53 10
Hook Bank Shot 2 4
Hook Shot 28 20
Jump Bank Shot 25 14
Jump Shot 2770 4
Layup Shot 154 134
Pullup Jump shot 932 2
Putback Dunk Shot 30 38
Putback Layup Shot 148 206
Reverse Dunk Shot 2 6
Reverse Layup Shot 42 38
Running Alley Oop Dunk Shot 42 42
Running Alley Oop Layup Shot 10 16
Running Dunk Shot 184 194
Running Finger Roll Layup Shot 68 66
Running Jump Shot 265 20
Running Layup Shot 302 340
Running Pull-Up Jump Shot 143 43
Running Reverse Layup Shot 12 32
Step Back Bank Jump Shot 5 4
Step Back Jump shot 458 68
Tip Dunk Shot 46 50
Tip Layup Shot 138 256
Turnaround Bank Hook Shot 2 2
Turnaround Bank shot 42 48
Turnaround Fadeaway Bank Jump Shot 2 2
Turnaround Fadeaway shot 167 54
Turnaround Hook Shot 54 84
Turnaround Jump Shot 140 64


4. Comparing Season Total Points with Total Expected Points:

# Compare the season total points and total Exp points 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  summarise(
    T.Point = sum(Points),
    T.Exp.Points = sum(Exp.Points)
  )
T.Point T.Exp.Points
8553 3921

Based on the results, the model underestimates the total number of points scored by the Boston Celtics in the 2024 season. To improve prediction accuracy, we may need to incorporate additional features, such as player-specific metrics, shot difficulty, or defensive pressure factors.


8. Create NBA Half-court Diagram

In this section, we will create the NBA half-court diagram.

This diagram will serve as a reference for plotting shot locations on the court.

Note: Feel free to create your own diagram or utilize any pre-made diagram available.

Click to see the code
# Assign x and y limits representing the half court dimentions (Width = 50ft, Length = 94ft / 2)
court <- ggplot(data = data.frame(0,0)) +
  
  # Boundary of the half court (End line & Side lines) 
  geom_segment(aes(x = 0, y = 0, xend = 0, yend = 47)) + # Draw left boundary of the court 
  geom_segment(aes(x = 50, y = 0, xend = 0, yend = 0)) + # Draw bottom boundary of the court 
  geom_segment(aes(x = 0, y = 47, xend = 50, yend = 47)) + # Draw top boundary of the court 
  geom_segment(aes(x = 50, y = 0, xend = 50, yend = 47)) + # Draw right boundary of the court
  
  #Centre Semi-Circle (Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 47, xend = 25, yend = 41), curvature = 0.44) + # Draw left half arc
  geom_curve(aes(x = 31, y = 47, xend = 25, yend = 41), curvature = -0.44) + # Draw right half arc
  
  # Restraining Semi-Circle (Radius = 4ft /2)
  geom_curve(aes(x = 23, y = 47, xend = 25, yend = 45), curvature = 0.44) + # Draw left half arc
  geom_curve(aes(x = 27, y = 47, xend = 25, yend = 45), curvature = -0.44) + # Draw right half arc
  
  # Free-Throw Line - (Width = 16ft, Length = 19ft)
  geom_segment(aes(x = 33, y = 0, xend = 33, yend = 19)) + # Draw right free-throw lane 
  geom_segment(aes(x = 17, y = 0, xend = 17, yend = 19)) + # Draw left free-throw lane
  geom_segment(aes(x = 17, y = 19, xend = 33, yend = 19)) + # Draw top free-throw line
  
  # Free-Throw Circle Line (Top Aec: Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 19, xend = 25, yend = 25), curvature = -0.42) + # Draw left half arc
  geom_curve(aes(x = 31, y = 19, xend = 25, yend = 25), curvature = 0.42) + # Draw right half arc
  
  # Free-Throw Circle Line (Bottom. Arc: Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 19, xend = 25, yend = 13), curvature = 0.42, linetype = "longdash", linewidth = 0.6) + # Draw left half arc (Dashed line)
  geom_curve(aes(x = 31, y = 19, xend = 25, yend = 13), curvature = -0.42, linetype = "longdash", linewidth = 0.6) + # Draw right hald arc (Dashed line)
  
  # Backboard(Width = 6ft, 4ft away from end line)
  geom_segment(aes(x = 22, y = 4, xend = 28, yend = 4)) +
  geom_segment(aes(x = 25, y = 4, xend = 25, yend = 4.5)) + # Draw line from backboard to hoop (5.25ft - 0.75ft: radius of hoop)
  
  # Hoop (Diameter = 1.5ft, Centre of hoop is 5.25ft away from end line)
  geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = 1) + # Draw right half arc
  geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = -1) + # Draw left half arc
  
  # Restricted Area Arc Line (Radius = 4ft, 4ft away from centre of hoop: 5.25ft away from end line)
  geom_curve(aes(x = 21, y = 4, xend = 25, yend = 9.25), curvature = -0.46) + # Draw left half arc
  geom_curve(aes(x = 29, y = 4, xend = 25, yend = 9.25), curvature = 0.46) + # Draw right half arc
  
  # Three-Point Side Lines (Length = 14ft, 3ft away from each side line)
  geom_segment(aes(x = 3, y = 0, xend = 3, yend = 14)) + # Draw left three-point side line
  geom_segment(aes(x = 47, y = 0, xend = 47, yend = 14)) + # Draw right three-point side line
  
  # Three-Point Line Arc (Radius = 29ft (5.25ft + 23.75ft))
  geom_curve(aes(x = 3, y = 14, xend = 25, yend = 29), curvature = -0.32) + # Draw left three-point arc
  geom_curve(aes(x = 47, y = 14, xend = 25, yend = 29), curvature = 0.32) +# Draw right three-point arc
  
  coord_equal() +
  
# Modify Theme Settings
theme(
  # Remove x-axis tittle
  axis.title.x = element_blank(),
  # Remove y-axis tittle
  axis.title.y = element_blank(),
  # Remove text on x-axis 
  axis.text.x = element_blank(),
  # Remove ticks on x-axis
  axis.ticks.x = element_blank(),
  # Remove text on y-axis
  axis.text.y = element_blank(),
  # Remove ticks on y-axis
  axis.ticks.y = element_blank(),
  # Remove grids
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  # Change the background color to White
  panel.background = element_rect("white")
)


9. Plot Shot Outcomes (Actual vs. Exp)

Finally, we will create animated visualizations of the shot outcomes (Actual vs. Predicted) for a specific NBA game.
Denver Nuggets vs. Boston Celtics - Mar 7, 2024

1. Create an animated plot for Actual shot outcomes

# Actual plot
Actual_plot <- 
  # Apply the half-court diagram
  court +
  # Create a scatter plot for actual shot outcomes
  geom_point(data = subset(Pred,  idGame == "22300906"), # Subset the data (BOS vs. DEN)
             aes(x = X, 
                 y = Y, 
                 fill = Actual.Outcomes), # Fill the color in each point, based on Actual.Outcomes
             pch = 21,      # Customise the shape (circle)
             size = 3,      # Customise the size of shape & outer color (black)
             color = "black") +         
  # Customise the color (lightblue = made | red = missed)
  scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
  # Include the title
  labs(title = "Actual shot outcomes") +
  # Customise the font size and color for the title
  theme(text = element_text(size = 15), 
        plot.caption = element_text(color = "black")) +
  # Produce an animated plot 
  transition_states(as.factor(idEvent),      # Use "idEvent" to represent each frame in animation
                    transition_length = 5) + # Customise the animation speed 
  # Keep the previous data points in plot
  shadow_mark()                        

# Animate and export the plot
anim_save("Actual.gif", Actual_plot)


2. Create an animated plot for Predicted shot outcomes

# Exp plot
Exp_plot <- 
  court +
  geom_point(data = subset(Pred,  idGame == "22300906"),
             aes(x = X, y = Y, 
                 fill = Pred.Outcomes), 
             pch = 21, 
             size = 3, 
             color="black") +
  scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
  labs(title = "Expected shot outcomes") +
  theme(text = element_text(size = 15), 
        plot.caption = element_text(color = "black")) +
  transition_states(as.factor(idEvent),
                    transition_length = 5) +
  shadow_mark()

# Animate and export the plot
anim_save("Exp.gif", Exp_plot)

Expected Plot Actual Plot