Building Expected Points Model

1. Introduction

This tutorial will walk you through building the Expected Points (xP) model in the NBA. The goal is to provide step-by-step instructions for creating a simple Random Forest model that predicts shot outcomes for the Boston Celtics during the 2024 NBA season.

Using the predicted shot outcomes, we can calculate the Expected Points (xP) for further analysis, such as comparing the team’s shooting performance against the expected values.

This approach is inspired by the concept of Expected Goals (xG) in football (soccer), which estimates the likelihood of a player scoring a goal in specific situations. For more details on xG, please refer to the Opta Analyst website

2. Data Import

First, load the required packages.

# Import packages
library(tidyverse) 
library(nbastatR)   
library(tidymodels) 
library(gganimate)

Use the nbastatR package to import the NBA shooting data for the Boston Celtics from the 2019-2020 to the 2023-2024 NBA seasons. (If you would like to explore other functions in nbastatR, please checkhere).

# Create a list to store the scrapped data
Boston_list <- list()

# Scrap the shots data for Boston from 2020-2024 seasons
for (season in 2020:2024) {
  
  ## Use "team_shots" function from nbastatR
  df <- teams_shots(teams = "Boston Celtics", # You can select any teams
                    seasons = season, 
                    return_message = FALSE) 
  
  ## Append each season data to the list
  Boston_list[[as.character(season)]] <- df
}

## Combine the list into a single data frame
Boston <- bind_rows(Boston_list)

3. Data Cleaning

1. Adjusting Shot Coordinates:
Before building the xP model, we will convert the x-coordinates and y-coordinates of the shot location into the appropriate format (Ex. 200 to 20 feet).

In the current dataset, the shot location at the center of the basket is represented as (x = 0, y = 0). Instead, we will adjust it to (x = 25, y = 5.25).

Note: These steps are optional

2. Converting Shot Outcomes:
To make predictions on the shot outcomes, we need to convert the shot outcome values (TRUE, FALSE) to binary values (1 = made, 0 = missed).

3. Creating Points Column:
Additionally, we will create a column that calculates the number of points scored based on the type of shot and whether it was made or missed.

Boston <- Boston %>% 
  mutate(
    # Convert x-coordinates to appropriate format (Ex. 200 to 25 feet)
    locationX = locationX / 10,
    # Adjust the range of x-coordinates (0 - 50 feet)
    locationX = case_when(locationX < 0 ~ 25 - locationX,  
                          locationX == 0 ~ 25,             
                          locationX > 0 ~ 25 - locationX), 
    
    # Convert y-coordinates to appropriate format (Ex. 450 to 45 feet)
    locationY = locationY / 10,
    # Adjust the range of y-coordinates (0 - 47 feet)
    locationY = case_when(locationY == 0 ~ 5.25,
                          locationY > 0 ~ 5.25 + locationY,
                          locationY < 0 ~ 5.25 - locationY),
    
    # Convert the shot outcomes to binary value (1 or 0)
    isShotMade = factor(case_when(isShotMade == "TRUE" ~ 1, # made = 1 
                           TRUE ~ 0)),                      # missed = 0
    
    # Creating a column for Points scored
    Points = case_when(isShotMade == 1 & typeShot == "2PT Field Goal" ~ 2, 
                       isShotMade == 1 & typeShot == "3PT Field Goal" ~ 3, 
                       TRUE ~ 0)                                           
  ) %>% 
  # Change the name of some columns
  rename(
    Season = yearSeason,
    Shot.made = isShotMade,
    Home.Team = slugTeamHome,
    Away.Team = slugTeamHome,
    X = locationX,
    Y = locationY
  )

4. Calculate the shot distance by using Euclidean distance formula: \[ Distance = \sqrt{(x - x_{Basket})^2 + (y - y_{Basket})^2} \] Note: This is an optional (You can use the shot-distance data available in the dataset)

# Define x- & y-coordinates of the basket
hoop_x <- 25
hoop_y <- 5.25

# Calculate shot distance 
Boston <- Boston %>% 
  mutate(
    Shot.distance = sqrt((X - hoop_x)^2 + (Y - hoop_y)^2))

4. Data Partitioning

4.1 - Model Building Data & Prediction Data

In this section, we prepare a dataset for modeling and predicting by splitting the original dataset based on the seasons.

Model Building Data (2020-2023):
We will create a subset of the original dataset that includes all seasons from 2020 to 2023.
This dataset will be used to build Random Forest model.
Prediction Data (2024):
We will also create a separate subset containing only the data from the 2024 season.
This dataset will be used to make the shot outcome predictions based on the model developed from the previous seasons.

# Split the data into 2020-2023 (building the model)
Boston_final <- subset(Boston, Season != "2024") # Not including the 2024 season

# Split the data into 2024 (making predictions)
Boston_2024 <- subset(Boston, Season == "2024")

4.2 - Training and Testing datasets

We further split the Model Building dataset into training and testing sets:

Training dataset
This dataset represents 70% of the original dataset, which will be used to train the model.
Testing dataset
This dataset represents 30% of the original dataset, which will be used to evaluate the predictive performance of the model on unseen data.
Cross-Validation:
Additionally, we will also create a resampling dataset, which performs 10-folds cross validation on the training set. This technique is more reliable compared to using the entire training set to build the model. (If you would like to learn more about the k-folds cross validation, please refer to the Analytics Vidhya website).

# Split dataframe into Training & Testing
Split <- initial_split(Boston_final)
Train <- training(Split)
Test <- testing(Split)

## Create a resample dataset 
Folds = vfold_cv(Train)

5. Random Forest Model (Without Tuning)

Once the training and testing datasets are ready, we will build Random Forest model without tuning any hyperparameters (adjustable settings in the model).

We will try to predict the shot outcomes (Made or Missed) based on type of shooting action and shooting distance. To improve the accuracy of the model, we will use the 10-folds cross-validation technique.

Finally, we collect the evaluation metrics from the trained model.

# Create a spec for Random Forest (without Tuning)
RF_spec <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

# Fit the model through 10-folds cross validation
RF1 <- fit_resamples(
  RF_spec,
  # Predict shot outcomes based on the Shooting types & Shot distance
  Shot.made ~ typeAction + Shot.distance, 
  # Use 10-folds cross validation
  Folds,                          
  control = control_resamples(save_pred = T)
)

# Collect evaluation metrics
RF1 %>% collect_metrics()

.metric	.estimator	mean	n	std_err	.config
accuracy	binary	0.6122130	10	0.0037153	Preprocessor1_Model1
brier_class	binary	0.2305142	10	0.0011145	Preprocessor1_Model1
roc_auc	binary	0.6403483	10	0.0043371	Preprocessor1_Model1

Accuracy and ROC-AUC closer to 1.0 indicate the high predictive performance of the model.

Accuracy: measures the ratio of correct predictions in relation to the total number of predictions made by the models

ROC-AUC: measures the model’s ability to distinguish difference between the made shots and the missed shots.

Brier Score (brier_class) closer to 0 indicates the high predictive performance of the model.

Brier Score: calculates the average squared error between predicted shot outcome probabilities and actual shot outcomes.

6. Random Forest Model (With Tuning)

This time, we will build Random Forest model with hyperparameter tuning.

To optimize the predictive performance of the model through the hyperparameter tuning, we will set up a grid search. This process evaluates the model using various combinations of the predefined parameters across the training folds.

In our example, we will use the grid search of 10, which means we will explore 10 different combinations of parameter values.

Finally, the valuation metrics (Accuracy, ROC-AUC, Brier-score) will be used to assess the model performance across these different configurations.

# Create a new spec for Random Forest (with Tuning)
RF_spec2 <- rand_forest(mtry = tune(),  # Tune the number of predictor variables
                        trees = tune(), # Tune the number of trees
                        min_n = tune()  # Tune the minimum number of observations in each node
                        ) %>%      
  set_mode("classification") %>% 
  set_engine("ranger", importance = "impurity")


# Tune grid
RF2 <- tune_grid(
  RF_spec2,
  Shot.made ~ typeAction + Shot.distance,
  resamples = Folds,
  # Use 10 grid search (10 combinations of parameter values)
  grid = 10,
  metrics = metric_set(accuracy, roc_auc, brier_class)
  )

1. Selecting Best Parameters:
From the grid search, we will select best combination of parameter values that shows the highest ROC-AUC score.

2. Finalizing the Model:
Next, we will evaluate the predictive performance of the model with the selected parameter values.

# Select the best parameter values
Best_ROC <- select_best(RF2, metric = "roc_auc") 

# Final model
RF_final <- finalize_model(RF_spec2, Best_ROC)

## Collect evaluation metrics
last_fit(RF_final, 
         Shot.made ~ typeAction + Shot.distance,
         Split) %>% 
  collect_metrics()

.metric	.estimator	.estimate	.config
accuracy	binary	0.6318875	Preprocessor1_Model1
roc_auc	binary	0.6509722	Preprocessor1_Model1
brier_class	binary	0.2270293	Preprocessor1_Model1

3. Fitting the Final Model:
Finally, we will fit the finalized Random Forest to the training dataset.

# Fit the final tuned model 
RF_fit <- RF_final %>% 
  fit(Shot.made ~ typeAction + Shot.distance,
      data = Train)

7. Shot Outcome Predictions for Boston Celtics

We will use the finalized model to make predictions on the shot outcomes for Boston Celtics in the 2024 season.

The goal is to compare the predicted shot outcomes with the actual shot outcomes, as well as evaluating the expected points based on shot distance and type.

1. Predicting Shot Outcomes in the 2024 Season:

# Predicting shot outcomes by using the 2024 season data
Pred <- augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    # Convert Predicted outcomes to Missed or Made
    .pred_class = case_when(.pred_class == 0 ~ "Missed", TRUE ~ "Made"), 
    
    # Convert Actual outcomes to Missed or Made
    Shot.made = case_when(Shot.made == 0 ~ "Missed", TRUE ~ "Made")      
  ) %>% 
  rename(
    Pred.Outcomes = .pred_class, # Change the name of column (Predicted outcome)
    Actual.Outcomes = Shot.made  # Change the name of column (Actual outcome)
  )

2. Calculating Expected Points within each shooting distance range:

# Predict the Exp points for Boston in each shot distance range 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    # Determine the expected points scored
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  # Group by each distance range
  group_by(zoneRange) %>% 
  summarise(
    T.Point = sum(Points),         # Calculate the Actual total points
    T.Exp.Points = sum(Exp.Points) # Calculate the Expected total points
  )

zoneRange	T.Point	T.Exp.Points
16-24 ft.	328	40
24+ ft.	4050	63
8-16 ft.	748	210
Back Court Shot	3	0
Less Than 8 ft.	3424	3608

The table illustrates that the model seems to have difficulty predicting shot outcomes for longer range shootings. We can observe a large underestimation produced by the model in the 24+ ft shooting range.

3. Calculating Expected Points within each shooting type:

# Compare the Exp points for Boston in each shooting type 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  # Group by each shooting type
  group_by(typeAction) %>% 
  summarise(
    T.Point = sum(Points),
    T.Exp.Points = sum(Exp.Points)
  )

Click to expand the table

typeAction	T.Point	T.Exp.Points
Alley Oop Dunk Shot	118	144
Alley Oop Layup shot	46	68
Cutting Dunk Shot	210	216
Cutting Finger Roll Layup Shot	16	20
Cutting Layup Shot	246	310
Driving Bank Hook Shot	6	6
Driving Dunk Shot	190	218
Driving Finger Roll Layup Shot	188	230
Driving Floating Bank Jump Shot	52	104
Driving Floating Jump Shot	188	4
Driving Hook Shot	30	8
Driving Layup Shot	722	564
Driving Reverse Layup Shot	60	92
Dunk Shot	56	56
Fadeaway Bank shot	6	2
Fadeaway Jump Shot	133	4
Finger Roll Layup Shot	20	14
Floating Jump shot	53	10
Hook Bank Shot	2	4
Hook Shot	28	20
Jump Bank Shot	25	14
Jump Shot	2770	4
Layup Shot	154	134
Pullup Jump shot	932	2
Putback Dunk Shot	30	38
Putback Layup Shot	148	206
Reverse Dunk Shot	2	6
Reverse Layup Shot	42	38
Running Alley Oop Dunk Shot	42	42
Running Alley Oop Layup Shot	10	16
Running Dunk Shot	184	194
Running Finger Roll Layup Shot	68	66
Running Jump Shot	265	20
Running Layup Shot	302	340
Running Pull-Up Jump Shot	143	43
Running Reverse Layup Shot	12	32
Step Back Bank Jump Shot	5	4
Step Back Jump shot	458	68
Tip Dunk Shot	46	50
Tip Layup Shot	138	256
Turnaround Bank Hook Shot	2	2
Turnaround Bank shot	42	48
Turnaround Fadeaway Bank Jump Shot	2	2
Turnaround Fadeaway shot	167	54
Turnaround Hook Shot	54	84
Turnaround Jump Shot	140	64

4. Comparing Season Total Points with Total Expected Points:

# Compare the season total points and total Exp points 
augment(RF_fit, new_data = Boston_2024) %>% 
  mutate(
    Exp.Points = case_when(.pred_class == 0 ~ 0,
                           .pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
                           .pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
  ) %>% 
  summarise(
    T.Point = sum(Points),
    T.Exp.Points = sum(Exp.Points)
  )

T.Point	T.Exp.Points
8553	3921

Based on the results, the model underestimates the total number of points scored by the Boston Celtics in the 2024 season. To improve prediction accuracy, we may need to incorporate additional features, such as player-specific metrics, shot difficulty, or defensive pressure factors.

8. Create NBA Half-court Diagram

In this section, we will create the NBA half-court diagram.

This diagram will serve as a reference for plotting shot locations on the court.

Note: Feel free to create your own diagram or utilize any pre-made diagram available.

Click to see the code

# Assign x and y limits representing the half court dimentions (Width = 50ft, Length = 94ft / 2)
court <- ggplot(data = data.frame(0,0)) +
  
  # Boundary of the half court (End line & Side lines) 
  geom_segment(aes(x = 0, y = 0, xend = 0, yend = 47)) + # Draw left boundary of the court 
  geom_segment(aes(x = 50, y = 0, xend = 0, yend = 0)) + # Draw bottom boundary of the court 
  geom_segment(aes(x = 0, y = 47, xend = 50, yend = 47)) + # Draw top boundary of the court 
  geom_segment(aes(x = 50, y = 0, xend = 50, yend = 47)) + # Draw right boundary of the court
  
  #Centre Semi-Circle (Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 47, xend = 25, yend = 41), curvature = 0.44) + # Draw left half arc
  geom_curve(aes(x = 31, y = 47, xend = 25, yend = 41), curvature = -0.44) + # Draw right half arc
  
  # Restraining Semi-Circle (Radius = 4ft /2)
  geom_curve(aes(x = 23, y = 47, xend = 25, yend = 45), curvature = 0.44) + # Draw left half arc
  geom_curve(aes(x = 27, y = 47, xend = 25, yend = 45), curvature = -0.44) + # Draw right half arc
  
  # Free-Throw Line - (Width = 16ft, Length = 19ft)
  geom_segment(aes(x = 33, y = 0, xend = 33, yend = 19)) + # Draw right free-throw lane 
  geom_segment(aes(x = 17, y = 0, xend = 17, yend = 19)) + # Draw left free-throw lane
  geom_segment(aes(x = 17, y = 19, xend = 33, yend = 19)) + # Draw top free-throw line
  
  # Free-Throw Circle Line (Top Aec: Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 19, xend = 25, yend = 25), curvature = -0.42) + # Draw left half arc
  geom_curve(aes(x = 31, y = 19, xend = 25, yend = 25), curvature = 0.42) + # Draw right half arc
  
  # Free-Throw Circle Line (Bottom. Arc: Radius = 12ft / 2)
  geom_curve(aes(x = 19, y = 19, xend = 25, yend = 13), curvature = 0.42, linetype = "longdash", linewidth = 0.6) + # Draw left half arc (Dashed line)
  geom_curve(aes(x = 31, y = 19, xend = 25, yend = 13), curvature = -0.42, linetype = "longdash", linewidth = 0.6) + # Draw right hald arc (Dashed line)
  
  # Backboard(Width = 6ft, 4ft away from end line)
  geom_segment(aes(x = 22, y = 4, xend = 28, yend = 4)) +
  geom_segment(aes(x = 25, y = 4, xend = 25, yend = 4.5)) + # Draw line from backboard to hoop (5.25ft - 0.75ft: radius of hoop)
  
  # Hoop (Diameter = 1.5ft, Centre of hoop is 5.25ft away from end line)
  geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = 1) + # Draw right half arc
  geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = -1) + # Draw left half arc
  
  # Restricted Area Arc Line (Radius = 4ft, 4ft away from centre of hoop: 5.25ft away from end line)
  geom_curve(aes(x = 21, y = 4, xend = 25, yend = 9.25), curvature = -0.46) + # Draw left half arc
  geom_curve(aes(x = 29, y = 4, xend = 25, yend = 9.25), curvature = 0.46) + # Draw right half arc
  
  # Three-Point Side Lines (Length = 14ft, 3ft away from each side line)
  geom_segment(aes(x = 3, y = 0, xend = 3, yend = 14)) + # Draw left three-point side line
  geom_segment(aes(x = 47, y = 0, xend = 47, yend = 14)) + # Draw right three-point side line
  
  # Three-Point Line Arc (Radius = 29ft (5.25ft + 23.75ft))
  geom_curve(aes(x = 3, y = 14, xend = 25, yend = 29), curvature = -0.32) + # Draw left three-point arc
  geom_curve(aes(x = 47, y = 14, xend = 25, yend = 29), curvature = 0.32) +# Draw right three-point arc
  
  coord_equal() +
  
# Modify Theme Settings
theme(
  # Remove x-axis tittle
  axis.title.x = element_blank(),
  # Remove y-axis tittle
  axis.title.y = element_blank(),
  # Remove text on x-axis 
  axis.text.x = element_blank(),
  # Remove ticks on x-axis
  axis.ticks.x = element_blank(),
  # Remove text on y-axis
  axis.text.y = element_blank(),
  # Remove ticks on y-axis
  axis.ticks.y = element_blank(),
  # Remove grids
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  # Change the background color to White
  panel.background = element_rect("white")
)

9. Plot Shot Outcomes (Actual vs. Exp)

Finally, we will create animated visualizations of the shot outcomes (Actual vs. Predicted) for a specific NBA game.
Denver Nuggets vs. Boston Celtics - Mar 7, 2024

1. Create an animated plot for Actual shot outcomes

# Actual plot
Actual_plot <- 
  # Apply the half-court diagram
  court +
  # Create a scatter plot for actual shot outcomes
  geom_point(data = subset(Pred,  idGame == "22300906"), # Subset the data (BOS vs. DEN)
             aes(x = X, 
                 y = Y, 
                 fill = Actual.Outcomes), # Fill the color in each point, based on Actual.Outcomes
             pch = 21,      # Customise the shape (circle)
             size = 3,      # Customise the size of shape & outer color (black)
             color = "black") +         
  # Customise the color (lightblue = made | red = missed)
  scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
  # Include the title
  labs(title = "Actual shot outcomes") +
  # Customise the font size and color for the title
  theme(text = element_text(size = 15), 
        plot.caption = element_text(color = "black")) +
  # Produce an animated plot 
  transition_states(as.factor(idEvent),      # Use "idEvent" to represent each frame in animation
                    transition_length = 5) + # Customise the animation speed 
  # Keep the previous data points in plot
  shadow_mark()                        

# Animate and export the plot
anim_save("Actual.gif", Actual_plot)

2. Create an animated plot for Predicted shot outcomes

# Exp plot
Exp_plot <- 
  court +
  geom_point(data = subset(Pred,  idGame == "22300906"),
             aes(x = X, y = Y, 
                 fill = Pred.Outcomes), 
             pch = 21, 
             size = 3, 
             color="black") +
  scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
  labs(title = "Expected shot outcomes") +
  theme(text = element_text(size = 15), 
        plot.caption = element_text(color = "black")) +
  transition_states(as.factor(idEvent),
                    transition_length = 5) +
  shadow_mark()

# Animate and export the plot
anim_save("Exp.gif", Exp_plot)

Expected Plot

Building Expected Points Model - NBA

SPE5AMS – Portfolio Assessment

Natsumi Sugisaki

2024-11-01