This tutorial will walk you through building the Expected
Points (xP) model in the NBA. The goal is to provide
step-by-step instructions for creating a simple Random Forest
model that predicts shot outcomes for the Boston
Celtics during the 2024 NBA season.
Using the predicted shot outcomes, we can calculate the
Expected Points (xP) for further analysis, such as
comparing the team’s shooting performance against the expected values.
This approach is inspired by the concept of Expected Goals (xG) in football (soccer), which estimates the likelihood of a player scoring a goal in specific situations. For more details on xG, please refer to the Opta Analyst website
First, load the required packages.
# Import packages
library(tidyverse)
library(nbastatR)
library(tidymodels)
library(gganimate)
Use the nbastatR package to import the NBA shooting data for the Boston Celtics from the 2019-2020 to the 2023-2024 NBA seasons. (If you would like to explore other functions in nbastatR, please checkhere).
# Create a list to store the scrapped data
Boston_list <- list()
# Scrap the shots data for Boston from 2020-2024 seasons
for (season in 2020:2024) {
## Use "team_shots" function from nbastatR
df <- teams_shots(teams = "Boston Celtics", # You can select any teams
seasons = season,
return_message = FALSE)
## Append each season data to the list
Boston_list[[as.character(season)]] <- df
}
## Combine the list into a single data frame
Boston <- bind_rows(Boston_list)
1. Adjusting Shot Coordinates:
Before
building the xP model, we will convert the x-coordinates and
y-coordinates of the shot location into the appropriate format (Ex. 200
to 20 feet).
In the current dataset, the shot location at the
center of the basket is represented as (x = 0, y = 0). Instead, we will
adjust it to (x = 25, y = 5.25).
Note:
These steps are optional
2. Converting Shot Outcomes:
To make
predictions on the shot outcomes, we need to convert the shot outcome
values (TRUE, FALSE) to binary values (1 = made, 0 = missed).
3. Creating Points Column:
Additionally,
we will create a column that calculates the number of points scored
based on the type of shot and whether it was made or missed.
Boston <- Boston %>%
mutate(
# Convert x-coordinates to appropriate format (Ex. 200 to 25 feet)
locationX = locationX / 10,
# Adjust the range of x-coordinates (0 - 50 feet)
locationX = case_when(locationX < 0 ~ 25 - locationX,
locationX == 0 ~ 25,
locationX > 0 ~ 25 - locationX),
# Convert y-coordinates to appropriate format (Ex. 450 to 45 feet)
locationY = locationY / 10,
# Adjust the range of y-coordinates (0 - 47 feet)
locationY = case_when(locationY == 0 ~ 5.25,
locationY > 0 ~ 5.25 + locationY,
locationY < 0 ~ 5.25 - locationY),
# Convert the shot outcomes to binary value (1 or 0)
isShotMade = factor(case_when(isShotMade == "TRUE" ~ 1, # made = 1
TRUE ~ 0)), # missed = 0
# Creating a column for Points scored
Points = case_when(isShotMade == 1 & typeShot == "2PT Field Goal" ~ 2,
isShotMade == 1 & typeShot == "3PT Field Goal" ~ 3,
TRUE ~ 0)
) %>%
# Change the name of some columns
rename(
Season = yearSeason,
Shot.made = isShotMade,
Home.Team = slugTeamHome,
Away.Team = slugTeamHome,
X = locationX,
Y = locationY
)
4. Calculate the shot distance by using Euclidean distance formula: \[ Distance = \sqrt{(x - x_{Basket})^2 + (y - y_{Basket})^2} \] Note: This is an optional (You can use the shot-distance data available in the dataset)
# Define x- & y-coordinates of the basket
hoop_x <- 25
hoop_y <- 5.25
# Calculate shot distance
Boston <- Boston %>%
mutate(
Shot.distance = sqrt((X - hoop_x)^2 + (Y - hoop_y)^2))
4.1 - Model Building Data & Prediction Data
In this section, we prepare a dataset for modeling and
predicting by splitting the original dataset based on the
seasons.
Model Building Data (2020-2023):
We will create a
subset of the original dataset that includes all seasons from 2020 to
2023.
This dataset will be used to build Random Forest
model.
Prediction Data (2024):
We will also create a
separate subset containing only the data from the 2024 season.
This
dataset will be used to make the shot outcome
predictions based on the model developed from the previous
seasons.
# Split the data into 2020-2023 (building the model)
Boston_final <- subset(Boston, Season != "2024") # Not including the 2024 season
# Split the data into 2024 (making predictions)
Boston_2024 <- subset(Boston, Season == "2024")
4.2 - Training and Testing datasets
We further split the Model Building dataset into training and testing
sets:
Training dataset
This dataset represents
70% of the original dataset, which will be used to
train the model.
Testing dataset
This dataset represents
30% of the original dataset, which will be used to
evaluate the predictive performance of the model on unseen
data.
Cross-Validation:
Additionally, we will also create a
resampling dataset, which performs 10-folds cross validation on
the training set. This technique is more reliable compared to using the
entire training set to build the model. (If you would like to learn
more about the k-folds cross validation, please refer to the Analytics
Vidhya website).
# Split dataframe into Training & Testing
Split <- initial_split(Boston_final)
Train <- training(Split)
Test <- testing(Split)
## Create a resample dataset
Folds = vfold_cv(Train)
Once the training and testing datasets are ready, we will build
Random Forest model without tuning any
hyperparameters (adjustable settings in the model).
We will try to predict the shot outcomes (Made or Missed) based on
type of shooting action and shooting
distance. To improve the accuracy of the model, we will use the
10-folds cross-validation technique.
Finally, we collect the evaluation metrics from the trained model.
# Create a spec for Random Forest (without Tuning)
RF_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger")
# Fit the model through 10-folds cross validation
RF1 <- fit_resamples(
RF_spec,
# Predict shot outcomes based on the Shooting types & Shot distance
Shot.made ~ typeAction + Shot.distance,
# Use 10-folds cross validation
Folds,
control = control_resamples(save_pred = T)
)
# Collect evaluation metrics
RF1 %>% collect_metrics()
| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| accuracy | binary | 0.6122130 | 10 | 0.0037153 | Preprocessor1_Model1 |
| brier_class | binary | 0.2305142 | 10 | 0.0011145 | Preprocessor1_Model1 |
| roc_auc | binary | 0.6403483 | 10 | 0.0043371 | Preprocessor1_Model1 |
Accuracy and ROC-AUC closer to 1.0 indicate the high
predictive performance of the model.
Accuracy: measures the ratio of correct
predictions in relation to the total number of predictions made by the
models
Brier Score (brier_class) closer to 0 indicates the
high predictive performance of the model.
This time, we will build Random Forest model with
hyperparameter tuning.
To optimize the predictive performance of the model through the
hyperparameter tuning, we will set up a grid search.
This process evaluates the model using various combinations of
the predefined parameters across the training folds.
In our example, we will use the grid search of 10, which means we
will explore 10 different combinations of parameter values.
Finally, the valuation metrics (Accuracy, ROC-AUC, Brier-score) will be used to assess the model performance across these different configurations.
# Create a new spec for Random Forest (with Tuning)
RF_spec2 <- rand_forest(mtry = tune(), # Tune the number of predictor variables
trees = tune(), # Tune the number of trees
min_n = tune() # Tune the minimum number of observations in each node
) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity")
# Tune grid
RF2 <- tune_grid(
RF_spec2,
Shot.made ~ typeAction + Shot.distance,
resamples = Folds,
# Use 10 grid search (10 combinations of parameter values)
grid = 10,
metrics = metric_set(accuracy, roc_auc, brier_class)
)
1. Selecting Best Parameters:
From the
grid search, we will select best combination of parameter values that
shows the highest ROC-AUC score.
2. Finalizing the Model:
Next, we will
evaluate the predictive performance of the model with the selected
parameter values.
# Select the best parameter values
Best_ROC <- select_best(RF2, metric = "roc_auc")
# Final model
RF_final <- finalize_model(RF_spec2, Best_ROC)
## Collect evaluation metrics
last_fit(RF_final,
Shot.made ~ typeAction + Shot.distance,
Split) %>%
collect_metrics()
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.6318875 | Preprocessor1_Model1 |
| roc_auc | binary | 0.6509722 | Preprocessor1_Model1 |
| brier_class | binary | 0.2270293 | Preprocessor1_Model1 |
3. Fitting the Final Model:
Finally, we
will fit the finalized Random Forest to the training dataset.
# Fit the final tuned model
RF_fit <- RF_final %>%
fit(Shot.made ~ typeAction + Shot.distance,
data = Train)
We will use the finalized model to make predictions on the shot
outcomes for Boston Celtics in the 2024 season.
The goal is to compare the predicted shot outcomes with the actual
shot outcomes, as well as evaluating the expected points based on shot
distance and type.
1. Predicting Shot Outcomes in the 2024
Season:
# Predicting shot outcomes by using the 2024 season data
Pred <- augment(RF_fit, new_data = Boston_2024) %>%
mutate(
# Convert Predicted outcomes to Missed or Made
.pred_class = case_when(.pred_class == 0 ~ "Missed", TRUE ~ "Made"),
# Convert Actual outcomes to Missed or Made
Shot.made = case_when(Shot.made == 0 ~ "Missed", TRUE ~ "Made")
) %>%
rename(
Pred.Outcomes = .pred_class, # Change the name of column (Predicted outcome)
Actual.Outcomes = Shot.made # Change the name of column (Actual outcome)
)
2. Calculating Expected Points within each shooting distance range:
# Predict the Exp points for Boston in each shot distance range
augment(RF_fit, new_data = Boston_2024) %>%
mutate(
# Determine the expected points scored
Exp.Points = case_when(.pred_class == 0 ~ 0,
.pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
.pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
) %>%
# Group by each distance range
group_by(zoneRange) %>%
summarise(
T.Point = sum(Points), # Calculate the Actual total points
T.Exp.Points = sum(Exp.Points) # Calculate the Expected total points
)
| zoneRange | T.Point | T.Exp.Points |
|---|---|---|
| 16-24 ft. | 328 | 40 |
| 24+ ft. | 4050 | 63 |
| 8-16 ft. | 748 | 210 |
| Back Court Shot | 3 | 0 |
| Less Than 8 ft. | 3424 | 3608 |
The table illustrates that the model seems to have difficulty predicting shot outcomes for longer range shootings. We can observe a large underestimation produced by the model in the 24+ ft shooting range.
3. Calculating Expected Points within each shooting type:
# Compare the Exp points for Boston in each shooting type
augment(RF_fit, new_data = Boston_2024) %>%
mutate(
Exp.Points = case_when(.pred_class == 0 ~ 0,
.pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
.pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
) %>%
# Group by each shooting type
group_by(typeAction) %>%
summarise(
T.Point = sum(Points),
T.Exp.Points = sum(Exp.Points)
)
| typeAction | T.Point | T.Exp.Points |
|---|---|---|
| Alley Oop Dunk Shot | 118 | 144 |
| Alley Oop Layup shot | 46 | 68 |
| Cutting Dunk Shot | 210 | 216 |
| Cutting Finger Roll Layup Shot | 16 | 20 |
| Cutting Layup Shot | 246 | 310 |
| Driving Bank Hook Shot | 6 | 6 |
| Driving Dunk Shot | 190 | 218 |
| Driving Finger Roll Layup Shot | 188 | 230 |
| Driving Floating Bank Jump Shot | 52 | 104 |
| Driving Floating Jump Shot | 188 | 4 |
| Driving Hook Shot | 30 | 8 |
| Driving Layup Shot | 722 | 564 |
| Driving Reverse Layup Shot | 60 | 92 |
| Dunk Shot | 56 | 56 |
| Fadeaway Bank shot | 6 | 2 |
| Fadeaway Jump Shot | 133 | 4 |
| Finger Roll Layup Shot | 20 | 14 |
| Floating Jump shot | 53 | 10 |
| Hook Bank Shot | 2 | 4 |
| Hook Shot | 28 | 20 |
| Jump Bank Shot | 25 | 14 |
| Jump Shot | 2770 | 4 |
| Layup Shot | 154 | 134 |
| Pullup Jump shot | 932 | 2 |
| Putback Dunk Shot | 30 | 38 |
| Putback Layup Shot | 148 | 206 |
| Reverse Dunk Shot | 2 | 6 |
| Reverse Layup Shot | 42 | 38 |
| Running Alley Oop Dunk Shot | 42 | 42 |
| Running Alley Oop Layup Shot | 10 | 16 |
| Running Dunk Shot | 184 | 194 |
| Running Finger Roll Layup Shot | 68 | 66 |
| Running Jump Shot | 265 | 20 |
| Running Layup Shot | 302 | 340 |
| Running Pull-Up Jump Shot | 143 | 43 |
| Running Reverse Layup Shot | 12 | 32 |
| Step Back Bank Jump Shot | 5 | 4 |
| Step Back Jump shot | 458 | 68 |
| Tip Dunk Shot | 46 | 50 |
| Tip Layup Shot | 138 | 256 |
| Turnaround Bank Hook Shot | 2 | 2 |
| Turnaround Bank shot | 42 | 48 |
| Turnaround Fadeaway Bank Jump Shot | 2 | 2 |
| Turnaround Fadeaway shot | 167 | 54 |
| Turnaround Hook Shot | 54 | 84 |
| Turnaround Jump Shot | 140 | 64 |
4. Comparing Season Total Points with Total Expected Points:
# Compare the season total points and total Exp points
augment(RF_fit, new_data = Boston_2024) %>%
mutate(
Exp.Points = case_when(.pred_class == 0 ~ 0,
.pred_class == 1 & typeShot == "2PT Field Goal" ~ 2,
.pred_class == 1 & typeShot == "3PT Field Goal" ~ 3)
) %>%
summarise(
T.Point = sum(Points),
T.Exp.Points = sum(Exp.Points)
)
| T.Point | T.Exp.Points |
|---|---|
| 8553 | 3921 |
Based on the results, the model underestimates the total number of points scored by the Boston Celtics in the 2024 season. To improve prediction accuracy, we may need to incorporate additional features, such as player-specific metrics, shot difficulty, or defensive pressure factors.
In this section, we will create the NBA half-court
diagram.
This diagram will serve as a reference for plotting shot locations on
the court.
Note: Feel free to create your own diagram or utilize any pre-made diagram available.
# Assign x and y limits representing the half court dimentions (Width = 50ft, Length = 94ft / 2)
court <- ggplot(data = data.frame(0,0)) +
# Boundary of the half court (End line & Side lines)
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 47)) + # Draw left boundary of the court
geom_segment(aes(x = 50, y = 0, xend = 0, yend = 0)) + # Draw bottom boundary of the court
geom_segment(aes(x = 0, y = 47, xend = 50, yend = 47)) + # Draw top boundary of the court
geom_segment(aes(x = 50, y = 0, xend = 50, yend = 47)) + # Draw right boundary of the court
#Centre Semi-Circle (Radius = 12ft / 2)
geom_curve(aes(x = 19, y = 47, xend = 25, yend = 41), curvature = 0.44) + # Draw left half arc
geom_curve(aes(x = 31, y = 47, xend = 25, yend = 41), curvature = -0.44) + # Draw right half arc
# Restraining Semi-Circle (Radius = 4ft /2)
geom_curve(aes(x = 23, y = 47, xend = 25, yend = 45), curvature = 0.44) + # Draw left half arc
geom_curve(aes(x = 27, y = 47, xend = 25, yend = 45), curvature = -0.44) + # Draw right half arc
# Free-Throw Line - (Width = 16ft, Length = 19ft)
geom_segment(aes(x = 33, y = 0, xend = 33, yend = 19)) + # Draw right free-throw lane
geom_segment(aes(x = 17, y = 0, xend = 17, yend = 19)) + # Draw left free-throw lane
geom_segment(aes(x = 17, y = 19, xend = 33, yend = 19)) + # Draw top free-throw line
# Free-Throw Circle Line (Top Aec: Radius = 12ft / 2)
geom_curve(aes(x = 19, y = 19, xend = 25, yend = 25), curvature = -0.42) + # Draw left half arc
geom_curve(aes(x = 31, y = 19, xend = 25, yend = 25), curvature = 0.42) + # Draw right half arc
# Free-Throw Circle Line (Bottom. Arc: Radius = 12ft / 2)
geom_curve(aes(x = 19, y = 19, xend = 25, yend = 13), curvature = 0.42, linetype = "longdash", linewidth = 0.6) + # Draw left half arc (Dashed line)
geom_curve(aes(x = 31, y = 19, xend = 25, yend = 13), curvature = -0.42, linetype = "longdash", linewidth = 0.6) + # Draw right hald arc (Dashed line)
# Backboard(Width = 6ft, 4ft away from end line)
geom_segment(aes(x = 22, y = 4, xend = 28, yend = 4)) +
geom_segment(aes(x = 25, y = 4, xend = 25, yend = 4.5)) + # Draw line from backboard to hoop (5.25ft - 0.75ft: radius of hoop)
# Hoop (Diameter = 1.5ft, Centre of hoop is 5.25ft away from end line)
geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = 1) + # Draw right half arc
geom_curve(aes(x = 25, y = 4.5, xend = 25, yend = 6), curvature = -1) + # Draw left half arc
# Restricted Area Arc Line (Radius = 4ft, 4ft away from centre of hoop: 5.25ft away from end line)
geom_curve(aes(x = 21, y = 4, xend = 25, yend = 9.25), curvature = -0.46) + # Draw left half arc
geom_curve(aes(x = 29, y = 4, xend = 25, yend = 9.25), curvature = 0.46) + # Draw right half arc
# Three-Point Side Lines (Length = 14ft, 3ft away from each side line)
geom_segment(aes(x = 3, y = 0, xend = 3, yend = 14)) + # Draw left three-point side line
geom_segment(aes(x = 47, y = 0, xend = 47, yend = 14)) + # Draw right three-point side line
# Three-Point Line Arc (Radius = 29ft (5.25ft + 23.75ft))
geom_curve(aes(x = 3, y = 14, xend = 25, yend = 29), curvature = -0.32) + # Draw left three-point arc
geom_curve(aes(x = 47, y = 14, xend = 25, yend = 29), curvature = 0.32) +# Draw right three-point arc
coord_equal() +
# Modify Theme Settings
theme(
# Remove x-axis tittle
axis.title.x = element_blank(),
# Remove y-axis tittle
axis.title.y = element_blank(),
# Remove text on x-axis
axis.text.x = element_blank(),
# Remove ticks on x-axis
axis.ticks.x = element_blank(),
# Remove text on y-axis
axis.text.y = element_blank(),
# Remove ticks on y-axis
axis.ticks.y = element_blank(),
# Remove grids
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
# Change the background color to White
panel.background = element_rect("white")
)
Finally, we will create animated visualizations of the shot outcomes
(Actual vs. Predicted) for a specific NBA game.
Denver
Nuggets vs. Boston Celtics - Mar 7, 2024
1. Create an animated plot for Actual shot outcomes
# Actual plot
Actual_plot <-
# Apply the half-court diagram
court +
# Create a scatter plot for actual shot outcomes
geom_point(data = subset(Pred, idGame == "22300906"), # Subset the data (BOS vs. DEN)
aes(x = X,
y = Y,
fill = Actual.Outcomes), # Fill the color in each point, based on Actual.Outcomes
pch = 21, # Customise the shape (circle)
size = 3, # Customise the size of shape & outer color (black)
color = "black") +
# Customise the color (lightblue = made | red = missed)
scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
# Include the title
labs(title = "Actual shot outcomes") +
# Customise the font size and color for the title
theme(text = element_text(size = 15),
plot.caption = element_text(color = "black")) +
# Produce an animated plot
transition_states(as.factor(idEvent), # Use "idEvent" to represent each frame in animation
transition_length = 5) + # Customise the animation speed
# Keep the previous data points in plot
shadow_mark()
# Animate and export the plot
anim_save("Actual.gif", Actual_plot)
2. Create an animated plot for Predicted shot outcomes
# Exp plot
Exp_plot <-
court +
geom_point(data = subset(Pred, idGame == "22300906"),
aes(x = X, y = Y,
fill = Pred.Outcomes),
pch = 21,
size = 3,
color="black") +
scale_fill_manual(values = c("lightblue", "#cc4b4b")) +
labs(title = "Expected shot outcomes") +
theme(text = element_text(size = 15),
plot.caption = element_text(color = "black")) +
transition_states(as.factor(idEvent),
transition_length = 5) +
shadow_mark()
# Animate and export the plot
anim_save("Exp.gif", Exp_plot)