Introduction

library(pacman)
p_load(dplyr, readr, janitor, data.table, rsample, ggplot2, recipes, parsnip, workflows, tune, yardstick, vip, xgboost, plotly, stringr, kableExtra)
cores = parallel::detectCores()

My idea was to predict how well an NBA team will finish in the postseason using data collected at the all-star break. If built properly, a model like this could be very useful to NBA front offices.

NBA teams have 3 main ways to improve their rosters: the rookie draft, free agency, and the trade market. Most in-season trades are made near the trade deadline, which closely corresponds with the all-star break; both fall in February each year, less than a week apart. When deciding whether your team should make a trade at the deadline, it is crucial to know how good your team is (and will be) if it stays the same, and what your goal is for the current season. This is where a good model could come into play. Many of the best teams must decide each year whether to trade future assets or promising young talent for an extra player who could improve the team this season. If going 'all-in' can help you win a championship, you should do it, but no front office wants to trade away those assets only to realize later that they still aren't going to win it all. That makes it important to evaluate where you stand in the league accurately and without bias, and a good model could help do that.

For example, in 2020-21, the Chicago Bulls weren't a great team, and they missed the playoffs. In the 2021 offseason, they made major upgrades to their roster, and at the all-star break of the 2021-22 season, they were 4th in the NBA in Win%. A major topic in sports media was whether the Bulls should trade the young Patrick Williams, viewed by many as a potential superstar for years to come, for a player who could help them win a championship this season (Jerami Grant was a popular name). To make an informed decision, the Bulls front office needs to know accurately where the team stands. If a good model told them they are still fairly far from a championship as constructed, they might conclude that the deal doesn't improve the team enough to make them real contenders. If, however, they believe they are right on the brink, then trading for a player like Grant might be what pushes them over the finish line.

Of course, a front office could simply use the win-loss standings at the all-star break to estimate where it stands, but I wanted to find other important predictors that identify teams that are over- or under-valued relative to their standing.

Lastly, this model doesn't only apply to teams at the top chasing a championship. You could work for a team whose owner has demanded a playoff berth this season. If the model says you are right on the fringe of the playoffs, maybe you should try to improve your roster (or risk getting fired)!

Data

Data Preparation

The data I used (mostly) comes from the official NBA Stats website. I collected traditional, advanced, hustle, and clutch stats for all 30 teams for each season going back to 2003-04 (19 seasons of data, including the current season, which has no outcome yet). I manually entered each team-season's specific result, then grouped those specific results into 5 classes, which I used as my outcome variable. The 5 classes are "High Lottery Odds", "Low Lottery Odds", "Fringe Playoff Team", "Solid Playoff Team", and "Championship Contender".

#Assigns result classes to each observation using the specific results from the data
good_nba_data = good_nba_data %>%
  mutate(result_class = case_when(
    result %in% paste0(c('1st','2nd','3rd','4th','5th','6th','7th'), ' Pick') ~ 'High Lottery Odds',
    result %in% paste0(c('8th','9th','10th','11th','12th','13th','14th'), ' Pick') ~ 'Low Lottery Odds',
    result == 'Made Playoffs' ~ 'Fringe Playoff Team',
    result == 'Made 2nd Round' ~ 'Solid Playoff Team',
    result %in% c('Made Conference Finals', 'Made Finals', 'Champions') ~ 'Championship Contender')) #unmatched results stay NA, as before

I removed highly correlated variables that carried redundant information; for example, Wins, Losses, and Win% all convey the same thing, so I don't need all three in my data set.
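As an illustration only, a pruning step along these lines would do it (wins and losses are hypothetical column names here, and step_corr() shows how a recipe could automate the same idea):

#Sketch only: drop columns that duplicate information already carried by win_percent
good_nba_data = good_nba_data %>% select(-wins, -losses)
#Alternatively, inside a recipe: step_corr(all_numeric_predictors(), threshold = 0.9)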

I now have 96 predictors in my data. However, some stats are not on a consistent scale from year to year; for example, the 26th-ranked team in PPG in 2021-22 scored more points per game than the 1st-ranked team in 2003-04. Because of this, I want a column giving each team's rank in each stat within the season it played: were you the best team that year, the 18th best, or the worst?

#creates the new Rank columns

for (c in 1:96) { #loop over every predictor column (columns 4 through 99)
  col = c+3
  newcol = c()
  oldcol_name = names(good_nba_data)[col]
  newcol_name = paste(oldcol_name, 'Rk', sep = '_') #what to name the new rank column
  for (s in 1:19) { #rank teams within each season separately
    sea = unique(good_nba_data$season)[s]
    a = good_nba_data %>% filter(season==sea)
    ord = sort.list(desc(a[[col]])) #row positions of that season's teams, ordered best to worst
    ranks = rep(0, length(ord)) #rank of each team for the specified season
    for (o in 1:length(ord)){ #the team in position ord[o] gets rank o
      ranks[ord[o]] = o}
    newcol = append(newcol, ranks)} #stack the seasons back into one column covering all teams
  good_nba_data$new = newcol #adds the new column into the data set
  names(good_nba_data)[names(good_nba_data) == 'new'] = newcol_name} #renames the new column with the new name

However, the rank columns give no way to tell how far apart neighboring teams are. Is the top-scoring team in first by a mile, or neck and neck with second place? Ranks are ordinal and throw that magnitude information away. So I will also add a set of columns containing a z-score of each predictor, computed within each season. This still solves the problem of stats having different means from season to season, while preserving how far apart the teams actually are.

#creates the new z-score columns

zs = data.frame()
for (s in 1:length(unique(good_nba_data$season))) { 
  sea = unique(good_nba_data$season)[s] 
  a = good_nba_data %>% filter(season==sea) #creates a data frame for each season
  b = sapply(a[4:99], scale) #creates a z score for every column in that season
  zs = rbind(zs, b)} #binds all of the seasons into one data frame
names(zs) = paste(names(zs), 'Z', sep = '_') #adds a '_Z' suffix to every transformed column name

good_nba_data = cbind(good_nba_data, zs) #adds my Z columns into the data set

I now have 3 different versions of the same set of data. I have the original version, a ranked version, and a z-score version. I will use all three sets of data in my model, individually and together, to see which sets have the most predictive power.
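Since the rank and z-score columns follow the naming convention created above, the three sets can be pulled apart by suffix. A quick sketch (the ID column names are assumptions about the data's layout):

#Sketch: separate the three predictor sets by column-name suffix
id_cols  = c('season', 'team', 'result', 'result_class')
rank_set = good_nba_data %>% select(all_of(id_cols), ends_with('_Rk'))
z_set    = good_nba_data %>% select(all_of(id_cols), ends_with('_Z'))
raw_set  = good_nba_data %>% select(-ends_with('_Rk'), -ends_with('_Z'))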

Sampling

set.seed(521021) #set seed for reproducibility; ran runif() to pick a seed

The 2021-22 observations have no result yet and therefore cannot be used in supervised learning, so I removed them from the training data. I then assigned the remaining observations to stratified cross-validation folds by season. I decided on 5-fold cross-validation for simplicity and to limit computing time.
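The filtering itself isn't shown in the chunk below; a minimal sketch, assuming the current season is labeled '2021-22' in the season column (nba_data_2022 is just an illustrative name for the held-out rows):

#Sketch: split off the 2021-22 rows, which have no result yet
nba_data      = good_nba_data %>% filter(season != '2021-22')
nba_data_2022 = good_nba_data %>% filter(season == '2021-22') #kept aside for predictions later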

cv_data = nba_data %>% vfold_cv(v=5, strata = season, repeats = 1) #assign observations to cv folds

Models

I will run 4 different types of models: Random Forest, a Boosting ensemble, K-Nearest Neighbors, and Multinomial Regression.

I started off by running 7 random forest models, each using a different combination of the three sets of predictors. Each model was cross-validated, and accuracy is reported at its best combination of tuning parameters.

As you can see from the graph, they all produced very similar accuracy, meaning the sets carry essentially the same information. From here on, I will train all of my models with all 3 sets of predictors.

After training the random forest on all predictors, I plotted variable importance for the model that used all 3 sets of predictors.

Looking at the 25 most important variables shown in the plot, many of them are different versions of the same variable (versions of win_percent occupy the top 3 spots!). I was hoping to find variables that explain variation in postseason finish beyond just Win%, but this plot tells me I haven't really found many. One hypothesis for why is that the older data in this set is not very representative of the modern NBA game; what led to a championship in 2004 doesn't necessarily lead to one in 2022. So as I train all 4 model types, I will train each one twice: once with all of the data, and once with the 3 oldest seasons removed from the sample.
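For reference, an importance plot like this can be produced from the tuned results. A sketch using the wf1 and fit_rf1 objects defined in the Random Forest section below (the importance = 'impurity' engine argument set there is what makes vip() work):

#Sketch: refit the workflow at its best tuning values, then plot the 25 most important variables
best_rf  = select_best(fit_rf1, metric = 'accuracy')
final_rf = wf1 %>% finalize_workflow(best_rf) %>% fit(data = nba_data)
final_rf %>% extract_fit_parsnip() %>% vip(num_features = 25)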

Models

Each model's accuracy and optimal parameters are shown below.

Random Forest

Random Forest With All Predictors
rec1 = recipe(result_class ~ ., data = nba_data) %>%
  update_role(season, team, result, new_role = 'ID')

mod_rf1 = rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
  set_engine('ranger', num.threads = cores, importance = 'impurity') %>%
  set_mode('classification')

wf1 = workflow() %>% add_model(mod_rf1) %>% add_recipe(rec1)

fit_rf1 = wf1 %>% tune_grid(cv_data,
                            grid = expand.grid(mtry = c(5, 10, 15, 25, 40, 60),
                                               min_n = c(10, 30, 50, 70, 100),
                                               trees = c(100, 250, 500)),
                            metrics = metric_set(accuracy, f_meas, roc_auc))

fit_rf1 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
mtry trees min_n .metric .estimator mean n std_err .config
40 250 30 accuracy multiclass 0.6214953 5 0.0160195 Preprocessor1_Model41
Random Forest With Oldest 3 Seasons Removed
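The limited data set and its folds aren't defined in the chunks shown; presumably something along these lines (a sketch, assuming the season labels sort chronologically):

#Sketch: drop the 3 oldest seasons and rebuild the cross-validation folds
oldest3 = sort(unique(nba_data$season))[1:3]
nba_data_limited = nba_data %>% filter(!season %in% oldest3)
cv_data_limited  = nba_data_limited %>% vfold_cv(v = 5, strata = season, repeats = 1)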
rec2 = recipe(result_class ~ ., data = nba_data_limited) %>%
  update_role(season, team, result, new_role = 'ID')

mod_rf2 = rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
  set_engine('ranger', num.threads = cores, importance = 'impurity') %>%
  set_mode('classification')

wf2 = workflow() %>% add_model(mod_rf2) %>% add_recipe(rec2)

fit_rf2 = wf2 %>% tune_grid(cv_data_limited,
                            grid = expand.grid(mtry = c(5, 10, 15, 25, 40, 60),
                                               min_n = c(10, 30, 50, 70, 100),
                                               trees = c(100, 250, 500)),
                            metrics = metric_set(accuracy, f_meas, roc_auc))

fit_rf2 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
mtry trees min_n .metric .estimator mean n std_err .config
40 100 70 accuracy multiclass 0.6355556 5 0.0266203 Preprocessor1_Model23

Boosting

Boosting Ensemble With All Predictors
mod_boost1 = boost_tree(tree_depth = tune(), learn_rate = tune(), trees = tune()) %>%
  set_engine('xgboost', nthread = cores) %>% #xgboost takes nthread (not num.threads) for parallelism
  set_mode('classification')

wf3 = workflow() %>% add_model(mod_boost1) %>% add_recipe(rec1) #rec1 uses all nba data

fit_boost1 = wf3 %>% tune_grid(cv_data,
                               grid = expand.grid(tree_depth = c(1, 2, 3, 5, 10),
                                                  learn_rate = seq(0, 1, length.out=5),
                                                  trees = c(100, 250, 500)),
                               metrics = metric_set(accuracy, f_meas, roc_auc))

fit_boost1 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
trees tree_depth learn_rate .metric .estimator mean n std_err .config
100 1 0.25 accuracy multiclass 0.5808238 5 0.0200508 Preprocessor1_Model04
Boosting Ensemble With Oldest 3 Seasons Removed
wf4 = workflow() %>% add_model(mod_boost1) %>% add_recipe(rec2) #rec2 uses limited nba data

fit_boost2 = wf4 %>% tune_grid(cv_data_limited,
                               grid = expand.grid(tree_depth = c(1, 2, 3, 5, 10),
                                                  learn_rate = seq(0, 1, length.out=5),
                                                  trees = c(100, 250, 500)),
                               metrics = metric_set(accuracy, f_meas, roc_auc))

fit_boost2 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
trees tree_depth learn_rate .metric .estimator mean n std_err .config
100 1 0.25 accuracy multiclass 0.5888889 5 0.0213726 Preprocessor1_Model04

KNN

KNN With All Predictors
mod_knn1 = nearest_neighbor(neighbors = tune()) %>%
  set_engine('kknn') %>%
  set_mode('classification')

wf5 = workflow() %>% add_model(mod_knn1) %>% add_recipe(rec1) #rec1 uses all nba data

fit_knn1 = wf5 %>% tune_grid(cv_data,
                             grid = expand.grid(neighbors = seq(1,200)),
                             metrics = metric_set(accuracy, f_meas, roc_auc))


fit_knn1 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
neighbors .metric .estimator mean n std_err .config
156 accuracy multiclass 0.5156802 5 0.02451 Preprocessor1_Model156
KNN With Oldest 3 Seasons Removed
wf6 = workflow() %>% add_model(mod_knn1) %>% add_recipe(rec2) #reuses the same KNN spec; rec2 uses limited nba data

fit_knn2 = wf6 %>% tune_grid(cv_data_limited,
                             grid = expand.grid(neighbors = seq(1,200)),
                             metrics = metric_set(accuracy, f_meas, roc_auc))


fit_knn2 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
neighbors .metric .estimator mean n std_err .config
99 accuracy multiclass 0.5511111 5 0.0155556 Preprocessor1_Model099

Multinomial Regression

Multinomial Regression With All Predictors
mod_mnr1 = multinom_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

wf7 = workflow() %>% add_model(mod_mnr1) %>% add_recipe(rec1) #rec1 uses all nba data

fit_mnr1 = wf7 %>% tune_grid(cv_data,
                             grid = expand.grid(penalty = 10^seq(-7,3, length.out=8),
                                                mixture = seq(0, 1, length.out=8)),
                             metrics = metric_set(accuracy, f_meas, roc_auc))

fit_mnr1 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
penalty mixture .metric .estimator mean n std_err .config
0.0517947 0.4285714 accuracy multiclass 0.6010211 5 0.0190488 Preprocessor1_Model29
Multinomial Regression With Oldest 3 Seasons Removed
wf8 = workflow() %>% add_model(mod_mnr1) %>% add_recipe(rec2) #rec2 uses limited nba data

fit_mnr2 = wf8 %>% tune_grid(cv_data_limited,
                             grid = expand.grid(penalty = 10^seq(-7,3, length.out=8),
                                                mixture = seq(0, 1, length.out=8)),
                             metrics = metric_set(accuracy, f_meas, roc_auc))

fit_mnr2 %>% show_best(metric = 'accuracy', n=1) %>% kbl() %>% kable_styling()
penalty mixture .metric .estimator mean n std_err .config
0.0517947 0.2857143 accuracy multiclass 0.6177778 5 0.0259629 Preprocessor1_Model21

Model Results

This is an exciting result: every model type improved in accuracy when some of the old data was removed. I want to take this further and see how much accuracy can improve depending on how many years of older data we remove. From here on I will use only the random forest, since it is my best-performing model and lets me track variable importance.
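Something along these lines re-tunes the random forest after dropping the k oldest seasons; the object names and the reduced tuning grid here are assumptions for illustration, not the exact code I ran:

#Sketch: track the best CV accuracy as the k oldest seasons are removed
seasons_sorted = sort(unique(nba_data$season))
acc_by_k = numeric()
for (k in 0:8) {
  dat = nba_data %>% filter(!season %in% seasons_sorted[seq_len(k)])
  folds = vfold_cv(dat, v = 5, strata = season)
  fit_k = wf1 %>% tune_grid(folds,
                            grid = expand.grid(mtry = c(25, 40, 60), min_n = c(30, 70), trees = 250),
                            metrics = metric_set(accuracy))
  acc_by_k[k + 1] = show_best(fit_k, metric = 'accuracy', n = 1)$mean} #best accuracy with k seasons removed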

It turns out that removing the 4 oldest seasons gives the best model obtainable by trimming the sample; removing any more decreases accuracy. The variable importance in the model with 4 seasons removed looks noticeably different from the full-sample model: win_percent becomes less important relative to the other predictors, and a few additional predictors show up as important. However, the changes are not large enough to get excited about, as I still haven't found much beyond Win% that is a solid predictor of postseason success, which is what I was hoping for.

Potential Improvements

When discussing the shortcomings of my model, the first thing to mention is that sports inherently involve a large amount of randomness, which no model can explain. That randomness will always keep the model from being perfect.

nba_data_limited$teamseason = paste(nba_data_limited$season, nba_data_limited$team, sep = ' ')
plot_ly(nba_data_limited[sample(nrow(nba_data_limited), 120), ], #sample rows from the limited data itself
              x = ~drtg_Z, y = ~ts_percent_Z, z = ~win_percent_Z, color = ~result_class,
              text = ~teamseason, hoverinfo = 'text') %>%
  layout(scene = list(xaxis = list(title = 'Defensive Rating Z-score'),
                      yaxis = list(title = 'True Shooting % Z-score'),
                      zaxis = list(title = 'Win % Z-score')))

This interactive scatter plot shows a random sample of 120 teams from the data set, plotted on 3 of the important predictors. As you can see, the result classes overlap significantly, which makes this a tough prediction problem that cannot be solved perfectly.

With that in mind, I do believe there is still unexplained variation in my model that could be explained with other data.

My data set had many predictors, but most of them turned out to be fairly useless. Still, I think there are obtainable predictors that could improve model performance. First, injury data from before the all-star break could be useful. I found websites that report how much salary (in cap dollars) was lost to games missed through injury, but it wasn't broken down into pre- and post-all-star break, so I was unable to use it. A predictor like this could help because if a team was badly injured before the break, it's reasonable to expect it to regress back toward the mean after the break and perform better with a healthier roster. Post-break and playoff injuries, however, could not be captured by any predictor in my model and would remain unexplainable error.

Other ideas I had for potentially useful predictors are salary level, team age, and performance in the last 15 games leading up to the all-star break.

My best-performing model was a random forest that used 14 seasons of data (the 4 oldest years removed). After tuning 3 hyperparameters (mtry = 60, trees = 250, min_n = 70), the resulting CV accuracy was 63.69%.

When looking at how well my model does at predicting each class, one thing jumps out right away.

The model has an overall accuracy of 63.69% (represented by the horizontal line), but as you can see, this is weighed down by one class in particular. Four of the five classes have a sensitivity over 80%, while the sensitivity for the 'Solid Playoff Team' class is only 28.57%. A 'Solid Playoff Team' is defined as a team that makes the playoffs and does not lose in the first round (first-round losers are 'Fringe Playoff Teams'), while 'Championship Contenders' are teams that make the Conference Finals (3rd round) or better. That means true 'Solid Playoff Teams' are the teams that lose in the 2nd round, and there are exactly 4 of them every year. Of the 'Solid Playoff Teams' that are misclassified, most are predicted to be 'Fringe Playoff Teams', followed by 'Championship Contenders'. While it's good that the errors fall into the classes directly adjacent to the correct one, the model is still giving a team either false hope or false doubt. This suggests that teams in the 'Solid Playoff Team' class aren't very distinguishable from the classes on either side of it. I could drop this class and use only 4 classes, but I think it's an important distinction to try to make.
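Per-class sensitivity like this can be read off a confusion matrix; a sketch, where preds is assumed to be a data frame of cross-validated predictions holding the true result_class and the predicted .pred_class:

#Sketch: confusion matrix, then correct predictions divided by each class's true count
cm = conf_mat(preds, truth = result_class, estimate = .pred_class)
diag(cm$table) / colSums(cm$table) #sensitivity (recall) for each class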

My model defines 'Championship Contenders' as the last 4 teams standing in the playoffs. Each year a different number of teams are truly good enough to compete for a championship; some years it's more than 4, but at least in recent seasons I would argue it's fewer. So I am going to redefine 'Championship Contenders' as only the last 2 teams remaining in the playoffs, and the two teams per year removed from that class will be placed in the 'Solid Playoff Team' class (a sketch of the relabeling follows). With more data points in that class, maybe its sensitivity and the overall model accuracy can improve.
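The relabeling amounts to moving conference-finals losers down one class; a sketch, assuming result_class is still stored as character (if it has already been converted to a factor, the levels would need updating), with nba_data_redefined as an illustrative name:

#Sketch: teams that lost in the Conference Finals become 'Solid Playoff Team'
nba_data_redefined = nba_data %>%
  mutate(result_class = if_else(result == 'Made Conference Finals',
                                'Solid Playoff Team', result_class))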

As you can see, this did help improve sensitivity in the ‘Solid Playoff Team’ class, but did so at the expense of sensitivity in the ‘Championship Contender’ class and overall model accuracy. This result is not all that shocking.

There is one more thing I wanted to try to improve this. When looking through each team’s probability distribution between classes, which you can see on the next page, the Heat and Bucks charts both stood out to me. I’ll include them below.

Both teams are predicted to be Fringe Playoff Teams, yet each is given a higher probability of 'Championship Contender' than of 'Solid Playoff Team'. When both 'Championship Contender' and 'Fringe Playoff Team' receive such high probabilities, we might suspect the model is systematically underestimating that team's chance of being a 'Solid Playoff Team'.

Because the model rarely, and poorly, predicts Solid Playoff Teams, we could conditionally force a prediction into that class whenever the probabilities of the classes on either side of it are both sufficiently high, as sketched below. I believe this would increase sensitivity in the Solid Playoff Team class, as well as overall model accuracy.
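A sketch of such a rule; the preds data frame, the .pred_ probability column names, and the 0.25 thresholds are all illustrative assumptions rather than the exact rule I used:

#Sketch: when both neighboring classes get high probability, override the prediction
preds_adj = preds %>%
  mutate(.pred_class = factor(
    if_else(`.pred_Championship Contender` >= 0.25 & `.pred_Fringe Playoff Team` >= 0.25,
            'Solid Playoff Team',
            as.character(.pred_class)),
    levels = levels(.pred_class)))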

After creating a conditional rule that forces some predictions into the Solid Playoff Team class based on the probabilities our model assigns to the Championship Contender, Solid Playoff, and Fringe Playoff classes, sensitivity for that class rises greatly and overall accuracy increases to 75.47%! This is a very encouraging result.

2021-22 Predictions

I already discussed the shortcomings of my model, and because of them it wouldn't be smart to make real decisions with it. Still, to give an example of how the model could be useful, in this section I will pretend the model has been improved and its results are reliable. (The model used in this section is my best-performing random forest, with the 3 oldest seasons removed from the sample and results classified the original way, before I altered them.)

In the introduction, I talked about how the Chicago Bulls could use a model like this to help them with their decision making this season. As a reminder, there was lots of talk about them potentially trading away their young player Patrick Williams for somebody who could help them win a title this year at the expense of future seasons.

The model classifies the Bulls as a Fringe Playoff Team and identifies 4 Championship Contenders. Even though the Bulls are 4th in Win% at the all-star break, the model says they are overvalued. If the model could be relied upon as accurate, this would lead the Bulls to believe that even after trading for an additional star, they would likely still not be Championship Contenders, so giving up a young player they value very highly would not be a good decision.

Below are each team's model results for the 2021-22 season.

[Interactive per-team prediction charts not reproduced here: a 'Top Championship Contenders' view followed by one chart per franchise, from the Atlanta Hawks through the Washington Wizards.]