Summary of Approach to Recommending Players for Acquisition

This site showcases my data-driven approach to recommending players for acquisition by an NBA team. I use a two-stage approach that combines unsupervised machine learning (clustering) with a supervised machine learning regression model to predict which high-performing players are underpaid and are therefore ideal targets for acquisition. The dataset covers 401 NBA players from the 2020-2021 season and includes the following information and stats:

  • Player
  • Position
  • Age
  • Tm (Team)
  • G (Number of Games Played in)
  • GS (Number of Games Player Has Started in)
  • MP (Minutes Played)
  • FG (Field Goals)
  • FGA (Field Goals Attempted)
  • FG. (Field Goal Percentage)
  • X3P (Three point baskets)
  • X3PA (Three point shot attempts)
  • X3P. (Three point shot percentage)
  • X2P (Two point baskets)
  • X2PA (Two point shot attempts)
  • X2P. (Two point shot percentage)
  • eFG. (effective field goal percentage)
  • FT (Free Throws)
  • FTA (Free Throws Attempted)
  • FT. (Free Throw Percentage)
  • ORB (Offensive Rebounds)
  • DRB (Defensive Rebounds)
  • TRB (Total Rebounds)
  • AST (Assists)
  • STL (Steals)
  • BLK (Blocks)
  • TOV (Turnovers)
  • PF (Personal Fouls)
  • PTS (Points)
  • 2020-2021 (Player’s Salary in 2020-2021 Season)

Data Preparation and Variable Selection

For this dataset, I removed players who had incomplete stat reports. Dropping players with NA values in any field took 36 players out of consideration. Because 401 of the original 437 players (about 92%) remained available to inform the models and visualizations and to be considered as acquisition candidates, this filter did not materially reduce the usefulness of the dataset.
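The filtering step above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the three-player roster and the `drop_incomplete` helper are hypothetical, and only the counts 401 and 437 come from the text.

```python
# Hypothetical mini-roster; None stands in for an NA value in the real data.
players = [
    {"Player": "Player A", "PTS": 25.3, "AST": 9.4, "FT.": 0.886},
    {"Player": "Player B", "PTS": 2.1, "AST": 0.3, "FT.": None},  # missing percentage
    {"Player": "Player C", "PTS": 14.0, "AST": 2.2, "FT.": 0.799},
]


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]


complete = drop_incomplete(players)
print(len(players) - len(complete), "player(s) removed")

# Share of the real dataset kept by the same kind of filter (counts from the text).
kept_share = 401 / 437
print(f"{kept_share:.1%} of players retained")
```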

When selecting variables for the models, I removed those that would not add value or could not be processed, such as the player's name (Player) and team (Tm). I also removed the made-shot and attempted-shot columns, since the shooting percentage stats already capture that information. For the initial clustering model, position also had to be removed because the k-means clustering approach that I employed cannot process categorical data.

Unsupervised Machine Learning: K-means Clustering

With the data cleaned and prepared, I first fit a k-means clustering model. Based on the features (variables) under consideration, k-means assigns each player to a cluster so that similar players are grouped together. This is valuable for the supervised stage: each player's cluster assignment can be used as a new feature that may be associated with salary and help the model make more accurate predictions.
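To make the clustering step concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written from scratch. It is not the implementation used for this analysis: the six stat lines, the choice of k = 2, and the function names are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [nearest(point, centroids) for point in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical per-game (PTS, AST, BLK) lines: three high-usage and three low-usage players.
points = [
    (25.3, 9.4, 0.2), (26.4, 5.2, 0.3), (23.7, 5.9, 0.7),
    (4.1, 0.8, 0.3), (5.0, 1.1, 0.5), (3.2, 0.5, 0.1),
]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The resulting `labels` list is what would be attached to each player as the extra cluster feature for the regression stage.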

Using a Correlogram to Identify the Best Predictors of a Player’s Salary

At this point, I also created a correlogram showing which of a player’s stats are most correlated with their salary. It suggested that Assists (AST), Points (PTS), and Blocks (BLK) were the three variables most strongly correlated with salary. As a result, I selected these three variables to plot players and their salaries in a 3D visualization.
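A correlogram is a picture of pairwise correlations, so the ranking it supports can be sketched numerically. The example below computes the Pearson correlation of each stat with salary and sorts by strength. All values are hypothetical (salaries in millions of dollars) and the `pearson` helper is written out only for illustration; it does not reproduce the figures from this analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data for five players.
salary = [2.0, 5.0, 9.0, 20.0, 30.0]
stats = {
    "PTS": [4, 8, 12, 22, 28],
    "AST": [1, 2, 2, 6, 5],
    "PF": [2.5, 1.8, 2.9, 2.0, 2.4],
}

# Rank stats by the absolute strength of their correlation with salary.
ranking = sorted(
    ((name, pearson(values, salary)) for name, values in stats.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```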

3D Visualization of Assists, Points, Blocks and Salary

Because these three variables are the best individual predictors of a player’s salary, I graphed them expecting the players with the best stats across them to be the ones earning the highest salaries. However, I also expected to find players who perform well across these three stats but are compensated significantly less than players of similar caliber; these would be acquisition targets worth further consideration. To make salary discrepancies visible, I mapped each player’s salary to the size of their plotted point, so that a small circle among much larger circles marks a player who is paid significantly less than comparable players.

This visualization drew my attention to several players, specifically Trae Young, Donovan Mitchell, De'Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball.

Developing Supervised Machine Learning Regression Model to Predict Salary

After examining this visualization, and equipped with the cluster assignments from my initial k-means clustering, I implemented a supervised machine learning regression approach to further examine the relationship between performance and compensation and to predict which players would be the most cost-efficient to acquire. This model considers all of a player’s stats along with the cluster they were assigned to by the earlier k-means model.

Evaluating Supervised Machine Learning Regression Model

I developed a few machine learning regression models and ultimately chose to proceed with a generalized linear regression model. This model produced the following metrics:

##         RMSE     Rsquared          MAE 
## 2.535178e+06 9.337872e-01 1.881484e+06

  • RMSE = 2535178 RMSE is the square root of the average squared error, where the error is the difference between a player’s actual salary and the salary that our model predicted. This means that the typical gap between actual and predicted salary is approximately $2,535,178, with large misses weighted more heavily than small ones.
  • Rsquared = 0.934 This Rsquared value means that 93.4% of the variance in salary can be explained by the independent variables considered. As a perfect Rsquared value is 1.00, this means that the model is performing well in predicting salaries.
  • MAE = 1881484 MAE (mean absolute error) is the average size of the error without the extra weight on large misses. According to MAE, this model is off by $1,881,484 on average.
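The three metrics above can be written out in a few lines. This is a sketch on hypothetical salaries (in millions of dollars), not a reproduction of the reported values. Note one assumption: R-squared is computed here as 1 minus the ratio of residual to total sum of squares, and the tooling used for the analysis may define it differently (for example, as a squared correlation).

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted salaries, in millions of dollars.
actual = [5.0, 10.0, 20.0]
predicted = [6.0, 9.0, 22.0]

print(f"RMSE = {rmse(actual, predicted):.3f}")
print(f"MAE  = {mae(actual, predicted):.3f}")
print(f"R^2  = {r_squared(actual, predicted):.3f}")
```

RMSE is never smaller than MAE on the same errors, which is why the two figures reported for the model differ.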

With these metrics showing that the model performs well, I used it to predict what each player’s salary should be based on their stats. By subtracting a player’s actual salary from their predicted salary, I created a column (pred_vs_obs_residual) that can be sorted and filtered to identify the players who are most underpaid according to the model.
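The residual column can be sketched as below, using the predicted and actual salaries that this report quotes for three players in the final analysis. The list-of-dicts layout is an assumption for illustration; only the column name pred_vs_obs_residual and the dollar figures come from the report.

```python
# (player, predicted salary, actual 2020-2021 salary), figures quoted in this report.
rows = [
    ("Trae Young", 10_723_726, 6_571_800),
    ("Donovan Mitchell", 8_092_526, 5_195_501),
    ("Shai Gilgeous-Alexander", 7_311_061, 4_141_320),
]

# A positive residual means the model thinks the player is paid less than their stats suggest.
table = [
    {"Player": name, "pred_vs_obs_residual": predicted - actual}
    for name, predicted, actual in rows
]

# Sort so the most underpaid players come first.
table.sort(key=lambda row: row["pred_vs_obs_residual"], reverse=True)

for row in table:
    print(f"{row['Player']}: ${row['pred_vs_obs_residual']:,}")
```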

Final Analysis

From the 3D visualization, I became interested in Trae Young, Donovan Mitchell, De'Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball, as they had high performance markers and appeared to be significantly underpaid relative to their peers. With interest in these players established, I then looked at the salary predictions from my supervised ML regression model to see which of them would be the most cost-effective to acquire.

  • Trae Young Predicted - Actual Salary = $4,151,926
  • Donovan Mitchell Predicted - Actual Salary = $2,897,025
  • De'Aaron Fox Predicted - Actual Salary = $1,135,911
  • Bam Adebayo Predicted - Actual Salary = $2,083,290
  • Shai Gilgeous-Alexander Predicted - Actual Salary = $3,169,741
  • LaMelo Ball Predicted - Actual Salary = -$193,005
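One way to read the residuals listed above is to compare each against the model's typical error. The sketch below uses the RMSE and MAE values from the printed evaluation output and keeps only the players whose residual exceeds each threshold. Treating "residual greater than RMSE" as the cutoff is an assumption made for illustration, not a rule stated in the analysis.

```python
# Error metrics from the printed evaluation output, in dollars.
RMSE = 2_535_178
MAE = 1_881_484

# Predicted minus actual salary for the six players of interest.
residuals = {
    "Trae Young": 4_151_926,
    "Donovan Mitchell": 2_897_025,
    "De'Aaron Fox": 1_135_911,
    "Bam Adebayo": 2_083_290,
    "Shai Gilgeous-Alexander": 3_169_741,
    "LaMelo Ball": -193_005,
}


def exceeds(threshold):
    """Players whose residual is larger than the given error threshold."""
    return [name for name, residual in residuals.items() if residual > threshold]


beyond_rmse = exceeds(RMSE)
beyond_mae = exceeds(MAE)

print("Underpaid beyond RMSE:", beyond_rmse)
print("Underpaid beyond MAE: ", beyond_mae)
```

Under the stricter RMSE threshold, three players remain; Bam Adebayo clears only the MAE threshold.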

Players to target for acquisition: Trae Young, Donovan Mitchell, and Shai Gilgeous-Alexander

Trae Young

Our generalized linear model predicted that Trae Young would earn $10,723,726, but during the 2020-2021 season he was paid only $6,571,800. The model’s error metrics indicate that the average prediction error is approximately $2.5 million (RMSE) or $1.9 million (MAE). Even if the model is off by that average error, Trae Young is still significantly underpaid, which makes him a great candidate for acquiring a high-caliber player at a lower cost.

Donovan Mitchell

Looking at the 3D visualization, Donovan Mitchell is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Donovan Mitchell would earn $8,092,526, but during the 2020-2021 season he was paid only $5,195,501.

Shai Gilgeous-Alexander

Looking at the 3D visualization, Shai Gilgeous-Alexander is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Shai Gilgeous-Alexander would earn $7,311,061, but during the 2020-2021 season he was paid only $4,141,320.