This site showcases my data-driven approach to recommending players for acquisition by an NBA team. I use a two-stage approach that combines unsupervised clustering with a supervised regression model to predict which high-performing players are underpaid and thus ideal targets for acquisition. The dataset covers 401 NBA players from the 2020-2021 season and includes the following information and stats:
For this dataset, I removed players who had incomplete stat reports. Dropping players with NA values in any field took 36 players out of consideration. Because 401 of the original 437 players (about 92%) remained available to inform the models and visualizations and to be considered as acquisition candidates, removing the incomplete records did not meaningfully weaken the dataset.
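The filtering step can be sketched as follows. This is a minimal illustration in Python, not the code behind this analysis: the records and field names (`player`, `pts`, `ast`, `salary`) are made up, and `None` stands in for an NA value.

```python
# Toy records; names and values are illustrative, not from the real dataset.
players = [
    {"player": "A", "pts": 21.3, "ast": 5.1, "salary": 8_000_000},
    {"player": "B", "pts": None, "ast": 2.4, "salary": 1_500_000},
    {"player": "C", "pts": 9.8, "ast": 1.2, "salary": None},
    {"player": "D", "pts": 14.0, "ast": 7.7, "salary": 4_200_000},
]


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_incomplete(players)
removed = len(players) - len(complete)
print(f"kept {len(complete)} of {len(players)} players, removed {removed}")
```

Applied to the real data, the same kind of filter is what reduces 437 players to 401.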
As I selected the variables for the models, I removed those that would not add value or could not be processed, such as the names of the players (Player) and of the teams (Tm). I also removed the columns for made shots and attempted shots, since the shooting percentage stats already capture that information. For the initial clustering model, position had to be removed as well, because the k-means clustering approach that I employed cannot process categorical data.
With the data cleaned and prepared, the first thing I did was fit a k-means clustering model. Based on the features (variables) under consideration, k-means assigns each player to a cluster so that similar players are grouped together. This is valuable for the supervised machine learning stage: the cluster that each player is assigned to can be used as a new feature that may be associated with salary and help the model make more accurate predictions.
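To make the idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written in plain Python. It is not the implementation used for this analysis; the six stat lines (points, assists, blocks per game) are invented, and `kmeans`, `stats`, and `with_cluster` are hypothetical names. On real data, features measured on different scales would usually be standardized first.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Invented (PTS, AST, BLK) lines forming two obvious groups.
stats = [
    (25.0, 8.0, 0.3), (27.0, 9.0, 0.2), (24.0, 7.5, 0.4),
    (8.0, 1.0, 1.8), (7.0, 1.5, 2.1), (9.0, 0.8, 1.6),
]
labels, centroids = kmeans(stats, k=2)

# Append the cluster label as an extra feature for a downstream model.
with_cluster = [row + (lab,) for row, lab in zip(stats, labels)]
print(labels)
```

The last step mirrors the use described above: the cluster label becomes one more column that the regression model can learn from.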
At this point, I also created a correlogram showing which of a player’s stats are most correlated with their salary. The correlogram suggested that Assists (AST), Points (PTS), and Blocks (BLK) were the three variables most strongly correlated with salary. As a result, these were the three variables that I selected for a 3D visualization of players and their salaries.
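A correlogram is built from pairwise correlation coefficients. The sketch below ranks a few stat columns by the strength of their Pearson correlation with salary. The numbers are invented for illustration (salary in millions), the `TOV` column is a hypothetical extra, and the ranking it produces says nothing about the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented per-game stats for five players.
columns = {
    "PTS": [25.0, 8.0, 15.0, 20.0, 5.0],
    "AST": [9.0, 1.0, 4.0, 6.0, 2.0],
    "TOV": [3.0, 1.0, 2.5, 1.5, 2.0],
}
salary = [30.0, 2.0, 12.0, 20.0, 1.5]  # invented, in millions of dollars

# Rank columns from strongest to weakest absolute correlation with salary.
ranked = sorted(
    columns, key=lambda name: abs(pearson(columns[name], salary)), reverse=True
)
for name in ranked:
    print(name, round(pearson(columns[name], salary), 3))
```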
As these three variables are the best individual predictors of a player’s salary, I graphed them expecting the players with the best stats across all three to be the ones earning the highest salaries. However, I also expected to find players who perform well across these three stats but are compensated significantly less than players of similar caliber; these would be acquisition targets worth further consideration. To show the discrepancies between players’ salaries, I mapped each player’s salary to the size of their plotted point. The idea is that a small circle surrounded by much larger circles marks a player who is paid significantly less than other players of similar caliber.
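One simple way to turn salary into point size is a linear rescale into a fixed range of marker sizes. The sketch below assumes that approach; the function name `marker_sizes`, the size bounds, and the salaries are hypothetical, and the plot on this site may have used a different scaling.

```python
def marker_sizes(salaries, min_size=4.0, max_size=40.0):
    """Linearly rescale salaries to marker sizes in [min_size, max_size]."""
    lo, hi = min(salaries), max(salaries)
    if hi == lo:
        # All salaries equal: use one mid-range size for every point.
        return [(min_size + max_size) / 2] * len(salaries)
    scale = (max_size - min_size) / (hi - lo)
    return [min_size + (s - lo) * scale for s in salaries]


# Invented salaries: the lowest-paid player gets the smallest circle.
sizes = marker_sizes([1_000_000, 10_000_000, 40_000_000])
print([round(s, 1) for s in sizes])
```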
This visualization drew my attention to several players, specifically Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball.
After examining this visualization and equipped with cluster data from my initial k-means clustering, I then implemented a supervised machine learning regression approach to further examine the relationship between performance and compensation in order to predict who would be the most cost-efficient players to acquire. This model would consider all of their stats and the cluster that they were assigned to in the earlier k-means clustering model.
I developed a few machine learning regression models and ultimately chose to proceed with a generalized linear regression model. This model produced the following metrics:
## RMSE Rsquared MAE
## 2.535178e+06 9.337872e-01 1.881484e+06
In other words, the RMSE is about $2.54 million, the MAE is about $1.88 million, and R-squared is about 0.93.
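For readers unfamiliar with the three metrics reported for the model, the sketch below shows one common way to compute them. The salaries are invented (in millions) and `regression_metrics` is a hypothetical helper. R-squared is computed here as 1 − SS_res/SS_tot; some tools instead report the squared correlation between observed and predicted values, so the figure reported for the model may follow a different definition.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, R-squared (1 - SS_res/SS_tot), and MAE for paired values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "Rsquared": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Invented actual and predicted salaries, in millions of dollars.
metrics = regression_metrics(actual=[2, 4, 6, 8], predicted=[3, 4, 5, 10])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE does, which is why the RMSE for the model is the larger of the two error figures.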
With these metrics showing that the model performs well, I used it to predict what each player’s salary should be based on their stats. By subtracting a player’s actual salary from their predicted salary, I created a column (pred_vs_obs_residual) in which a large positive value means the player earns far less than the model predicts. Filtering and sorting on this column identifies the players who are the most underpaid according to the model.
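The residual column and the filter can be sketched like this. Only the column name `pred_vs_obs_residual` comes from the analysis; the players, salaries, and predictions are invented, and the other field names are hypothetical.

```python
# Invented rows: actual salary and the salary a model predicted.
rows = [
    {"player": "A", "salary": 5_000_000, "predicted": 9_000_000},
    {"player": "B", "salary": 20_000_000, "predicted": 18_500_000},
    {"player": "C", "salary": 2_000_000, "predicted": 3_200_000},
    {"player": "D", "salary": 12_000_000, "predicted": 12_000_000},
]

# Predicted minus actual: positive means the player earns less than predicted.
for row in rows:
    row["pred_vs_obs_residual"] = row["predicted"] - row["salary"]

# Keep the underpaid players and list the most underpaid first.
underpaid = sorted(
    (row for row in rows if row["pred_vs_obs_residual"] > 0),
    key=lambda row: row["pred_vs_obs_residual"],
    reverse=True,
)
for row in underpaid:
    print(row["player"], row["pred_vs_obs_residual"])
```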
From the 3D visualization, I became interested in Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball, as they had high performance markers and appeared to be significantly underpaid relative to their peers. With that interest established, I looked at the salary predictions from my supervised ML regression model to see which of these players would be the most cost-effective to acquire.
Our generalized linear model predicted that Trae Young would earn $10,723,726, but during the 2020-2021 season he was paid only $6,571,800. The model’s error metrics indicate that the average error of our predictions is approximately $2.5 million (RMSE) or $1.9 million (MAE). Even if the prediction is off by that average error, Trae Young is still significantly underpaid, which makes him a great candidate for a team looking to sign a high-caliber player for less money.
Looking at the 3D visualization, Donovan Mitchell is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Donovan Mitchell would earn $8,092,526, but during the 2020-2021 season he was paid only $5,195,501.
Looking at the 3D visualization, Shai Gilgeous-Alexander is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Shai Gilgeous-Alexander would earn $7,311,061, but during the 2020-2021 season he was paid only $4,141,320.
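As a quick check of the arithmetic, the sketch below takes the predicted and actual salaries quoted above, together with the RMSE and MAE reported for the model, and computes how much of each gap remains after allowing for the average error. The variable names are my own. Subtracting the RMSE is only a rough rule of thumb, not a formal confidence interval.

```python
RMSE = 2_535_178  # reported as 2.535178e+06
MAE = 1_881_484   # reported as 1.881484e+06

# (predicted salary, actual 2020-2021 salary), as quoted in the text.
players = {
    "Trae Young": (10_723_726, 6_571_800),
    "Donovan Mitchell": (8_092_526, 5_195_501),
    "Shai Gilgeous-Alexander": (7_311_061, 4_141_320),
}

# Gap between predicted and actual salary for each player.
residuals = {
    name: predicted - actual for name, (predicted, actual) in players.items()
}

for name, residual in residuals.items():
    margin = residual - RMSE  # gap left if the prediction is high by one RMSE
    print(f"{name}: residual ${residual:,}, margin beyond RMSE ${margin:,}")
```

All three gaps are larger than the RMSE, so each player still looks underpaid even under the more pessimistic of the two error estimates, though Donovan Mitchell’s margin is the thinnest.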