STEP 1: Load the Data

The data is read in from the NBA statistics and salaries csv files.

STEP 2: Process the Data

The two datasets are merged on the player's name so that each row holds a player's statistics alongside their salary. Duplicates are removed. Odd characters are filtered out with a regular expression: any character that is not a letter, a space, or punctuation is replaced with a blank.
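The cleaning and merging steps can be sketched as follows. This is a minimal illustration, not the code used for the report: the column names (`Player`, `PTS`, `Salary`), the player names, and every value are hypothetical placeholders, and "blank" is assumed to mean the empty string.

```python
import re
import string

# Keep ASCII letters, whitespace, and punctuation; everything else is "odd".
# Assumption: odd characters are replaced with the empty string.
_ODD_CHARS = re.compile("[^A-Za-z\\s" + re.escape(string.punctuation) + "]")


def clean_name(name):
    """Strip odd characters and surrounding whitespace from a player name."""
    return _ODD_CHARS.sub("", name).strip()


def merge_players(stats_rows, salary_rows):
    """Inner-join two lists of dicts on the cleaned player name.

    Only the first occurrence of each player is kept, which removes duplicates.
    """
    salaries = {}
    for row in salary_rows:
        salaries.setdefault(clean_name(row["Player"]), row["Salary"])

    merged = {}
    for row in stats_rows:
        name = clean_name(row["Player"])
        if name in salaries and name not in merged:
            merged[name] = {**row, "Player": name, "Salary": salaries[name]}
    return list(merged.values())


# Placeholder rows (made-up values, not taken from the NBA files).
stats = [
    {"Player": "Player A\ufffd", "PTS": 20.0},  # name carries an encoding artifact
    {"Player": "Player A", "PTS": 20.0},        # duplicate once the name is cleaned
    {"Player": "Player B", "PTS": 8.0},
    {"Player": "Player C", "PTS": 5.0},         # no salary row, so dropped by the join
]
salaries = [
    {"Player": "Player A", "Salary": 5_000_000},
    {"Player": "Player B", "Salary": 1_500_000},
]

merged = merge_players(stats, salaries)
for row in merged:
    print(row)
```

Cleaning the names before the join matters: otherwise the same player with and without an encoding artifact would be treated as two different keys.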

A new column is added with our “efficiency” metric: a combination of points scored, assists made, blocks, field goals, free throws, and turnovers. This is our overall metric, giving each player a single score for how “good” a player they are. We ignored the player’s position for this assignment. In the future, separate metrics could be built for each position and the data could be split by position in order to develop a more in-depth drafting strategy.
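The report does not state how the six statistics are weighted, so the sketch below shows only one possible way to build such a score. It is an assumption, not the formula used here: each statistic is min-max scaled to [0, 1], the five positive statistics are added, turnovers are subtracted, and the result is rescaled to [0, 1]. The column names and the sample rows are hypothetical.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Assumed column names; turnovers count against a player.
POSITIVE = ["PTS", "AST", "BLK", "FG", "FT"]
NEGATIVE = ["TOV"]


def efficiency(players):
    """Return one equal-weight efficiency score in [0, 1] per player."""
    raw = [0.0] * len(players)
    for col in POSITIVE:
        for i, scaled in enumerate(min_max([p[col] for p in players])):
            raw[i] += scaled
    for col in NEGATIVE:
        for i, scaled in enumerate(min_max([p[col] for p in players])):
            raw[i] -= scaled
    return min_max(raw)


# Placeholder rows (made-up values).
players = [
    {"PTS": 30, "AST": 9, "BLK": 1.0, "FG": 10, "FT": 8, "TOV": 2},
    {"PTS": 10, "AST": 3, "BLK": 0.5, "FG": 4, "FT": 2, "TOV": 3},
    {"PTS": 2, "AST": 1, "BLK": 0.0, "FG": 1, "FT": 0, "TOV": 4},
]

scores = efficiency(players)
print(scores)
```

Scaling each statistic first keeps high-volume columns such as points from dominating low-volume ones such as blocks; any other weighting would change the ranking.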

STEP 3: Determining the # of Clusters

We used the NbClust package to determine how many clusters to use. Of the indices it applies, the majority (12) voted for 2 clusters, so we chose 2 as the best-supported number of clusters.

STEP 4: Running k-means

We ran the k-means function with the pre-determined 2 centers and the Lloyd algorithm. Only 59.7% of the total variance was accounted for by the clustering (between-cluster sum of squares divided by total sum of squares), which was not nearly as good as the result on the Republican or Democrat data.
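To make the "variance accounted for" figure concrete, here is a small standalone sketch of Lloyd's algorithm together with the between-cluster / total sum-of-squares ratio. It is an illustration written from scratch, not the function used for the report; the points and the starting centers are made-up placeholders.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd_kmeans(points, centers, max_iter=100):
    """Alternate assignment and center-update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


def variance_explained(points, labels, centers):
    """Between-cluster sum of squares divided by total sum of squares."""
    grand = mean(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (total_ss - within_ss) / total_ss


# Placeholder 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, centers = lloyd_kmeans(points, centers=[(0.0, 0.0), (10.0, 0.0)])
ratio = variance_explained(points, labels, centers)
print(labels, centers, round(ratio, 3))
```

A ratio near 1 means the clusters are tight relative to the overall spread of the data; a value such as 59.7% indicates substantial spread remaining inside each cluster.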

STEP 5: Visualizing our Data

Given a low budget and the goal of building the best team, our drafting strategy is to first rank the players by their efficiency per dollar spent on them.

Players with a high efficiency/dollar are underpaid and fit the constraints of our team better. Players with a lower efficiency/dollar are overpaid and would not be good candidates given our budget and our goal of increasing the team's wins.

The clusters we see in the graphs help break down our players into high skill vs. low skill. Based on these two clusters, we would weight a player’s efficiency metric more than cost in the high skill cluster, and the efficiency per dollar metric more in the low skill cluster.

Within the high skill cluster, the data points begin to spread out in terms of skill. The extra skill a player brings, and the additional wins it could produce for the team, is more valuable than being able to afford a larger number of lower-skilled players because of the nature of basketball as a sport (limited number of players on the court, limited number of players on the team, etc.). This is why efficiency is emphasized over cost in the “high skill” cluster. Nonetheless, high cost does not signify high skill: we would still take the efficiency/dollar metric into account, but less so than in the low skill cluster.

Examples of good potential players in the high skill cluster are:

Trae Young (Standardized Salary = $0.15M, Efficiency = 0.764) He is the #9 player by efficiency and has the lowest salary among the top 10 most efficient players.

Luka Doncic (Standardized Salary = $0.18M, Efficiency = 0.757) He is one of the highest performing players in the NBA at rank 10, with the second lowest salary among the top 10 most efficient players.

Both of these NBA stars over-perform their pay level by a significant margin. As an agent, this is exactly the type of player we are looking for to fill the “star player” role. Other scouts may look at the other high-performing players’ salaries and correlate salary with skill, but other factors are often involved, such as the performance of the overall team, outside activities the player does to improve their personal brand, or time in the league. We are looking to pick up players who shine individually and whose salaries are not as inflated by factors unrelated to skill. Trae Young and Luka Doncic are good examples where high performance is worth the extra money spent on them.

Examples of good potential players in the low skill cluster are:

Shai Gilgeous-Alexander (Standardized Salary = $0.09, Efficiency = 0.60, Efficiency/Dollar = 6.66) Shai is the most efficient player in the low skill cluster and comes at a decent price. Together these give him a high efficiency per dollar score, making him a good middle ground between skill and price.

Naz Reid (Standardized Salary = $0.03, Efficiency = 0.32, Efficiency/Dollar = 10.7) Naz has the highest efficiency per dollar ratio in the low skill cluster at 10.7. He is a good example of a base player to fill our roster with, giving us a good amount of skill for every dollar spent. He won’t be our star player, but he also won’t deplete our budget without reason.

Both of these players are the kind of base players we need to fill out our team: they provide a baseline skill level that we know is balanced by their salary. We won’t spend too much money on them, so we will have good funds left for the high-performing cluster to find our star player. However, every star player also needs good teammates, which is where these players come in. Because the players’ data points are so crowded in cluster 1, efficiency per dollar is an important metric for ranking them: the marginal benefit each one brings is small compared to the marginal benefit from spending more on players in the high-performing cluster.
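As a worked check of the efficiency per dollar ranking, the sketch below recomputes the ratio for the four players named above from the rounded figures quoted in this report. It assumes all four standardized salaries are on the same scale, and small differences from the quoted ratios are expected because the inputs here are already rounded.

```python
# (name, standardized salary, efficiency) as quoted in the report, rounded.
players = [
    ("Trae Young", 0.15, 0.764),
    ("Luka Doncic", 0.18, 0.757),
    ("Shai Gilgeous-Alexander", 0.09, 0.60),
    ("Naz Reid", 0.03, 0.32),
]


def efficiency_per_dollar(salary, efficiency):
    """Efficiency divided by standardized salary; higher means more underpaid."""
    return efficiency / salary


# Sort from most to least efficiency per dollar.
ranked = sorted(
    ((name, efficiency_per_dollar(salary, eff)) for name, salary, eff in players),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, ratio in ranked:
    print(f"{name}: {ratio:.2f}")
```

On this metric alone the two low skill players rank above the two stars, which is why the high skill cluster is judged mainly on efficiency rather than on the ratio.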

NBA Clustering

Megan Lin

03/11/2021

Salary vs Efficiency

Plotly Diagram