Introduction

Before we begin our analysis, it helps to establish a clear goal: to identify and analyze players who are being underpaid. Because “underpaid” is a subjective label, we first give it an objective definition: a player is “underpaid” if they demonstrate high performance statistics but fall in a lower salary bracket. To assess performance, we will use k-means clustering. To assess salary, we will convert salary into a factor with four brackets (upper, upper-middle, lower-middle, and lower) whose bounds are the salary quartiles. As a preliminary step, we look at the relationships between the performance statistics and salary, starting with a correlation matrix between each metric and salary.
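As a rough illustration of the correlation step, the sketch below computes the Pearson correlation between each metric and salary. The original analysis appears to have been run in R; this version uses Python with only the standard library, and the helper `pearson` and the toy values are hypothetical, not taken from the real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data for five players (not values from the real dataset).
metrics = {
    "MPG": [34.0, 28.5, 12.0, 20.1, 31.2],
    "PPG": [25.1, 14.3, 4.2, 8.8, 18.9],
    "GP": [70, 82, 45, 60, 66],
}
salary = [30.0, 12.5, 1.8, 4.0, 9.5]  # in millions, hypothetical

# One row of the correlation matrix: each metric against salary.
correlations = {name: pearson(values, salary) for name, values in metrics.items()}
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {r:+.3f}")
```

Sorting by absolute value puts the strongest linear relationships first, which is how the metrics are ranked in the discussion that follows.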

Exploratory Data Analysis (EDA)

From these results, we see that MPG, PPG, and RPG have the three strongest linear relationships with salary. However, since correlation only describes linear relationships, we should also look at scatter plots to check for non-linear relationships. Although their correlations are weak, APG, STPG, BLKPG, and FG_PC may still be worth assessing. We should consider dropping GP, X3P_PC, and FT_PC, since their correlations with salary are virtually zero. Using the seven remaining variables, we can now visually assess their relationships through a scatterplot matrix.

Data Cleaning

Now that we have established the variables we would like to use, the next step is to standardize our data. This makes the variables unitless, so that no single variable dominates the distance calculations in clustering simply because it is measured on a larger scale. We then create a factor for the salary brackets, which lets us compare salary level against the performance-based clusters and identify players who are clustered into higher-performing groups but categorized as lower paid. We assign each player to one of the four categories stated earlier, using the quartile values as the bounds.
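The two cleaning steps can be sketched as follows, in Python with the standard library only. The function names, the toy values, and the choice to place a salary that sits exactly on a quartile into the lower bracket are all assumptions for illustration; `method="inclusive"` is used because it matches the default quantile definition in R.

```python
from statistics import mean, stdev, quantiles

def standardize(values):
    """Z-score a column: subtract the mean, divide by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def salary_bracket(salary, q1, q2, q3):
    """Map a salary to one of four brackets bounded by the quartiles."""
    if salary <= q1:
        return "lower"
    if salary <= q2:
        return "lower-mid"
    if salary <= q3:
        return "upper-mid"
    return "upper"

# Hypothetical salaries in millions (not from the real dataset).
salaries = [1.5, 2.0, 3.5, 5.0, 8.0, 12.0, 20.0, 30.0]
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
brackets = [salary_bracket(s, q1, q2, q3) for s in salaries]

# Hypothetical points-per-game column, standardized to mean 0 and sd 1.
ppg = [4.2, 8.8, 14.3, 18.9, 25.1]
ppg_z = standardize(ppg)
```

After standardization every performance column has mean 0 and standard deviation 1, which is why the cluster means reported later are centered around zero.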

K-Means Clustering

Deciding K

Now we want to perform our clustering. Before we cluster the data, we need to decide how many centers to use. To do this, we select the k value by majority vote: several cluster-validity indices each propose a number of clusters, and we take the value proposed most often.

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

Looking at the numerical and graphical output, we see that the optimal number of clusters to use for our k-means analysis is three. Now, we can use this to cluster our standardized performance data.
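The majority rule itself is simple to reproduce. The sketch below tallies the vote counts copied from the output above; it does not recompute the indices, only the final vote.

```python
from collections import Counter

# Vote counts copied from the output above:
# {candidate number of clusters: number of indices proposing it}
votes = Counter({2: 4, 3: 11, 4: 1, 5: 1, 8: 2, 14: 1, 15: 3})

best_k, n_votes = votes.most_common(1)[0]
total = sum(votes.values())
print(f"k = {best_k} chosen by {n_votes} of {total} indices")
# prints: k = 3 chosen by 11 of 23 indices
```

Note that three clusters wins with a plurality (11 of 23 votes) rather than an absolute majority, so the choice is well supported but not unanimous.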

K-Means Clustering

## K-means clustering with 3 clusters of sizes 110, 47, 73
## 
## Cluster means:
##          MPG        PPG        RPG      FG_PC       APG        STPG      BLKPG
## 1 -0.8338993 -0.7754207 -0.5286545 -0.2415170 -0.509444 -0.52827468 -0.3336184
## 2  0.5314130  0.4896430  1.5364339  1.0697283 -0.327948 -0.08973745  1.3409188
## 3  0.9144180  0.8531925 -0.1926083 -0.3247994  0.978800  0.85380651 -0.3606186
## 
## Clustering vector:
##   [1] 3 1 2 1 1 3 2 1 3 2 1 3 3 1 1 1 3 3 1 1 3 2 2 2 2 3 2 3 1 3 2 1 3 3 3 2 2
##  [38] 1 1 3 2 2 1 1 1 1 1 2 3 3 2 3 2 3 2 2 1 1 1 1 1 2 3 3 2 3 1 2 1 3 1 1 1 3
##  [75] 3 3 1 3 2 3 1 1 1 1 1 1 3 1 3 2 1 3 3 1 3 1 1 3 3 2 3 1 3 1 1 2 1 3 3 2 1
## [112] 2 1 3 3 1 3 1 2 1 2 2 2 1 3 3 2 1 3 1 3 3 3 1 1 3 1 1 3 1 1 1 1 1 3 1 3 3
## [149] 2 1 2 3 3 3 1 1 3 1 2 1 2 3 1 3 1 1 2 2 3 1 2 1 1 1 1 3 3 2 2 1 3 1 2 1 1
## [186] 1 1 2 1 1 3 3 3 1 1 3 1 1 3 1 1 3 1 1 3 1 1 1 1 3 3 1 1 1 1 2 3 1 1 1 2 1
## [223] 1 2 1 1 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 278.8752 253.4273 282.9645
##  (between_SS / total_SS =  49.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
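To make the procedure behind this output concrete, here is a minimal sketch of k-means using Lloyd's algorithm in plain Python. This is an illustration under stated assumptions, not the code that produced the output above: R's `kmeans` defaults to the Hartigan-Wong algorithm, and the function name `kmeans`, its `init` parameter, and the toy points are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates until stable."""
    rng = random.Random(seed)
    # Start from the given centers, or from k distinct observations chosen at random.
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable too
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical 2-D toy data: three well-separated groups of three points each.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (10.0, 0.1), (10.2, 0.0), (9.9, 0.2),
]
labels, centers = kmeans(points, k=3, init=[points[0], points[3], points[6]])
print(labels)
```

Passing explicit starting centers keeps the toy example deterministic. With random starts, Lloyd's algorithm can settle in a poor local optimum, which is why practical runs usually try several initializations and keep the best one.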

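We can also get an indication of the quality of our k-means model by looking at the explained variance, that is, the ratio of the between-cluster sum of squares to the total sum of squares (between_SS / total_SS in the output above).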

Model.Quality
0.4914117
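This value can be checked by hand from the k-means output. The sketch below assumes that each of the seven variables was scaled to unit sample variance, so that each contributes n - 1 to the total sum of squares; that assumption is not stated in the text, but it reproduces the reported figure.

```python
# Within-cluster sums of squares and cluster sizes copied from the k-means output.
withinss = [278.8752, 253.4273, 282.9645]
cluster_sizes = [110, 47, 73]
n_variables = 7  # MPG, PPG, RPG, FG_PC, APG, STPG, BLKPG

# Assumption: every variable was standardized with the sample standard
# deviation, so each one contributes (n - 1) to the total sum of squares.
n_players = sum(cluster_sizes)             # 230
total_ss = (n_players - 1) * n_variables   # 1603
between_ss = total_ss - sum(withinss)
explained = between_ss / total_ss
print(round(explained, 4))
# prints: 0.4914
```

In other words, the three clusters account for roughly 49% of the total variation in the standardized performance data.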

From here, we can look graphically at the results of our clustering and salary brackets. We are targeting players who are clustered in higher-performance groups but fall in lower salary brackets. For the sake of brevity, we preview these results by plotting the three variables with the highest correlation with salary: MPG, PPG, and RPG.

From these graphics, the three clusters appear to represent three performance groups. Cluster 1 seems to represent lower-performing players, cluster 2 average-performing players, and cluster 3 high-performing players. The individuals we are targeting are those in cluster 3 who are still categorized as “lower” or even “lower-mid” for their salary.

From these preliminary results, we find that there are players clustered in the higher-performance group who are still categorized as lower-paid players, meaning they make less than the 25th percentile of all player salaries. We can explore this relationship further with a 3D graphic. Once again for the sake of brevity, we plot only the three variables with the highest correlation with salary: MPG, PPG, and RPG.

Further Analysis

From these plots, we see that there are cases where a player has been clustered into the higher-performing group (3) but is still categorized as a lower or lower-mid paid player. These are the instances we would like to identify, so we extract the observations that are clustered in group 3 but fall in the “lower” or “lower-mid” salary bracket.
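The extraction step amounts to a filter on two conditions. A minimal sketch in Python, where the record layout, field names, and player entries are all hypothetical placeholders:

```python
# Hypothetical records; names, positions, clusters, and brackets are made up.
players = [
    {"name": "Player A", "position": "PG", "cluster": 3, "bracket": "lower"},
    {"name": "Player B", "position": "C", "cluster": 2, "bracket": "upper"},
    {"name": "Player C", "position": "SG", "cluster": 3, "bracket": "lower-mid"},
    {"name": "Player D", "position": "PG", "cluster": 3, "bracket": "upper"},
    {"name": "Player E", "position": "SF", "cluster": 1, "bracket": "lower"},
]

HIGH_CLUSTER = 3                       # the high-performance cluster
LOW_BRACKETS = {"lower", "lower-mid"}  # the two bottom salary brackets

# Keep only players who are both high performing and in a low salary bracket.
underpaid = [
    p for p in players
    if p["cluster"] == HIGH_CLUSTER and p["bracket"] in LOW_BRACKETS
]
underpaid_names = [p["name"] for p in underpaid]
print(underpaid_names)
```

Because k-means labels are arbitrary, the cluster number treated as high performing has to be read off the cluster means for each run rather than hard-coded across runs.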

Underpaid Players

We now have a list of all players who were categorized as “lower” or “lower-mid” pay but were still clustered in group 3, the group that tended to have higher performance statistics. We can look into this further: let’s see whether the number of underpaid players is balanced across positions.

Underpaid Players by Position

Here we see that there are more point guards than players at other positions who exhibit high performance statistics but are still categorized in a lower salary group. Next, let’s analyze the relationship between position and cluster.
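A position-by-cluster breakdown is a simple cross-tabulation. The sketch below builds one table of position counts per cluster; the (position, cluster) pairs are hypothetical and only illustrate the mechanics.

```python
from collections import Counter, defaultdict

# Hypothetical (position, cluster) pairs, not the real assignments.
assignments = [
    ("PG", 3), ("PG", 3), ("SG", 3),
    ("C", 2), ("PF", 2),
    ("C", 1), ("SF", 1), ("PG", 1),
]

# One Counter of positions per cluster.
table = defaultdict(Counter)
for position, cluster in assignments:
    table[cluster][position] += 1

for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
```

A `Counter` returns 0 for a missing key, so a position with no players in a cluster (for example `table[3]["C"]` here) reads as zero rather than raising an error.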

Cluster One

In cluster 1, we largely see a balanced sample across the different positions.

Cluster Two

In cluster 2, we start to see a much greater imbalance. Centers and power forwards have considerably more representation in the average-performance cluster, whereas shooting guards and small forwards have almost no representation.

Cluster Three

In cluster 3, we now see strong representation from point guards and shooting guards, the opposite of our results from the previous cluster. Notably, no centers are represented in cluster 3. Now, let’s see if there is a relationship between position and pay level.

Lower Salary

Lower-Middle Salary

Upper-Middle Salary

Upper Salary

Final Thoughts and Conclusions

Looking at these tables, there does not appear to be a substantial imbalance between position and pay level; the results are similar from table to table. Position alone therefore likely does not explain pay level. However, we do see imbalances between position and cluster membership. This means that certain positions are more likely to be clustered into higher- or lower-performing groups, which may in turn affect their pay level. This is a more plausible explanation for the relationship we see between the two. Also note that only three positions are represented in our low-pay but high-performance group. More in-depth industry knowledge may help to explain this pattern.

One potential explanation for why these players are not paid comparably to their performance is that they may be young or up-and-coming players: individuals who have just recently been drafted, or who are hitting an upswing in their career. Another potential explanation is that they play behind big star players. Although their performance is excellent, they are overshadowed by bigger names whose pay may be significantly higher due to exogenous commercial factors.

NBA Clustering

Jasmine Dogu

9/30/2020
