Overview

You are a scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team’s problems! She wants you to find players who are high performing but not highly paid, so you can steal them and get your team out of the toilet!

Objectives

Guidelines

Dataset

##              SALARY.M.           GP        MPG        PPG       FG_PC
## SALARY.M.  1.000000000 -0.004104394 0.51707936 0.56746359  0.23469041
## GP        -0.004104394  1.000000000 0.27874749 0.22490566  0.04671663
## MPG        0.517079360  0.278747488 1.00000000 0.86329120  0.05276614
## PPG        0.567463593  0.224905660 0.86329120 1.00000000  0.08786569
## FG_PC      0.234690409  0.046716629 0.05276614 0.08786569  1.00000000
## X3P_PC    -0.023310061  0.002371412 0.17773899 0.17130270 -0.45539619
##                 X3P_PC      FT_PC        RPG         APG        STPG      BLKPG
## SALARY.M. -0.023310061  0.0651504  0.4507075  0.31203196  0.25739673  0.2762625
## GP         0.002371412  0.1310821  0.1551020  0.05158089  0.05044995  0.1127017
## MPG        0.177738994  0.2450232  0.4776952  0.59586410  0.65880924  0.2268268
## PPG        0.171302696  0.3447196  0.4489743  0.57817015  0.56012678  0.1676730
## FG_PC     -0.455396192 -0.4332171  0.5710216 -0.23358128 -0.08938333  0.5664555
## X3P_PC     1.000000000  0.4856232 -0.3980713  0.29570023  0.21275814 -0.3388815

We selected the three variables with the highest correlation to salary for analysis: PPG (r = 0.57), MPG (r = 0.52), and RPG (r = 0.45).
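The selection step above can be sketched in R as follows. The data frame is simulated and purely hypothetical (only the column names are taken from the correlation output), so the printed values will not match the report.

```r
# Hypothetical stand-in for the NBA data; only the column names come from the report.
set.seed(1)
n <- 230
nba <- data.frame(MPG = runif(n, 10, 38))
nba$PPG <- 0.5 * nba$MPG + rnorm(n, sd = 3)
nba$RPG <- 0.15 * nba$MPG + rnorm(n, sd = 1.5)
nba$FT_PC <- runif(n, 0.5, 0.9)
nba$SALARY.M. <- 0.4 * nba$PPG + 0.2 * nba$RPG + rnorm(n, sd = 3)

# Correlation of every column with salary, dropping salary's correlation with itself.
cors <- cor(nba)[, "SALARY.M."]
cors <- cors[names(cors) != "SALARY.M."]

# Rank by absolute correlation and keep the three strongest variables.
top3 <- names(sort(abs(cors), decreasing = TRUE))[1:3]
print(round(cors[top3], 3))
```

Ranking by absolute value is a design choice made here so that a strong negative correlation would also be picked up; in the report's table the three selected variables are all positively correlated with salary.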

Clustering by Points Per Game and Salary

## K-means clustering with 2 clusters of sizes 55, 175
## 
## Cluster means:
##   SALARY.M.       PPG
## 1  12.60545 18.214545
## 2   3.51600  9.042286
## 
## Clustering vector:
##   [1] 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2
##  [38] 2 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1
##  [75] 1 1 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 2 1 2 2 1 2 2 1 2 2
## [112] 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2
## [149] 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 2
## [186] 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2
## [223] 2 1 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 2536.297 3633.942
##  (between_SS / total_SS =  53.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualized the output

By looking at this cluster analysis, we are able to identify underpaid players who have a high PPG average as well as overpaid players with low PPG averages. This allows us to identify trade targets and players to avoid.

Quality-analysis of clustering

## [1] 0.5307197

The two clusters account for 0.5307 of the total variance in Salary and PPG (between_SS / total_SS).
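As a minimal sketch of where this number comes from, the ratio can be read off the `betweenss` and `totss` components that `kmeans()` lists under "Available components". The data below are simulated and hypothetical, so the resulting value differs from the report's.

```r
# Hypothetical two-group data with the same column names as the report.
set.seed(42)
toy <- data.frame(
  SALARY.M. = c(rnorm(40, 12, 3), rnorm(120, 3.5, 2)),
  PPG       = c(rnorm(40, 18, 4), rnorm(120, 9, 3))
)

fit <- kmeans(toy, centers = 2, nstart = 25)

# Share of total variance explained by the clustering.
explained <- fit$betweenss / fit$totss
print(explained)

# The total sum of squares splits into a between part and a within part.
print(all.equal(fit$totss, fit$betweenss + fit$tot.withinss))
```

A value close to 1 means the cluster centers capture most of the spread in the data; a value close to 0 means the clusters explain very little.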

##  [1] 6.917202e-16 5.307197e-01 6.959266e-01 7.617605e-01 7.971992e-01
##  [6] 8.470543e-01 8.691314e-01 8.859761e-01 8.989823e-01 9.041396e-01

Created an elbow chart for explained variance
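An elbow chart like the one described above could be produced along these lines. This is a sketch on simulated, hypothetical data; the helper name `explained_var` is an assumption and not taken from the original analysis.

```r
# Hypothetical helper: explained variance (between_SS / total_SS) for a given k.
explained_var <- function(data, k, nstart = 25) {
  fit <- kmeans(data, centers = k, nstart = nstart)
  fit$betweenss / fit$totss
}

# Simulated stand-in data with the report's column names.
set.seed(7)
toy <- data.frame(
  SALARY.M. = c(rnorm(60, 12, 3), rnorm(140, 3.5, 2)),
  PPG       = c(rnorm(60, 18, 4), rnorm(140, 9, 3))
)

# Explained variance for k = 1..10, matching the range of the printed vector.
ks <- 1:10
ev <- sapply(ks, function(k) explained_var(toy, k))
print(round(ev, 4))

# The "elbow" is where adding another cluster stops giving a large gain.
plot(ks, ev, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Explained variance (between_SS / total_SS)")
```

With k = 1 every point belongs to the same cluster, so the explained variance is zero up to floating-point error, which is why the first entry in the output above is on the order of 1e-16.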

Clustering analysis for “PPG” and “Salary”

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## [1] 15

NbClust determined that 2 is the optimal number of clusters (11 of the 24 indices tallied above voted for it). We ran the quality analysis for 2 clusters earlier: the clusters account for 0.5307 of the variance in Salary and PPG.
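The majority rule in the summary above amounts to a simple vote count. The sketch below copies the vote counts by hand from that summary rather than extracting them from an NbClust object, and it also reports ties, which the printed conclusion does not mention.

```r
# Votes copied from the summary above: names are candidate k, values are index counts.
votes <- c("2" = 11, "3" = 6, "4" = 1, "6" = 1, "9" = 2, "14" = 1, "15" = 2)

# Every k that reaches the maximum vote count is a winner; more than one means a tie.
best <- names(votes)[votes == max(votes)]

cat("Winner(s): k =", best, "with", max(votes), "of", sum(votes), "votes\n")
if (length(best) > 1) {
  cat("Tie: the vote count alone does not settle the number of clusters\n")
}
```

Here k = 2 wins with a plurality rather than an outright majority of the tallied indices, so it is worth checking the result against the elbow chart as well.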

Clustering by Minutes per Game and Salary

## Warning: did not converge in 10 iterations
## K-means clustering with 2 clusters of sizes 108, 122
## 
## Cluster means:
##   SALARY.M.      MPG
## 1  8.895370 32.41759
## 2  2.851639 20.33770
## 
## Clustering vector:
##   [1] 2 2 1 1 2 1 1 2 1 1 2 1 1 2 2 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 2
##  [38] 2 2 1 1 1 2 2 2 2 2 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 2 1
##  [75] 1 1 2 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 1 2 1 1 1 2
## [112] 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 2 1 1 2 2 1 2 2 2 2 2 2 2 1 2
## [149] 1 2 1 1 1 1 2 2 1 2 2 2 2 1 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 1 2 1 2 1 2 2
## [186] 2 1 1 2 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1 1 2 2 2 1 2
## [223] 2 1 2 2 2 2 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 4574.524 3299.031
##  (between_SS / total_SS =  57.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"      
## Warning: did *not* converge in specified number of iterations
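The convergence warnings above suggest that the algorithm hit its iteration cap (the default for `kmeans()` is `iter.max = 10`). One common remedy, sketched here on simulated and hypothetical data, is to raise `iter.max` and use several random starts via `nstart`; whether this changes the report's clusters is not something the output above can tell us.

```r
# Hypothetical stand-in data with the report's column names.
set.seed(3)
toy <- data.frame(
  SALARY.M. = c(rnorm(100, 9, 3), rnorm(120, 3, 1.5)),
  MPG       = c(rnorm(100, 32, 3), rnorm(120, 20, 4))
)

# Allow more iterations and keep the best of 25 random starts.
fit <- kmeans(toy, centers = 2, iter.max = 100, nstart = 25)

print(fit$iter)    # number of iterations used by the returned solution
print(fit$ifault)  # 0 means the algorithm reported no problem
print(fit$betweenss / fit$totss)
```

Using `nstart` greater than 1 also makes the result less dependent on the random initial centers, which helps when results need to be reproduced.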

Visualized the output

By looking at this cluster analysis, we are able to identify underpaid players who have a high MPG average as well as overpaid players with low MPG averages. This allows us to identify trade targets and players to avoid.

## [1] 0.5703517

The variance accounted for by clusters for Salary and MPG is 0.5704.

##  [1] 0.0000000 0.5703517 0.7453753 0.8297462 0.8508755 0.8762168 0.8923871
##  [8] 0.8986121 0.9092996 0.9126940

Created an elbow chart for explained variance

Clustering analysis for “Minutes per Game” and “Salary”

## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 2 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## [1] 15

## [1] 0.7453753

NbClust shows a tie between 2 and 3 clusters, with 7 votes each, although its majority rule reports 2. We ran the quality analysis for 2 clusters earlier, and the variance explained is 0.5704. With 3 clusters, the variance explained rises to 0.7454. Explained variance generally increases as clusters are added, so this is a judgment call, but the gain from 2 to 3 clusters (about 0.18) is much larger than the gain from 3 to 4 (about 0.08, to 0.8297). Given the tie and this large gain, we decided to use 3 clusters. Below is the graph with three clusters for MPG.

Visualized the output for 3 clusters

Cluster 1 represents the highly paid players who play a lot of minutes per game. Cluster 2 represents the low-paid players who play only a few minutes per game. Cluster 3 represents the players between these two groups: a medium salary and an average number of minutes per game. Looking specifically at Cluster 3, we are able to identify players who play a lot of minutes yet are not paid much. That said, MPG is not the best metric for evaluating players, because it doesn’t capture the quality of a player’s performance while on the court.
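Because k-means numbers its clusters arbitrarily, the labels 1, 2, and 3 can change from one run to the next. A minimal sketch of one way to make the interpretation stable is to rank the clusters by their mean salary and name them accordingly. The data and the tier names below are hypothetical.

```r
# Hypothetical three-group data with the report's column names.
set.seed(11)
toy <- data.frame(
  SALARY.M. = c(rnorm(50, 15, 2), rnorm(50, 7, 1.5), rnorm(50, 2, 0.5)),
  MPG       = c(rnorm(50, 34, 2), rnorm(50, 27, 2), rnorm(50, 17, 3))
)

fit <- kmeans(toy, centers = 3, nstart = 25)

# Cluster ids ordered from highest to lowest mean salary.
ord <- order(fit$centers[, "SALARY.M."], decreasing = TRUE)

# match() turns each player's cluster id into its salary rank (1 = highest).
tier <- factor(match(fit$cluster, ord), levels = 1:3,
               labels = c("high", "medium", "low"))

print(fit$centers[ord, ])
print(table(tier))
```

With named tiers, statements such as "the medium-salary group" stay valid even if a rerun swaps the numeric cluster labels.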

Clustering by Rebounds per Game and Salary

## K-means clustering with 2 clusters of sizes 47, 183
## 
## Cluster means:
##   SALARY.M.      RPG
## 1 14.357447 6.995745
## 2  3.463388 4.016393
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
##  [38] 2 2 2 1 1 2 2 2 2 2 1 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
##  [75] 1 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 1 2
## [112] 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2
## [149] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2
## [186] 2 1 1 2 2 2 1 1 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2
## [223] 2 1 2 1 2 2 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 1083.074 1687.836
##  (between_SS / total_SS =  63.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualized the output

By looking at this cluster analysis, we are able to identify underpaid players who have a high RPG average as well as overpaid players with low RPG averages. This allows us to identify trade targets and players to avoid.
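The report identifies underpaid players by inspecting the cluster plot. As a rough, hypothetical stand-in for that visual step, the sketch below flags players whose statistic is above the median while their salary is below the median. The function name `flag_value`, the median cut-offs, and the toy players are all assumptions made for illustration, not the method used in the report.

```r
# Hypothetical helper: players above the median on `stat` but below the median salary.
flag_value <- function(df, stat, salary = "SALARY.M.") {
  high_stat <- df[[stat]] > median(df[[stat]])
  low_pay   <- df[[salary]] < median(df[[salary]])
  df[high_stat & low_pay, , drop = FALSE]
}

# Made-up players; salaries in millions, matching the SALARY.M. column.
toy <- data.frame(
  PLAYER    = c("A", "B", "C", "D", "E", "F"),
  SALARY.M. = c(20, 2, 3, 15, 1, 18),
  RPG       = c(9, 8, 2, 3, 7, 4)
)

targets <- flag_value(toy, "RPG")
print(targets)
```

The same helper can be pointed at `"PPG"` or `"MPG"` by changing the `stat` argument, and swapping the two comparisons would flag overpaid, low-output players instead.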

## [1] 0.6325535

The variance accounted for by clusters for Salary and RPG is 0.6326.

##  [1] 3.618205e-16 6.325535e-01 7.479180e-01 7.950819e-01 8.500925e-01
##  [6] 8.740251e-01 8.909245e-01 9.066414e-01 9.154651e-01 9.236271e-01

Created an elbow chart for explained variance

##   k explained_var_nba_RPG
## 1 1          3.618205e-16
## 2 2          6.325535e-01
## 3 3          7.479180e-01
## 4 4          7.950819e-01
## 5 5          8.500925e-01
## 6 6          8.740251e-01

Clustering analysis for “Rebounds per Game” and “Salary”

## ******************************************************************* 
## * Among all indices:                                                
## * 12 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 4 proposed 8 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
##          freq_k_NBA_RPG
## KL                    2
## CH                    2
## Hartigan              3
## CCC                   2
## Scott                 3
## Marriot               8
## [1] 15

NbClust determined that 2 is the optimal number of clusters (12 of the 24 indices tallied above voted for it). We ran the quality analysis for 2 clusters earlier: the clusters account for 0.6326 of the variance in Salary and RPG.

3D Analysis

3D plot of clustering results of PPG, MPG, and Salary

This 3D plot combines PPG, MPG, and salary in a single view of the clustering results.

Conclusions

In this report, we analyzed PPG, MPG, and RPG and their relation to salary. This analysis lets us compare players’ performances and salaries. From it, we are able to identify both underrated players who perform extremely well yet aren’t paid much and overrated players who are paid a lot yet do not perform well. This gives the Wizards’ GM potential trade targets, as well as players to either cut or avoid signing. For example, we identify Anthony Davis, Paul George, and DeMarcus Cousins as underpaid, high-performing trade targets.