You are a data-focused scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team's problems! She wants you to figure out a way to find players who are high performing but not highly paid, so that you can steal them (through contract offers or trades) and get your team out of last place!

Details:

Determine a way to use clustering to estimate, based on performance, whether players are generally under- or overpaid.
Provide a well-commented and clean (knitted) report of your findings that can be presented to your GM. Include a rationale for variable selection, details on your approach, an overview of the results with supporting visualizations and, most importantly, recommendations on players she should consider pursuing.

Hints:

Salary (2020-21) is the variable you are trying to understand.
You can include numerous performance variables in the clustering, but when interpreting the results you might want to use graphs built on the variables that are most correlated with Salary. Correlation with Salary can also be a good way to select which variables to include. Don’t include Salary itself in the clustering.
You’ll need to standardize the variables before performing the clustering.
Remember interpretation/evaluation of the results is the most important part of every data science project. Your evaluation should be clearly linked to the question you are trying to answer and include a discussion on the risks associated with using your model and next steps.
The data you are getting will need to be merged, cleaned and prepared for analysis. Create a function called “nba_pre_processing” for these tasks.
You’ll likely be building one or a small number of central graphics that will guide much of your recommendations. Turn this graphic(s) into a function that takes inputs to help you present your recommendations.
Work together with your team members.
Each group will present their recommendations in class, 5 minutes per group. Your results don’t have to be identical, but you must agree on who will present (1 per group). We will continue to do this moving forward, and the presenter must rotate each week.

Pre-processing function and setup

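# Merge, clean, and de-duplicate the player stats and salary tables.
nba_pre_processing <- function(nba, nba_salaries){
  merged <- merge(nba, nba_salaries, by="Player")
  merged <- na.omit(merged)
  # If a player appears in more than one row, sort so that the row with the
  # most points comes first, then keep only that row.
  merged <- merged[order(merged[,'Player'],-merged[,'PTS']),]
  merged <- merged[!duplicated(merged$Player),]
  return(merged)
}

# Read the raw files once, outside the function, and pass them in.
nba <- read.csv("nba2020-21.csv-1")
nba_salaries <- read.csv("nba_salaries_21.csv-1")
merged <- nba_pre_processing(nba, nba_salaries)
clust_data_nba <- merged[,-c(1,2,4)] # drop the categorical columns
head(clust_data_nba)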

##   Age  G GS  MP  FG FGA   FG. X3P X3PA  X3P. X2P X2PA  X2P.  eFG. FT FTA   FT.
## 1  25 19 19 552  91 213 0.427  31   84 0.369  60  129 0.465 0.500 49  80 0.613
## 2  24 35  6 667  95 256 0.371  38  108 0.352  57  148 0.385 0.445 29  40 0.725
## 3  21 18  0 281  25  63 0.397  17   48 0.354   8   15 0.533 0.532  8  12 0.667
## 4  27 19  0 273  44  90 0.489  13   33 0.394  31   57 0.544 0.561 20  25 0.800
## 5  34 24 24 677 138 312 0.442  47  132 0.356  91  180 0.506 0.518 14  18 0.778
## 6  30  9  6 163  17  43 0.395   6   15 0.400  11   28 0.393 0.465  7   9 0.778
##   ORB DRB TRB AST STL BLK TOV PF PTS X2020.21
## 1  33 104 137  80  14  16  53 38 262 18136364
## 2   7  39  46  63  20   4  29 62 257  2345640
## 3  10  34  44   7   3   5  11 37  75  3458400
## 4   3  43  46  15   6   7  13 24 121  1752950
## 5  23 137 160  84  21  21  27 43 337 27500000
## 6   9  28  37  15   8   5  10  9  47  9720900

Correlation Matrix

From the correlation matrix, Assists (AST) and Points (PTS) are among the variables most strongly correlated with Salary.

res <- cor(merged[,-c(1,2,4)]) #taking out categorical variables
new <- as.data.frame(round(res, 2)) #correlation matrix
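
One way to make the Assists and Points claim easy to check is to sort the Salary column of the correlation matrix. The following is a minimal sketch, assuming the salary column is named X2020.21 as in the output above. The helper name rank_salary_cor and the toy data frame are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: rank numeric variables by their correlation with salary.
rank_salary_cor <- function(df, salary_col = "X2020.21") {
  num <- df[, sapply(df, is.numeric), drop = FALSE]  # keep numeric columns only
  cors <- cor(num)[, salary_col]                     # salary column of the matrix
  cors <- cors[names(cors) != salary_col]            # drop salary vs. itself
  sort(round(cors, 2), decreasing = TRUE)
}

# Toy data (made-up numbers), only to show the call.
toy <- data.frame(
  PTS = c(100, 300, 500, 700, 900),
  AST = c(20, 90, 60, 200, 150),
  Age = c(30, 22, 27, 24, 29),
  X2020.21 = c(1e6, 3e6, 5e6, 8e6, 9e6)
)
top_vars <- rank_salary_cor(toy)
print(top_vars)
```

With the real data the call would be rank_salary_cor(clust_data_nba), and the first few names in the result are candidates for the plot axes.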

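# Standardize the performance variables and leave salary (X2020.21) out of
# the clustering, then run the algorithm with 3 centers.
perf_scaled <- scale(clust_data_nba[, names(clust_data_nba) != "X2020.21"])
set.seed(1)
kmeans_obj_nba <- kmeans(perf_scaled, centers = 3,
                         algorithm = "Lloyd")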
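
The hints ask for standardized variables, and the report does not say why three centers were chosen. An elbow plot is one common check. The sketch below runs on made-up data, so toy_perf and its numbers are assumptions for illustration only. With the real data, the input would be the performance columns of clust_data_nba with X2020.21 left out.

```r
# Toy performance table (made-up numbers) with three loose groups of players.
set.seed(1)
toy_perf <- data.frame(
  PTS = c(rnorm(10, 100, 20), rnorm(10, 400, 40), rnorm(10, 900, 60)),
  AST = c(rnorm(10, 20, 5),   rnorm(10, 90, 15),  rnorm(10, 250, 30))
)

# scale() centers each column to mean 0 and rescales it to sd 1, so that
# large-valued columns do not dominate the Euclidean distances.
toy_scaled <- scale(toy_perf)

# Total within-cluster sum of squares for k = 1..6 (elbow method).
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The k at which the curve stops dropping sharply is a reasonable candidate for the number of centers.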

#View the results
# Tell R to read the cluster labels as factors so that ggplot2 
# (the graphing package) can read them as category labels instead of 
# continuous variables (numeric variables).

Subset players we want to display: higher performing and lower cost

playerswewant<- merged[(merged$PTS>500)&
                          (merged$AST>175) &
                          (merged$X2020.21<10000000),]

#Visualize the output
library(ggplot2)
clusters_nba <- as.factor(kmeans_obj_nba$cluster)

# k-means numbers its clusters arbitrarily: check kmeans_obj_nba$centers
# before trusting the legend labels assigned below.
ggplot(merged, aes(x = AST,
                   y = PTS,
                   shape = clusters_nba,
                   color = X2020.21)) +
  geom_point(size = 4) +
  geom_text(aes(label = ifelse(PTS > 500 & AST > 175 & X2020.21 < 10000000, Player, '')),
            hjust = 0, vjust = 0) +
  ggtitle("AST & PTS vs. Salary Clustering") +
  xlab("Assists") +
  ylab("Points") +
  scale_color_gradient(name = "Salary", low = "red", high = "blue") +
  scale_shape_manual(name = "Cluster",
                     labels = c("High performing", "Mid performance level", "Lower performing"),
                     values = c("1", "2", "3")) +
  theme_light()
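
The hints ask for the central graphic to be turned into a function that takes inputs. The following is a sketch of one way to do that, not the report's own code. The function name plot_value_players, its parameters, and the toy players are all hypothetical. It assumes the column names Player, PTS, AST, and X2020.21 used above.

```r
library(ggplot2)

# Hypothetical wrapper around the plot above. The thresholds are arguments so
# the same graphic can be redrawn for different definitions of "good value".
plot_value_players <- function(df, clusters, min_pts = 500, min_ast = 175,
                               max_salary = 1e7) {
  df$Cluster <- as.factor(clusters)
  # Label only players above both performance cut-offs and below the
  # salary threshold.
  df$Label <- ifelse(df$PTS > min_pts & df$AST > min_ast &
                       df$X2020.21 < max_salary, df$Player, "")
  ggplot(df, aes(x = AST, y = PTS, shape = Cluster, color = X2020.21)) +
    geom_point(size = 4) +
    geom_text(aes(label = Label), hjust = 0, vjust = 0, show.legend = FALSE) +
    scale_color_gradient(name = "Salary", low = "red", high = "blue") +
    labs(title = "AST & PTS vs. Salary Clustering",
         x = "Assists", y = "Points") +
    theme_light()
}

# Toy example (made-up players and numbers).
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  PTS = c(650, 120, 800, 300),
  AST = c(200, 30, 260, 90),
  X2020.21 = c(4e6, 2e6, 3e7, 9e6)
)
p <- plot_value_players(toy, clusters = c(1, 3, 1, 2))
```

With the real data the call would look like plot_value_players(merged, kmeans_obj_nba$cluster, max_salary = 5e6), which makes it easy to show the GM how the shortlist changes as the salary threshold moves.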

# List of Players we want
playerswewant$Player

## [1] "Bam Adebayo"             "De'Aaron Fox"           
## [3] "Donovan Mitchell"        "LaMelo Ball"            
## [5] "Luka Dončić"             "Shai Gilgeous-Alexander"
## [7] "Trae Young"

First, we determined which performance variables are most correlated with Salary and settled on Assists and Points. To decide which players to recommend that our GM try to acquire, we subset players with a salary under $10,000,000, more than 500 Points, and more than 175 Assists. In the clustering plot, these are the players toward the upper right (high performance stats) who are colored bright red (low salary). This yielded: Donovan Mitchell, De’Aaron Fox, LaMelo Ball, Trae Young, Bam Adebayo, Luka Dončić, and Shai Gilgeous-Alexander. These players offer good value and are good trade targets for the Wizards GM.

Drawbacks of this method include the fact that k-means cannot incorporate categorical variables, even though categorical factors like position can be relevant to the decision. In addition, we can’t reasonably assume that underpaid players will be willing to switch teams and continue to be paid the same salary.
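
One possible next step is to tie the under- or overpaid judgment more directly to the clusters, instead of to fixed thresholds. The sketch below compares each player's salary with the median salary of their cluster. It is an illustration under assumptions, not the method used in this report. The helper name salary_vs_cluster and the toy data are made up.

```r
# Hypothetical helper: compare each player's salary with the median salary of
# the cluster they were assigned to. A ratio well below 1 suggests the player
# is paid less than peers with a similar performance profile.
salary_vs_cluster <- function(df, clusters, salary_col = "X2020.21") {
  df$Cluster <- clusters
  # ave() returns, for every row, the median salary of that row's cluster.
  df$ClusterMedian <- ave(df[[salary_col]], df$Cluster, FUN = median)
  df$SalaryRatio <- df[[salary_col]] / df$ClusterMedian
  df[order(df$SalaryRatio), ]
}

# Toy example (made-up players, salaries, and cluster labels).
toy <- data.frame(
  Player = c("A", "B", "C", "D", "E", "F"),
  X2020.21 = c(4e6, 2e7, 3e7, 2e6, 3e6, 1e6)
)
ranked <- salary_vs_cluster(toy, clusters = c(1, 1, 1, 2, 2, 2))
print(ranked[, c("Player", "Cluster", "SalaryRatio")])
```

With the real data the call would be salary_vs_cluster(merged, kmeans_obj_nba$cluster). Players near the top of the result who also sit in the high-performing cluster would be the candidates to look at first.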

NBA Clustering RMD– Emma Seiberlich
