You are a data-focused scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team's problems! She wants you to figure out a way to find players who are high performing but maybe not highly paid that you can steal (offer contracts/trade) to get your team out of last place!
Details:
Hints:
Salary (2020-21) is the variable you are trying to understand.
You can include numerous performance variables in the clustering, but when interpreting the results you might want to use graphs built on the variables most correlated with Salary. Correlation with Salary might also be a good way to select which variables to include, but don’t include Salary itself in the clustering.
You’ll need to standardize the variables before performing the clustering.
Remember interpretation/evaluation of the results is the most important part of every data science project. Your evaluation should be clearly linked to the question you are trying to answer and include a discussion on the risks associated with using your model and next steps.
The data you are getting will need to be merged, cleaned and prepared for analysis. Create a function called “nba_pre_processing” for these tasks.
You’ll likely be building one or a small number of central graphics that will guide much of your recommendations. Turn this graphic(s) into a function that takes inputs to help you present your recommendations.
Work together with your team members.
Each group will present their recommendations in class, 5 minutes per group. Your results don’t have to be identical, but you must agree on who will present (1 per group). We will continue to do this moving forward, and the presenter must rotate each week.
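nba_pre_processing <- function(nba_file = "nba2020-21.csv-1",
                               salaries_file = "nba_salaries_21.csv-1"){
  # Read the player statistics and the salary table.
  nba <- read.csv(nba_file)
  nba_salaries <- read.csv(salaries_file)
  # Merge on player name and drop rows with missing values.
  merged <- merge(nba, nba_salaries, by="Player")
  merged <- na.omit(merged)
  # Players can appear more than once; keep the row with the most points.
  merged <- merged[order(merged[,'Player'],-merged[,'PTS']),]
  merged <- merged[!duplicated(merged$Player),]
  return(merged)
}
merged <- nba_pre_processing()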
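To make the merge-and-deduplicate logic easier to check, here is a minimal sketch on toy data. The data frames `toy_stats` and `toy_salaries`, their values, and the `Salary` column name are hypothetical stand-ins, not the real CSV contents.

```r
# Toy stand-ins for the two CSV files (hypothetical values, not real data).
toy_stats <- data.frame(
  Player = c("A", "A", "B", "C"),
  Tm     = c("T1", "T2", "T3", "T4"),
  PTS    = c(300, 120, 50, 80)
)
toy_salaries <- data.frame(
  Player = c("A", "B", "D"),
  Salary = c(1000000, 2000000, 3000000)
)

# Inner join on Player: C (no salary row) and D (no stats row) drop out.
toy_merged <- merge(toy_stats, toy_salaries, by = "Player")
toy_merged <- na.omit(toy_merged)

# Sort by Player, then by PTS descending, so that each player's
# highest-scoring row comes first.
toy_merged <- toy_merged[order(toy_merged$Player, -toy_merged$PTS), ]

# Keep only the first (highest-PTS) row per player.
toy_merged <- toy_merged[!duplicated(toy_merged$Player), ]

print(toy_merged)
```

Player A appears twice in the toy statistics, and only the 300-point row survives. Note that `merge()` performs an inner join by default, so any player missing from either table is silently dropped.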
clust_data_nba <- merged[,-c(1,2,4)]
head(clust_data_nba)
##   Age  G GS  MP  FG FGA   FG. X3P X3PA  X3P. X2P X2PA  X2P.  eFG. FT FTA   FT.
## 1  25 19 19 552  91 213 0.427  31   84 0.369  60  129 0.465 0.500 49  80 0.613
## 2  24 35  6 667  95 256 0.371  38  108 0.352  57  148 0.385 0.445 29  40 0.725
## 3  21 18  0 281  25  63 0.397  17   48 0.354   8   15 0.533 0.532  8  12 0.667
## 4  27 19  0 273  44  90 0.489  13   33 0.394  31   57 0.544 0.561 20  25 0.800
## 5  34 24 24 677 138 312 0.442  47  132 0.356  91  180 0.506 0.518 14  18 0.778
## 6  30  9  6 163  17  43 0.395   6   15 0.400  11   28 0.393 0.465  7   9 0.778
##   ORB DRB TRB AST STL BLK TOV PF PTS X2020.21
## 1  33 104 137  80  14  16  53 38 262 18136364
## 2   7  39  46  63  20   4  29 62 257  2345640
## 3  10  34  44   7   3   5  11 37  75  3458400
## 4   3  43  46  15   6   7  13 24 121  1752950
## 5  23 137 160  84  21  21  27 43 337 27500000
## 6   9  28  37  15   8   5  10  9  47  9720900
From the correlation matrix, two of the variables most correlated with Salary are Assists and Points.
res <- cor(merged[,-c(1,2,4)]) #taking out categorical variables
new <- as.data.frame(round(res, 2)) #correlation matrix
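Reading a full correlation matrix by eye is error-prone, so one option is to pull out only the Salary column and sort it. The sketch below does this on a small hypothetical data frame (`toy`, with made-up values) that stands in for the numeric columns of `merged`.

```r
# Hypothetical toy data standing in for the numeric columns of `merged`.
toy <- data.frame(
  PTS      = c(100, 200, 300, 400, 500),
  AST      = c(20, 50, 40, 90, 100),
  Age      = c(30, 22, 27, 25, 33),
  X2020.21 = c(1e6, 2e6, 3e6, 4e6, 5e6)
)

# Full correlation matrix, as in the analysis above.
res_toy <- cor(toy)

# Keep only the correlations with salary, drop salary's correlation
# with itself, and sort from most to least positively correlated.
salary_cor <- res_toy[, "X2020.21"]
salary_cor <- salary_cor[names(salary_cor) != "X2020.21"]
salary_cor <- sort(salary_cor, decreasing = TRUE)

print(round(salary_cor, 2))
```

Sorting on `abs(salary_cor)` instead would also surface strongly negative correlations, which may matter for variables such as turnovers.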
# Run k-means with 3 centers.
set.seed(1)
kmeans_obj_nba <- kmeans(clust_data_nba, centers = 3,
                         algorithm = "Lloyd")
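The hints call for standardizing the variables and leaving Salary out of the clustering, because a column measured in millions of dollars would otherwise dominate the Euclidean distances. A minimal sketch of that preparation follows, using a hypothetical toy data frame in place of `clust_data_nba`. The `nstart = 25` argument is an addition that is not in the call above; it only makes the toy result less sensitive to the random starting centers.

```r
# Hypothetical toy data standing in for `clust_data_nba`.
toy <- data.frame(
  PTS      = c(50, 60, 55, 500, 520, 510, 1000, 1020, 990),
  AST      = c(10, 12, 11, 100, 110, 105, 300, 310, 295),
  X2020.21 = c(9e6, 1e6, 2e7, 3e6, 2.5e7, 1.5e6, 8e6, 3e7, 2e6)
)

# Drop the salary column so it cannot drive the clusters.
features <- toy[, names(toy) != "X2020.21"]

# Standardize each column to mean 0 and standard deviation 1.
features_scaled <- scale(features)

# Cluster on the standardized performance variables only.
set.seed(1)
kmeans_toy <- kmeans(features_scaled, centers = 3,
                     algorithm = "Lloyd", nstart = 25)

# Cluster sizes; salary can now be compared across clusters afterwards.
print(table(kmeans_toy$cluster))
```

Because the cluster numbers returned by `kmeans()` are arbitrary, it is worth checking the cluster centers before attaching labels such as "High performing" to a given number.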
# Subset the players we want: more than 500 points, more than 175 assists,
# and a salary under $10,000,000.
playerswewant <- merged[(merged$PTS > 500) &
                        (merged$AST > 175) &
                        (merged$X2020.21 < 10000000), ]
# Tell R to read the cluster labels as factors so that ggplot2
# (the graphing package) can read them as category labels instead of
# continuous variables (numeric variables).
clusters_nba <- as.factor(kmeans_obj_nba$cluster)
labels_nba <- merged$Player
# Visualize the output
library(ggplot2)
ggplot(merged, aes(x = AST,
                   y = PTS,
                   shape = clusters_nba,
                   color = X2020.21)) +
  geom_point(size = 4) +
  geom_text(aes(label = ifelse(PTS > 500 & AST > 175 & X2020.21 < 10000000, Player, '')),
            hjust = 0, vjust = 0) +
  ggtitle("AST & PTS vs. Salary Clustering") +
  xlab("Assists") +
  ylab("Points") +
  scale_color_gradient(low = "red", high = "blue") +
  scale_shape_manual(name = "Cluster",
                     labels = c("High performing", "Mid performance level", "Lower performing"),
                     values = c("1", "2", "3")) +
  theme_light()
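The assignment asks for the central graphic to be wrapped in a function that takes inputs. One possible shape for that is sketched below. The function name `plot_value_players`, its parameters, and the toy data are hypothetical; the column names `Player`, `AST`, `PTS`, and `X2020.21` are assumed to match `merged`.

```r
library(ggplot2)

# Hypothetical helper: draws the AST vs. PTS scatter and labels players
# who clear the performance thresholds while earning under max_salary.
plot_value_players <- function(data, clusters,
                               min_pts = 500, min_ast = 175,
                               max_salary = 10000000) {
  # Treat cluster ids as categories, not numbers.
  data$cluster <- as.factor(clusters)
  # Only label the players who meet all three criteria.
  data$label <- ifelse(data$PTS > min_pts & data$AST > min_ast &
                         data$X2020.21 < max_salary,
                       as.character(data$Player), "")
  ggplot(data, aes(x = AST, y = PTS, shape = cluster, color = X2020.21)) +
    geom_point(size = 4) +
    geom_text(aes(label = label), hjust = 0, vjust = 0) +
    labs(title = "AST & PTS vs. Salary Clustering",
         x = "Assists", y = "Points", shape = "Cluster") +
    scale_color_gradient(low = "red", high = "blue") +
    theme_light()
}

# Toy usage with made-up players and cluster ids.
toy <- data.frame(
  Player   = c("P1", "P2", "P3"),
  AST      = c(200, 50, 300),
  PTS      = c(600, 100, 900),
  X2020.21 = c(5e6, 1e6, 3e7)
)
p <- plot_value_players(toy, clusters = c(1, 2, 1))
```

With the real data, a call such as `plot_value_players(merged, kmeans_obj_nba$cluster, max_salary = 5000000)` would redraw the same chart under a tighter budget.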
# List of Players we want
playerswewant$Player
## [1] "Bam Adebayo" "De'Aaron Fox"
## [3] "Donovan Mitchell" "LaMelo Ball"
## [5] "Luka Dončić" "Shai Gilgeous-Alexander"
## [7] "Trae Young"
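The same subsetting rule can be wrapped in a small function so the thresholds are easy to vary when presenting. This is only a sketch: `find_value_players` and its parameter names are hypothetical, and the toy data is made up.

```r
# Hypothetical helper: returns players above the performance thresholds
# and below the salary ceiling, cheapest first.
find_value_players <- function(data, min_pts = 500, min_ast = 175,
                               max_salary = 10000000) {
  keep <- data$PTS > min_pts & data$AST > min_ast &
    data$X2020.21 < max_salary
  out <- data[keep, c("Player", "PTS", "AST", "X2020.21")]
  out[order(out$X2020.21), ]
}

# Toy usage with made-up players.
toy <- data.frame(
  Player   = c("P1", "P2", "P3", "P4"),
  PTS      = c(600, 100, 900, 700),
  AST      = c(200, 50, 300, 180),
  X2020.21 = c(5e6, 1e6, 3e7, 2e6)
)

default_targets <- find_value_players(toy)
bigger_budget   <- find_value_players(toy, max_salary = 4e7)

print(default_targets)
print(bigger_budget)
```

With the default thresholds, P4 and P1 qualify. Raising `max_salary` to 40 million also admits P3, which shows how sensitive the shortlist is to the salary cutoff.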
First, we determined which performance variables are most correlated with Salary and came up with Assists and Points. To determine which players to recommend that our GM try to acquire, we subset players with a salary under $10,000,000, more than 500 Points, and more than 175 Assists. In the clustering plot, these are the players in the upper right quadrant (high performance stats) who are bright red (low salary). This resulted in: Donovan Mitchell, De’Aaron Fox, LaMelo Ball, Trae Young, Bam Adebayo, Luka Dončić, and Shai Gilgeous-Alexander. These players are good value and are good trade targets for the Wizards GM.
Drawbacks to this method include the fact that we are unable to incorporate categorical variables into this type of analysis, even though categorical variables like position can be relevant to decision-making here. Similarly, we can’t reasonably assume that underpaid players will be willing to switch teams and continue to be paid the same salary.