In a separate Rmarkdown document work through a similar process with the NBA data (nba2020-21 and nba_salaries_21), merge them together.

You are a scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the teams problems!!! She wants you to figure out a way to find players that are high performing but maybe not highly paid that you can steal to get the team to the playoffs!

Details:

Determine a way to use clustering to estimate based on performance if players are under or over paid, generally.
Then select three players you believe would be best your team and explain why.
Provide a well commented and clean (knitted) report of your findings that can be presented to your GM. Include a rationale for variable selection, details on your approach and a overview of the results with supporting visualizations.

Hints:

Salary is the variable you are trying to understand (don’t put it in clustering) —> put salary as a gradient label (like rep/dem in the last one)
You can include numerous performance variables in the clustering but when interpreting you might want to use graphs that include variables that are the most correlated with Salary
You’ll need to standardize the variables before performing the clustering
Be specific about why you selected the players that you did, more detail is better
Use good coding practices, comment heavily, indent, don’t use for loops unless totally necessary and create modular sections that align with some outcome. If necessary create more than one script, list/load libraries at the top and don’t include libraries that aren’t used.

Standardize the variables before clustering

Here I will use a normalizing function to standardize all of the numeric variables that I will use before clustering

# use normalization function

nba_new <- names(select_if(new, is.numeric))# select function to find the numeric variables and create a character string  
nba_new

##  [1] "Age"      "G"        "GS"       "MP"       "FG"       "FGA"     
##  [7] "FG."      "X3P"      "X3PA"     "X3P."     "X2P"      "X2PA"    
## [13] "X2P."     "eFG."     "FT"       "FTA"      "FT."      "ORB"     
## [19] "DRB"      "TRB"      "AST"      "STL"      "BLK"      "TOV"     
## [25] "PF"       "PTS"      "X2020.21"

normalize <- function(x){
  # x is a numeric vector because the functions min and max require
  #numeric inputs
 (x - min(x)) / (max(x) - min(x))#numerator subtracts the minimum value of x from the entire column, denominator essentially calculates the range of x 
}

new[nba_new] <- lapply(new[nba_new], normalize)#use lapply again with the normalizer function we created. 

head(new)

##            Player Pos       Age  Tm         G        GS        MP         FG
## 1    Aaron Gordon  PF 0.3333333 ORL 0.4722222 0.5135135 0.4038462 0.23809524
## 2   Aaron Holiday  PG 0.2777778 IND 0.9166667 0.1621622 0.4889053 0.24867725
## 3   Aaron Nesmith  SF 0.1111111 BOS 0.4444444 0.0000000 0.2034024 0.06349206
## 4     Abdel Nader  SF 0.4444444 PHO 0.4722222 0.0000000 0.1974852 0.11375661
## 6 Al-Farouq Aminu  PF 0.6111111 ORL 0.1944444 0.1621622 0.1161243 0.04232804
## 7      Al Horford   C 0.8333333 OKC 0.6111111 0.6486486 0.4963018 0.36243386
##          FGA       FG.        X3P       X3PA  X3P.        X2P       X2PA
## 1 0.27595269 0.4588045 0.18343195 0.20243902 0.369 0.18237082 0.23475046
## 2 0.33245729 0.3683360 0.22485207 0.26097561 0.352 0.17325228 0.26987061
## 3 0.07884363 0.4103393 0.10059172 0.11463415 0.354 0.02431611 0.02402957
## 4 0.11432326 0.5589661 0.07692308 0.07804878 0.394 0.09422492 0.10166359
## 6 0.05256242 0.4071082 0.03550296 0.03414634 0.400 0.03343465 0.04805915
## 7 0.40604468 0.4830372 0.27810651 0.31951220 0.356 0.27659574 0.32902033
##        X2P.      eFG.         FT        FTA   FT.        ORB        DRB
## 1 0.6062581 0.5506003 0.16442953 0.21606648 0.613 0.24812030 0.29131653
## 2 0.5019557 0.4562607 0.09731544 0.10526316 0.725 0.05263158 0.10924370
## 3 0.6949153 0.6054889 0.02684564 0.02770083 0.667 0.07518797 0.09523810
## 4 0.7092568 0.6552316 0.06711409 0.06371191 0.800 0.02255639 0.12044818
## 6 0.5123859 0.4905660 0.02348993 0.01939058 0.778 0.06766917 0.07843137
## 7 0.6597132 0.5814751 0.04697987 0.04432133 0.778 0.17293233 0.38375350
##          TRB        AST        STL        BLK        TOV         PF        PTS
## 1 0.28723404 0.23188406 0.22222222 0.14414414 0.35570470 0.29921260 0.24666667
## 2 0.09361702 0.18260870 0.31746032 0.03603604 0.19463087 0.48818898 0.24190476
## 3 0.08936170 0.02028986 0.04761905 0.04504505 0.07382550 0.29133858 0.06857143
## 4 0.09361702 0.04347826 0.09523810 0.06306306 0.08724832 0.18897638 0.11238095
## 6 0.07446809 0.04347826 0.12698413 0.04504505 0.06711409 0.07086614 0.04190476
## 7 0.33617021 0.24347826 0.33333333 0.18918919 0.18120805 0.33858268 0.31809524
##     X2020.21
## 1 0.41965103
## 2 0.05116968
## 3 0.07713627
## 4 0.03733908
## 6 0.22327362
## 7 0.63815432

Select the variables to be included in the cluster

In order to first select variables to be included in the cluster, I will look at which variables are most correlated with salary and which I associate with performance. To do so, I will use the pairs.panels function in the psych package.

Which variables are correlated with salary?

columns <- c("X2020.21", "Age", "G", "GS", "MP", "FG", "FGA", "FG.", "X3P", "X3PA", "X3P.")
columns2 <- c("X2020.21", "X2P", "X2PA", "X2P.", "eFG.", "FT", "FTA", "FT.", "ORB")
columns3 <- c("X2020.21", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS")

numeric <- new[,columns]
numeric2 <- new[, columns2]
numeric3 <- new[,columns3]

pairs.panels(numeric)

pairs.panels(numeric2)

pairs.panels(numeric3)

clust_data_nba = new[, c("GS", "MP", "FG", "X3P", "X2P", "FT", "PTS", "AST")]

Variables that have a correlation around .50 and more with salary: GS (games started), MP (minutes played), FG (field goals), FGA (field goals attempted), X3P (3 point shots made), X3PA (3 point shot attempts), X2P (2 point shots made), X2PA (2 point shots attempted), FT (free throws), FTA (free throws attempted), PTS (points), AST (assists), TOV (turnovers)
Variables that I believe are associated with performance: GS (corr = 0.54), MP (corr = 0.46), FG (corr = 0.58), X3P (corr = 0.43), X2P (corr = 0.53), FT (corr = 0.57), PTS (corr = 0.59), AST (corr = 0.59)

–> The ones I selected were: Games Started (I imagine that if you start more games, you are a high performing player and a reliable one) and Points (We want a high performing player so they need to have a lot of points), and later Assists (Assists show a team player, who is still contributing to the team’s number of points)

Clustering

Run the clustering algorithm with 2 centers

set.seed(1)
kmeans_nba = kmeans(clust_data_nba, centers = 2, 
                        algorithm = "Lloyd")   #<- there are several ways of implementing k-means, see the help menu for a full list

# What did the kmeans function produce, 
# what does the new variable kmeans_obj contain?
head(kmeans_nba)

## $cluster
##   1   2   3   4   6   7   8  10  11  12  13  14  15  16  17  19  20  21  22  23 
##   2   2   2   2   2   1   2   2   2   2   2   2   2   2   2   1   2   2   1   2 
##  24  25  26  28  30  31  32  33  34  35  36  38  41  42  43  44  45  46  49  50 
##   1   1   2   1   2   2   1   2   1   2   2   2   2   2   1   1   2   1   1   1 
##  51  52  53  54  55  56  57  58  59  60  61  64  65  68  70  71  72  73  74  75 
##   2   2   2   1   2   1   2   2   2   2   2   1   2   2   1   2   1   2   2   2 
##  77  78  79  80  81  82  84  85  86  87  89  90  92  93  94  96  97  98  99 100 
##   1   2   2   2   1   2   1   2   2   2   2   2   1   2   2   2   1   2   1   2 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 116 117 118 119 120 121 
##   2   2   2   1   1   2   2   1   2   2   2   2   1   1   1   1   1   2   2   1 
## 122 123 124 125 126 127 128 129 130 131 132 133 134 137 138 139 140 141 142 143 
##   2   2   2   2   2   1   2   2   2   2   2   1   2   1   1   1   1   1   1   1 
## 144 146 147 148 149 150 151 153 154 155 156 158 159 160 161 162 164 166 167 168 
##   1   1   1   2   2   2   2   2   1   1   2   1   1   2   1   2   2   2   1   2 
## 170 172 173 174 175 176 177 178 179 180 181 182 183 184 186 187 188 189 192 193 
##   2   2   2   1   2   2   1   2   2   2   1   2   2   2   2   1   2   2   2   2 
## 194 196 197 198 199 200 201 202 204 205 206 207 209 210 212 214 215 216 217 218 
##   1   2   2   2   2   2   2   1   2   1   1   2   2   2   2   2   2   2   1   2 
## 219 220 221 222 223 224 227 230 231 232 233 234 237 238 239 240 241 242 243 244 
##   1   1   2   2   2   2   2   2   1   2   2   2   1   2   1   1   2   1   2   2 
## 245 246 247 248 249 250 251 252 253 256 258 259 260 262 263 264 265 266 267 269 
##   2   1   1   1   1   1   2   1   1   1   2   2   2   2   2   2   1   1   2   1 
## 270 271 272 274 278 279 281 282 284 285 286 287 288 289 290 291 292 293 294 295 
##   1   2   2   2   2   1   2   1   1   1   1   1   2   2   1   2   1   1   2   2 
## 296 297 298 299 301 303 304 306 307 308 309 311 312 313 314 315 316 317 318 319 
##   2   2   1   2   2   1   1   1   1   1   2   2   1   2   2   2   2   1   1   1 
## 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 337 338 340 341 342 
##   2   1   1   2   2   2   2   1   1   2   1   2   2   2   2   2   1   1   2   2 
## 343 345 347 348 349 350 351 352 353 354 356 357 358 359 360 362 363 365 368 369 
##   2   2   2   2   1   1   1   2   2   2   2   2   1   1   2   2   1   2   1   2 
## 370 372 374 375 376 377 378 379 381 385 386 387 388 389 390 391 392 393 395 396 
##   2   2   1   1   2   2   1   1   1   2   1   2   2   1   1   1   2   1   2   2 
## 397 398 399 400 403 406 407 409 410 411 413 414 416 417 418 419 420 421 422 423 
##   1   2   1   1   2   2   2   2   2   2   1   1   1   1   1   1   2   2   2   2 
## 424 426 427 428 429 430 431 432 433 435 436 438 439 440 441 442 443 444 445 447 
##   2   2   2   1   2   1   1   1   2   1   2   2   2   1   1   1   2   2   2   2 
## 448 449 450 451 452 453 454 455 457 458 459 460 461 462 463 464 466 467 468 469 
##   2   2   1   2   1   2   2   2   2   2   2   2   2   2   2   2   1   1   2   2 
## 470 472 473 475 476 477 478 479 481 482 483 486 487 488 489 494 495 496 497 498 
##   2   2   2   1   2   1   2   2   2   1   2   2   2   1   2   1   2   2   2   2 
## 499 501 502 503 504 506 507 508 509 511 512 513 514 515 516 519 520 521 
##   2   2   1   2   2   2   1   1   1   2   2   1   2   2   2   1   2   1 
## 
## $centers
##          GS        MP        FG       X3P       X2P         FT       PTS
## 1 0.7452485 0.7024766 0.4799795 0.3246039 0.3877635 0.27111929 0.4738249
## 2 0.1066694 0.2848392 0.1468707 0.1153959 0.1125082 0.07126088 0.1435923
##          AST
## 1 0.32398317
## 2 0.09867196
## 
## $totss
## [1] 172.5632
## 
## $withinss
## [1] 46.05278 27.77124
## 
## $tot.withinss
## [1] 73.82402
## 
## $betweenss
## [1] 98.7392

kmeans_nba$betweenss/kmeans_nba$totss

## [1] 0.5721915

We want the Between SS / Total SS to be above 50% (between_SS / total_SS = 57.2%) –> not bad and is above 50

Visualize the output

#Visualize the output

# Tell R to read the cluster labels as factors so that ggplot2 
# (the graphing package) can read them as category labels instead of 
# continuous variables (numeric variables).

#kmeans_nba
clusters_nba = as.factor(kmeans_nba$cluster)

# What does the kmeans_obj look like?
View(clusters_nba)

ggplot(new, aes(x = GS, y = PTS, shape = clusters_nba)) + geom_point(size = 6) + ggtitle("Normalized GS vs PTS for NBA Players") + xlab("Normalized Number of Games Started") + ylab("Normalized Number of Points Scored") + scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) + theme_light()

This plots shows us several things:

1.) more spread out with the 1’s – looks like more points and games started

2.) the 2’s are tighter at the bottom with less games started and less points

3.) However, there are a good amount of 1’s that start a lot of games but score less points than those in cluster 2

Elbow Graph

I will use the elbow graph to determine if 2 clusters was the best option

#Use the function we created to evaluate several different number of clusters

# The function explained_variance wraps our code for calculating 
# the variance explained by clustering.
explained_variance = function(data_in, k){
  
  # Running the kmeans algorithm.
  set.seed(1)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
  
  # Variance accounted for by clusters:
  # var_exp = intercluster variance / total variance
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp  
}

# Recall the variable we are using for the data that we're clustering.

# The sapply() function plugs in several values into our explained_variance function.
#sapply() takes a vector, lapply() takes a dataframe
explained_var = sapply(1:10, explained_variance, data_in = clust_data_nba)

explained_var

##  [1] -6.258720e-15  5.721915e-01  7.047754e-01  7.752444e-01  8.003217e-01
##  [6]  8.224254e-01  8.388047e-01  8.519040e-01  8.566907e-01  8.627876e-01

# Data for ggplot2.
elbow_data = data.frame(k = 1:10, explained_var)
#View(elbow_data_Rep)

# Plotting data.
ggplot(elbow_data, 
       aes(x = k,  
           y = explained_var)) + 
  geom_point(size = 4) +           #<- sets the size of the data points
  geom_line(size = 1) +            #<- sets the thickness of the line
  xlab('k') + 
  ylab('Inter-cluster Variance / Total Variance') + 
  theme_light()

I would choose 2 – it’s right at the elbow point, and it has the lowest inter-cluster variance inside the cluster (lowest point not at 0)

Visualization with Salary Gradient

Next, I will make a similar plot to the normalized GS vs PTS plot before but this time I will add a gradient for salary

ggplot(new, aes(x = GS, y = PTS, color = X2020.21,  #<- tell R how to color the data points
                            shape = clusters_nba)) + geom_point(size = 6) +   ggtitle("Normalized GS vs PTS for NBA Players") + xlab("Normalized Number of Games Started") + ylab("Normalized Number of Points Scored") + scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) + scale_color_continuous(name = "Salary") +  theme_light()

# want the dark blue 1s -- high performance but low salary

So, this plot shows us what type of players I want to look closer at. The players we would want for our team would be the darker 1’s in the top right corner, these players start a lot of games, score a lot of points, and have a low salary.

3D Plot to get the Names of Players

Making the 3D plot will allow me to see which players correspond to which, and it will allow me to add another variable of interest: number of assists (AST)

fig <- plot_ly(new, 
               type = "scatter3d",
               mode="markers",
               symbol = ~clusters_nba,
               x = ~GS, 
               y = ~PTS, 
               z = ~AST,
               color = ~X2020.21,
               colors = "Blues", 
               text = ~paste('Player: ', Player,
                             "Salary: ", X2020.21)
               )
fig

Choosing NBA Players

We want to choose players that have a high performance and a low salary. For this analysis, I defined high performance as being someone who starts a lot of games (GS), has a lot of points (PTS), and assists (AST). In the 3D model, the players we are looking for have a light blue color and are located in the top right of the model.

Therefore, the 3 Players I would Select are:

1.) Trae Young (0.919 for GS, 0.851 for PTS, 0.930 for AST, 0.150 for Salary)

2.) Luka Dončić (0.865 for GS, 0.870 for PTS, 0.832 for AST, 0.184 for Salary)

3.) De’Aaron Fox (0.946 for GS, 0.765 for PTS, 0.768 for AST, 0.185 for Salary)

Clustering Lab – NBA Salaries Part

Andrea Parrish

04/07/2022