#NBA Clustering Exercise Details:
clust_data_nba = nbastats1[, c("PTS","AST","salary")]
set.seed(50) #ensuring reproducibility
kmeansObjNBA <- kmeans(clust_data_nba, centers = 3)
kmeansObjNBA
## K-means clustering with 3 clusters of sizes 271, 89, 49
##
## Cluster means:
## PTS AST salary
## 1 204.0627 44.16605 3368922
## 2 366.8764 78.44944 13787278
## 3 584.3878 143.24490 31761019
##
## Clustering vector:
## [1] 2 1 1 1 3 1 1 1 1 1 2 1 3 2 1 3 1 3 2 1 1 1 1 1 1 3 1 3 1 1 1 3 1 1 3 2 1
## [38] 1 1 1 3 1 1 1 1 2 1 1 2 1 1 3 1 2 1 1 3 1 1 2 1 1 2 1 1 2 1 1 3 1 1 3 1 1
## [75] 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 3 1 1 2 1 1 1 1 2 2 1 1 1 1 1 3 1 1
## [112] 2 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 2 1 1 1 2 1 2 1 3 1 1 2 3 2
## [149] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 3 1 3 3 3 2 2 1 1 1 1
## [186] 1 1 1 3 1 2 1 1 2 2 1 1 3 2 2 3 1 1 3 2 2 1 1 1 1 1 1 2 3 1 2 1 1 1 2 3 3
## [223] 1 1 2 2 3 1 1 1 2 1 3 1 1 3 1 1 3 1 1 3 2 1 3 3 1 3 1 1 1 2 1 3 1 2 1 1 1
## [260] 1 1 1 1 2 2 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 2 1 1 2 1 1 1
## [297] 1 1 3 1 1 3 3 2 1 1 1 3 1 1 3 1 2 1 1 2 3 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
## [334] 1 2 1 1 2 3 1 3 1 1 1 1 1 2 1 1 1 1 1 1 2 1 3 1 3 1 1 2 1 1 2 2 2 1 1 2 2
## [371] 2 1 1 1 1 2 1 3 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 2
## [408] 1 2
##
## Within cluster sum of squares by cluster:
## [1] 1.137900e+15 1.273720e+15 1.321829e+15
## (between_SS / total_SS = 90.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
NBAclust = as.factor(kmeansObjNBA$cluster)
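In the output above, the within-cluster sums of squares are on the order of 1e15, which suggests that the Euclidean distances are driven almost entirely by salary (measured in dollars) rather than by PTS or AST (measured in the hundreds). The following is a minimal sketch of standardizing the columns with `scale()` before clustering. The `toy` data frame is synthetic and only stands in for `nbastats1`; `nstart = 25` is likewise an assumption, not a setting used in this lab.

```r
# Synthetic stand-in for nbastats1 (made-up values, for illustration only)
set.seed(50)
n <- 200
toy <- data.frame(
  PTS    = runif(n, 0, 800),
  AST    = runif(n, 0, 200),
  salary = runif(n, 5e5, 4e7)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so that no single column dominates the distance calculation
toy_scaled <- scale(toy)

# Fit k-means on the raw and on the standardized columns
km_raw    <- kmeans(toy, centers = 3, nstart = 25)
km_scaled <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Cross-tabulate the two assignments to see how much the partition changes
table(raw = km_raw$cluster, scaled = km_scaled$cluster)
```

On the real data, the same comparison could be made by passing `scale(clust_data_nba)` to `kmeans()`. The cluster centers would then be in standard-deviation units rather than in points, assists, and dollars.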
##FG Percentage vs. 3P%
#KMeans with Standardized PTS & AST
#Points per minute / Assist per minute
The core question of this lab was to determine the relationship between player performance and salary, and to identify instances of potential value arbitrage for a free-agent signing or trade.
We feel that our correlation calculations and early k-means models with other variables made it clear that points and assists, in standardized form, are the most telling variables for a player's marginal production. Given that this graphic is in the context of player salary, any outliers beyond the clusters are players whose performance is distinct relative to their salary. The list above is just a handful of those instances.
However, there are clear risks with this model and with implementing the insights that come from it. Not only does this model not consider a defensive metric in its evaluation, but it also does not consider external conditions and circumstances that could contribute to offensive performance, such as intangibles like team chemistry or fluidity, or tangible realities like the performance averages of a player's teammates or the market value of his jersey. Next steps should include a detailed review of the data to ensure there are no errors, as well as an improved performance metric to compare against player salary.
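One possible way to make "outliers beyond the clusters" concrete is to measure how far each player sits from the center of the cluster they were assigned to. The sketch below is an illustration under stated assumptions, not the method used to build the list in this lab: the `toy` data frame is synthetic, the columns are standardized first, and `nstart = 25` is an arbitrary choice.

```r
# Synthetic stand-in for the player data (made-up values, for illustration only)
set.seed(50)
toy <- data.frame(
  PTS    = runif(150, 0, 800),
  AST    = runif(150, 0, 200),
  salary = runif(150, 5e5, 4e7)
)
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Look up the center assigned to each row, then take the Euclidean
# distance from the row to that center
assigned_centers <- km$centers[km$cluster, , drop = FALSE]
dist_to_center <- sqrt(rowSums((toy_scaled - assigned_centers)^2))

# Rows farthest from their own center are candidates for a closer look
toy$cluster <- km$cluster
toy$dist_to_center <- dist_to_center
head(toy[order(-toy$dist_to_center), ], 5)
```

Distance alone does not say whether a player is cheap for their production or expensive for it, so each flagged row would still need to be inspected against its cluster's salary and performance levels.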
explained_variance = function(data_in, k){
  # Fix the seed so every value of k starts from reproducible initial centers
  set.seed(50)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 50)
  # Share of variance accounted for by the clusters: the between-cluster
  # sum of squares divided by the total sum of squares
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp
}
explained_var_NBA = sapply(1:10, explained_variance, data_in = clust_data_nba)
elbow_data_NBA = data.frame(k = 1:10, explained_var_NBA)
# Plotting data.
library(ggplot2)
nbaelbow <- ggplot(elbow_data_NBA,
                   aes(x = k,
                       y = explained_var_NBA)) +
  geom_point(size = 3) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(linewidth = 1) +
  xlab('k') +
  ylab('Inter-cluster Variance / Total Variance') +
  theme_light()
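An elbow plot like the one above is usually read by looking for the value of k after which adding another cluster buys little extra explained variance. As a rough numerical companion to the visual check, the gain from each added cluster can be computed with `diff()`. This sketch uses synthetic data with three deliberately well-separated groups, and `nstart = 25` is an assumption rather than a setting from this lab.

```r
set.seed(50)
# Synthetic data with three well-separated groups (made-up, for illustration only)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Explained-variance ratio for k = 1, ..., 8
explained <- sapply(1:8, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  km$betweenss / km$totss
})

# Gain in explained variance from adding the k-th cluster (k = 2, ..., 8)
gain <- diff(explained)
data.frame(k = 2:8, gain = round(gain, 3))
```

Applied to this lab, `diff(explained_var_NBA)` would give the same kind of table for the NBA data. Where the gains flatten out is a judgment call, so the table supports the plot rather than replacing it.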