Hockey analytics has created a large array of statistics to measure player performance. How to properly weigh each of these to create a comprehensive overview of each player is not as widely discussed. Some, including myself, have tried to smartly roll them up into a single metric. However, possibly more appropriately, they can also be used to perform a clustering analysis, identifying player types.
To do this, it helps to first reduce the number of features we have for each player. This is best done with Principal Components Analysis (PCA), which boils down all metrics into a few by creating a ‘Principal Component’ based on finding the maximum variability in the dataset. Then finding the second dimension of maximum variability, and so on.
PCA aims captures as much variation in data as possible
It also discards useless information. Imagine our dataset had a metric that was ‘Good in the Room / 60,’ but since we didn’t know anything about what happens in the room, every player is rated as a 10. This metric would offer zero variation and would be discarded. Scoring rates, on the other hand, vary greatly between players and would help make up the first principal component, calculated to have the largest possible variance.
The smaller dataset can then be used to perform K-means clustering, identifying ‘k’ different player types based on the principal components for each player-season.
Note: This analysis is meant to combine some working knowledge of R (dplyr, subsetting), statistical tools (scaling), and hockey analytics. I try to comment code and explain as best I can, but am happy to handle any additional questions at cole92anderson@gmail.com or @crowdscoutsprts on twitter.
To re-create this analysis, download the data here.
### Load Data from downloaded folder
crowdscout_data_pred <- read.csv("~/Downloads/crowdscout_data_download.csv", header = TRUE)
### Look at column names (ignore that this analysis has already been done)
colnames(crowdscout_data_pred)
## [1] "Player" "shooterID"
## [3] "season" "Cluster"
## [5] "ClusterName" "Pos.Rank"
## [7] "DepthChart" "Predicted.CS"
## [9] "Predicted.CS.RF.Scaled" "Predicted.CS.LM.Scaled"
## [11] "Pos" "Games.Played_EV"
## [13] "Games.Played_PP" "Games.Played_SH"
## [15] "TOI_EV" "TOI_PP"
## [17] "TOI_SH" "Total.Shifts_EV"
## [19] "Total.Shifts_PP" "Total.Shifts_SH"
## [21] "OTF.Shift.Share_EV" "OTF.Shift.Share_PP"
## [23] "OTF.Shift.Share_SH" "Off.FO.Shift.Share_EV"
## [25] "Off.FO.Shift.Share_PP" "Off.FO.Shift.Share_SH"
## [27] "Def.FO.Shift.Share_EV" "Def.FO.Shift.Share_PP"
## [29] "Def.FO.Shift.Share_SH" "Neu.FO.Shift.Share_EV"
## [31] "Neu.FO.Shift.Share_PP" "Neu.FO.Shift.Share_SH"
## [33] "ixG60_EV" "ixG60_PP"
## [35] "ixG60_SH" "G60_EV"
## [37] "G60_PP" "G60_SH"
## [39] "A160_EV" "A160_PP"
## [41] "A160_SH" "A260_EV"
## [43] "A260_PP" "A260_SH"
## [45] "xGF60_EV" "xGF60_PP"
## [47] "xGF60_SH" "xGA60_EV"
## [49] "xGA60_PP" "xGA60_SH"
## [51] "team.time.wo_EV" "team.time.wo_PP"
## [53] "team.time.wo_SH" "xGF60_teamWO_EV"
## [55] "xGF60_teamWO_PP" "xGF60_teamWO_SH"
## [57] "xGA60_teamWO_EV" "xGA60_teamWO_PP"
## [59] "xGA60_teamWO_SH" "G60_Away_EV"
## [61] "G60_Away_PP" "G60_Away_SH"
## [63] "G60_Home_EV" "G60_Home_PP"
## [65] "G60_Home_SH" "ixG60_Away_EV"
## [67] "ixG60_Away_PP" "ixG60_Away_SH"
## [69] "ixG60_Home_EV" "ixG60_Home_PP"
## [71] "ixG60_Home_SH" "P160_Away_EV"
## [73] "P160_Away_PP" "P160_Away_SH"
## [75] "P160_Home_EV" "P160_Home_PP"
## [77] "P160_Home_SH" "P260_Away_EV"
## [79] "P260_Away_PP" "P260_Away_SH"
## [81] "P260_Home_EV" "P260_Home_PP"
## [83] "P260_Home_SH" "xGA60_Away_EV"
## [85] "xGA60_Away_PP" "xGA60_Away_SH"
## [87] "xGA60_Home_EV" "xGA60_Home_PP"
## [89] "xGA60_Home_SH" "xGF60_Away_EV"
## [91] "xGF60_Away_PP" "xGF60_Away_SH"
## [93] "xGF60_Home_EV" "xGF60_Home_PP"
## [95] "xGF60_Home_SH" "Player_Competition_EV"
## [97] "Player_Competition_PP" "Player_Competition_SH"
## [99] "Player_Teammates_EV" "Player_Teammates_PP"
## [101] "Player_Teammates_SH" "TeamWO_Competition_EV"
## [103] "TeamWO_Competition_PP" "TeamWO_Competition_SH"
## [105] "TeamWO_Teammates_EV" "TeamWO_Teammates_PP"
## [107] "TeamWO_Teammates_SH" "Player_Competition_Away_EV"
## [109] "Player_Competition_Away_PP" "Player_Competition_Away_SH"
## [111] "Player_Competition_Home_EV" "Player_Competition_Home_PP"
## [113] "Player_Competition_Home_SH" "Player_Teammates_Away_EV"
## [115] "Player_Teammates_Away_PP" "Player_Teammates_Away_SH"
## [117] "Player_Teammates_Home_EV" "Player_Teammates_Home_PP"
## [119] "Player_Teammates_Home_SH" "TeamWO_Competition_Away_EV"
## [121] "TeamWO_Competition_Away_PP" "TeamWO_Competition_Away_SH"
## [123] "TeamWO_Competition_Home_EV" "TeamWO_Competition_Home_PP"
## [125] "TeamWO_Competition_Home_SH" "TeamWO_Teammates_Away_EV"
## [127] "TeamWO_Teammates_Away_PP" "TeamWO_Teammates_Away_SH"
## [129] "TeamWO_Teammates_Home_EV" "TeamWO_Teammates_Home_PP"
## [131] "TeamWO_Teammates_Home_SH" "Share.of.Ice_EV"
## [133] "Share.of.Ice_PP" "Share.of.Ice_SH"
## [135] "xGF60_Rel_EV" "xGF60_Rel_PP"
## [137] "xGF60_Rel_SH" "xGA60_Rel_EV"
## [139] "xGA60_Rel_PP" "xGA60_Rel_SH"
## [141] "xGD60_EV" "xGD60_PP"
## [143] "xGD60_SH" "xGD60_teamWO_EV"
## [145] "xGD60_teamWO_PP" "xGD60_teamWO_SH"
## [147] "P160_EV" "P160_PP"
## [149] "P160_SH" "P60_EV"
## [151] "P60_PP" "P60_SH"
## [153] "Teammates_Diff_EV" "Teammates_Diff_PP"
## [155] "max.RF" "min.RF"
## [157] "max.LM" "min.LM"
## [159] "Player.Position" "Shoots"
## [161] "shooterDOB" "TOI"
## [163] "G60" "P60"
### Create position variables and
crowdscout_data_pred_pca <- crowdscout_data_pred %>%
mutate(D = ifelse(Player.Position == "D",1,0),
C = ifelse(Player.Position == "C",1,0),
W = ifelse(!Player.Position %in% c("C","D"),1,0),
Shooting_PP = G60_PP - ixG60_PP,
Shooting_EV = G60_EV - ixG60_EV,
Shooting_SH = G60_SH - ixG60_SH)
### Identify variables to scale and performance PCA on.
### Feel free to add other metrics based on columns available, or remove based on preference
pca_vars <- c("Total.Shifts_EV","Total.Shifts_PP","Total.Shifts_SH","OTF.Shift.Share_EV",
"OTF.Shift.Share_PP","OTF.Shift.Share_SH","Off.FO.Shift.Share_EV",
"Off.FO.Shift.Share_PP","Off.FO.Shift.Share_SH",
"Def.FO.Shift.Share_EV","Def.FO.Shift.Share_PP",
"Def.FO.Shift.Share_SH","ixG60_EV","ixG60_PP","ixG60_SH","G60_EV","G60_PP",
"G60_SH","A160_EV","A160_PP","A160_SH","xGF60_EV","xGF60_PP",
"xGF60_SH","xGA60_EV","xGA60_PP","xGA60_SH","Player_Competition_EV",
"Player_Teammates_EV","Player_Teammates_PP","Share.of.Ice_EV",
"Share.of.Ice_PP","Share.of.Ice_SH","xGF60_Rel_EV","xGF60_Rel_PP",
"xGF60_Rel_SH","xGA60_Rel_EV","xGA60_Rel_PP","xGA60_Rel_SH","P60_EV","P60_PP",
"P60_SH","Teammates_Diff_EV","Teammates_Diff_PP","D","C","W",
"Shooting_PP","Shooting_EV","Shooting_SH")
To factor in changes in scoring or any other metric over the course of the last decade, each season will be scaled separately.
# Create Function to Scale Each Season
scale_season_function <- function(year) {
### Filter data to year
season_data <- crowdscout_data_pred_pca %>%
filter(season == year)
### Scale variables to be used in analysis for that season
scaled_season <- scale(season_data[,pca_vars])
### Place scaled player level metrics next to player identifiers
season_scaled <- as.data.frame(cbind(season_data[,c("Player","shooterID","season","Pos","Predicted.CS")],scaled_season))
return(season_scaled)
}
## Find Unique Seasons in data
seasons <- unique(crowdscout_data_pred_pca$season)
## Scale Each Season Separately, Stack Output into master PCA dataset
pca_data <- plyr::rbind.fill(lapply(FUN=scale_season_function,seasons))
## Replace NAs with 0
pca_data[is.na(pca_data[,pca_vars]),pca_vars] <- 0
## Create unique player season identifier
id_vector <- as.vector(paste0(pca_data$Player,"-",substr(pca_data$season,7,8),"-",pca_data$shooterID))
## Label rows as player ID
rownames(pca_data) <- id_vector
## Verify variance is uniform
plot(sapply(pca_data[,pca_vars], var))
PCA Analysis is normally performed to reduce the size of data without losing information. We can reduce player metrics down in dimension by using the nFactors package. Ideally, we would try to explain a lot more of the variance, but there often trade offs involved in choosing the number of principal components (PCs) to keep. There are graphical and non-graphical ways to determine this (http://www.empowerstats.com/manuals/paper/scree.pdf#1) but general rules of thumbs are to have the eigenvalue greater than 1, and/or find an ‘elbow’ in the plot - a point where including another PC doesn’t explain variance as much as the last PC.
To determine the number of PCs to keep, we will also look at a Parallel Analysis of the dataset. Parallel analysis works by creating a random dataset with the same numbers of observations and variables as the original data, we have set the parallel() function below to run 100 times. A correlation matrix is computed from the randomly generated dataset and then eigenvalues of the correlation matrix are computed. When the eigenvalues from the random data are larger than the eigenvalues from the PCA or factor analysis it signals that the components or factors are mostly random noise, thus not to be included.
Based on the Scree Plot and Parallel Analysis below, we want to retain 16 PCs - where PC eigenvalues are greater than the Parallel Analysis random eigenvalues averaged over 100 replications.
# Determine Number of Factors to Extract
library(nFactors)
# Get eigenvalues
ev <- eigen(cor(pca_data[,pca_vars]))
# Run parallel analysis 100 times, get output
ap <- parallel(subject=nrow(pca_data[,pca_vars]),var=ncol(pca_data[,pca_vars]),
rep=100,cent=.05)
# Combine eigenvalues values and parallel analysis values and plot
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
These 16 vectors contain scores for each player-season. We can pull the factor loadings from the ‘pca’ object and visualize which of our original on-ice metrics make up each component. We can also take a look at what these components look like with player-seasons attached to them as an example. Note PC1 seems to be made up of metrics that puts some notable scorers near the top.
## Principal Components on data
pca <- prcomp(pca_data[,pca_vars])
## Create matrix of original variables by PCs
loadings <- as.matrix(pca$rotation[,1:16])
## Create dataframe of player-season level components
comp <- data.frame(pca$x[,1:16])
## Visualize component loadings, blue denotes metric is highly correlated with the PC. The component loadings are correlation coefficients between the variables (rows) and component (columns). Analogous to Pearson's r, the squared component loading is the percent of variance in that variable explained by the factor. (ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf)
d3heatmap::d3heatmap(loadings, dendrogram = "none")
## Look at a sample of players and their PCs
player_components <- cbind(pca_data[,c("Player","shooterID","season")], comp)
## Players with highest PC1
player_components %>% arrange(-PC1) %>% head() %>% print()
## Player shooterID season PC1 PC2 PC3
## 1 BEN WALTER 8471372 20072008 10.933258 9.006279 33.5120646
## 2 NICK JOHNSON 8471280 20122013 8.740949 10.651292 25.0315108
## 3 EVGENI MALKIN 8471215 20112012 7.885107 4.656312 -0.4182223
## 4 JAMES NEAL 8471707 20112012 7.817560 3.149731 1.0976165
## 5 ALEX OVECHKIN 8471214 20092010 7.717287 6.113872 -3.8732694
## 6 JAMIE BENN 8473994 20152016 7.615739 1.527613 0.8322005
## PC4 PC5 PC6 PC7 PC8 PC9
## 1 -10.76394188 -31.3095006 -17.8543277 -9.1504618 6.60902841 10.67556150
## 2 -7.43354747 -21.0690673 -13.0372167 -7.0665458 7.20082925 8.64285147
## 3 -1.41171254 -0.4549992 1.2227250 0.3312946 0.78568735 0.26857619
## 4 0.50383731 3.8917850 -5.1585164 -0.3326766 -1.12957623 0.05299163
## 5 -1.19790940 -1.1744967 1.5181517 0.6269304 0.29067378 -0.59278354
## 6 0.08739307 0.1737666 -0.1526732 0.1232433 0.06669861 -0.08398483
## PC10 PC11 PC12 PC13 PC14 PC15
## 1 20.3258104 -6.7423909 -2.16999676 1.0159797 11.1848612 -13.2848456
## 2 18.8443771 -11.6250087 0.01123968 4.2851560 9.4538159 -11.1071870
## 3 0.1255588 0.4513855 -2.88780985 0.5211703 0.1168389 0.2538285
## 4 1.4939522 1.0814434 0.94512417 -1.2964290 1.2406700 -0.3744424
## 5 -0.5390334 -0.8470502 -0.92352589 -0.3373556 0.7619723 -0.1860636
## 6 -0.1404038 0.3135776 1.43546347 -0.8585913 0.8294723 -0.2260359
## PC16
## 1 -2.2520485
## 2 -1.9847774
## 3 1.3667700
## 4 -1.1080658
## 5 -0.7677967
## 6 0.3280049
After reducing the number of features using the PCA, we can run the k-means algorithm. K-means is an unsupervised (since we don’t know what the cluster of each player is before hand) algorithm, clustering each player to the nearest mean or mathematical center of a cluster. To find the optimal number of clusters, we iteratively test different numbers of clusters and record their respective Within Group Sum of Squares (WSS). By definition adding another cluster will lower WSS, so we look for the ‘elbow’ again, where adding another cluster doesn’t do much for us.
wss <- c()
set.seed(1234)
## Test kmeans with 1 to 25 centers, keeping the WSS from each test
for (i in 1:25) {
wss[i] <- sum(kmeans(pca_data[,pca_vars], centers=i, nstart = 25, iter.max = 1000)$withinss)
}
## Plot number of clusters and WSS, looking for 'elbow'
plot(1:25, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
While there is no extremely-sharp ‘elbow’ in the data, signifying a clear number of player-types in the last decade, there seems to be a slight bend at k=12 (sometimes data science is data art). This gives us a good number of clusters - certainly more interesting than trying to fit everyone into 3 clusters or coming up with names for 20 groups. Feel free to play around with this yourself if you like.
After clustering each player-season into 12 clusters, we can plot where each cluster falls within each of the PCs. It helps to visualize how clusters form, finding similar player-seasons in a multi-dimension space. Each color below represents a cluster.
# From scree plot elbow occurs at k = 13
# Apply k-means with k=13
kmeans_object <- kmeans(comp, 12, nstart=25, iter.max=1000)
# Plot cluster by PC
plot(comp, col=as.factor(kmeans_object$clust), pch=16)
Finally, we want to label each cluster to make them a little more accessible. I did this using a combination of methods.
So, for example, we see cluster 12 maps to Ovechkin, Kane, and Malkin. We also see in the ‘cluster_xwalk’ below it is relatively overweight looking at PC1 (+4). Looking at the component loadings above, we see PC1 is positively correlated with scoring and PP time, so this cluster likely has many offensive weapons.
In order to generalize, I ranked talent levels by position, noticing that clusters that are high scoring or skilled cohorts are rated highly (my predicted CrowdScout Score is built on user scouting inputs, and even the best fanalysts are more likely to observe offense than defense). The best clusters were made up of productive ‘Skilled’ players that were seemingly ability to ‘Drive’ team xG metrics and played in ‘All-Situations.’ The 2nd and 3rd highest skilled groups were termed ‘Favorable-Situation’ because they generally had a higher share of powerplay time and results and a lower share of shorthanded time and results. Further down, xG relative to team and usage often signalled sheltering or ‘Depth’-type players.
This is to say, you may want to read these cluster labels not as an air tight law of nature, rather what they represent relative to the other clusters.
# Vector of player clusters from 'kmeans_object' object
Cluster = kmeans_object$clust
# Print PCs (13 columns) making up each cluster (13 rows)
cluster_center = as.data.frame(kmeans_object$centers)
print(cluster_center)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -4.9697572 0.4267106 -0.067949427 -0.6595092 0.19375676 0.13791757
## 2 0.8844018 2.7771605 9.512039877 -4.5811274 -7.73365148 -5.61258895
## 3 -0.4345101 -0.7531353 0.005337511 2.1227437 -1.10054853 -0.36237986
## 4 -1.0779345 -1.3799075 -0.182288914 -1.4171342 0.68750969 -0.35649757
## 5 -2.1610421 0.5074045 -0.852269081 0.3583863 3.37497852 -1.86620396
## 6 -3.0014025 1.2254760 0.822700595 0.8684013 -0.42108460 4.21414829
## 7 -3.0108648 1.0611701 0.204298321 0.2863108 -0.04860887 0.09934151
## 8 4.0321379 1.2079719 0.318906322 0.3505113 0.40179186 0.05271376
## 9 2.5637316 12.2655756 -23.346267853 -5.7188135 -13.15480668 -6.18208013
## 10 0.4944777 1.8472207 0.573996176 0.4105548 0.43268505 0.40879548
## 11 1.9795006 -3.2009686 -0.362383392 -0.5364423 -0.23752568 0.24170253
## 12 3.0526901 4.1992570 -3.946319415 -1.3333983 -1.95031422 -0.41565085
## PC7 PC8 PC9 PC10 PC11
## 1 1.09459669 -0.28377259 1.28268524 0.127942843 1.04077200
## 2 -1.21103142 -1.26475192 -0.33320148 -5.199340101 -2.07244019
## 3 0.11272742 0.17068638 -0.03146655 -0.122495200 0.50134770
## 4 -0.25458239 -0.27430725 -0.20549964 0.090775986 -0.63182910
## 5 -6.48229292 11.38020701 0.76745596 -3.274964218 1.20434936
## 6 -11.55311876 -6.19670982 -2.09956499 1.240517941 2.85246325
## 7 0.11006599 0.10895127 -0.48133202 0.003914345 0.04973259
## 8 0.30974182 -0.10754436 0.10958536 0.037754132 0.59370883
## 9 -2.94637587 -1.24749117 -0.56041481 -0.556506523 -1.19528646
## 10 -0.05863843 0.16531885 0.05537615 0.108629926 -0.62331483
## 11 -0.06100447 0.01702405 -0.01595944 0.010223351 -0.35880183
## 12 -0.17735624 0.06752644 0.10568371 -0.025613739 -0.14809753
## PC12 PC13 PC14 PC15 PC16
## 1 -0.56927633 -1.53261138 1.29298687 0.1209554922 -0.02207076
## 2 -1.17214247 -0.60685478 1.41327521 1.4936869898 -0.13914990
## 3 -0.04444384 0.48989548 -0.02645625 0.0564728470 0.04396016
## 4 -0.38637368 0.41792135 -0.23706326 -0.1557679297 -0.09271773
## 5 -1.27294557 -0.87308062 1.09501621 0.1312868356 0.20609165
## 6 -0.43742846 -1.45890791 1.08868974 -0.1932377816 -0.39072043
## 7 0.58830355 0.36759768 -0.62142641 0.0055161186 0.05782381
## 8 0.05497915 -0.13353263 0.04973056 -0.0004424651 0.13429013
## 9 -3.20144659 -0.85277045 0.47101946 -1.5184502439 7.31356433
## 10 0.26812125 0.18447536 0.04554889 -0.0094881337 0.09228903
## 11 -0.06952163 -0.38019112 0.06229165 -0.0134620970 0.03328358
## 12 -0.07606657 -0.03062333 0.03084002 0.3595985351 -2.07749520
# Join clusters to player season data
player_season_clusters <- data.frame(cbind(Cluster,pca_data[,c("Player","Pos","shooterID", "season", "Predicted.CS")]))
# Print top 3 examples of each cluster
player_season_clusters %>% arrange(-Cluster) %>% group_by(Cluster) %>% top_n(3) %>% print()
## # A tibble: 37 x 6
## # Groups: Cluster [12]
## Cluster Player Pos shooterID season Predicted.CS
## <int> <fctr> <fctr> <int> <int> <dbl>
## 1 12 ALEX OVECHKIN F 8471214 20142015 94.45475
## 2 12 PATRICK KANE F 8474141 20152016 93.76518
## 3 12 EVGENI MALKIN F 8471215 20132014 94.41843
## 4 11 RYAN SUTER D 8470600 20142015 98.00066
## 5 11 KRIS LETANG D 8471724 20152016 95.72291
## 6 11 KRIS LETANG D 8471724 20162017 96.71833
## 7 10 DANIEL BRIERE F 8464975 20082009 76.82660
## 8 10 PAUL KARIYA F 8459426 20082009 77.81999
## 9 10 ANDERS LEE F 8475314 20132014 75.25851
## 10 9 ANDRE BURAKOVSKY F 8477444 20162017 43.27903
## # ... with 27 more rows
## Create cluster name / number crosswalk
cluster_xwalk <- player_season_clusters %>%
mutate(Fwd = ifelse(Pos == "D",0,1)) %>%
# Summarise data to cluster level, calculating mean talent, share of forwards, and sample size
group_by(Cluster) %>%
summarise(Mean.Predicted.CS = mean(Predicted.CS),
SD.Predicted.CS = sd(Predicted.CS),
Fwd.Share = mean(Fwd), Count = n()) %>%
# Determine positional makeup of each cluster
mutate(Pos = ifelse(Fwd.Share < 0.15,"D",
ifelse(Fwd.Share > 0.85,"F","Both"))) %>%
# Join cluster centers to data
arrange(Cluster) %>%
cbind(cluster_center) %>%
# Rank cluster talent by position, and create names based on rules
group_by(Pos) %>%
# I developed these rules as cluster numbers may change, based on sample players and how the prominent PCs in each cluster map the eigenvalues back to the original metrics
mutate(ClusterRank = rank(-Mean.Predicted.CS),
ClusterName =
ifelse(Pos == "F" & ClusterRank == 1,
"Skilled All-Situation Offensive Driver",
ifelse(Pos == "D" & ClusterRank == 1,
"Skilled All-Situation Defensive Driver",
ifelse(Pos == "F" & ClusterRank == 2,
"Shooting Favorable-Situation Offensive Driver",
ifelse(Pos == "D" & ClusterRank == 2,
"Skilled Favorable-Situation Defensive Player",
ifelse(Pos == "F" & ClusterRank == 3,
"Skilled Favorable-Situation Offensive Driver",
ifelse(Pos == "D" & ClusterRank == 3,
"All-Situation Defensive Player",
ifelse(Pos == "F" & ClusterRank == 4,
"All-Situation Defensive Forward",
ifelse(Pos == "D" & ClusterRank == 4,
"Favorable-Situation Defensive Depth",
ifelse(Pos == "F" & ClusterRank == 5,
"Favorable-Situation Offensive Depth",
ifelse(Pos == "F" & ClusterRank == 6,
"Favorable-Situation Offensive Depth",
ifelse(Pos == "F" & ClusterRank == 7,
"Utility Depth",
"Utility Depth"))))))))))))
# Join crosswalk to data and rank skill by position, creating depth chart ranking and cluster
player_season_clusters <- pca_data[,c("Player","shooterID","season","Pos","Predicted.CS")] %>%
# Join cluster and cluster names
cbind(Cluster) %>%
left_join(cluster_xwalk[, c("Cluster","ClusterName")], by = "Cluster") %>%
mutate(Name = paste0(sapply(strsplit(as.character(Player), ' '), function(x) x[length(x)]))) %>%
# Rank players by position and create league ranking
group_by(season, Pos) %>%
mutate(Pos.Rank = rank(-Predicted.CS),
DepthChart = ifelse(Pos == "D" & Pos.Rank < 60,"1P D",
ifelse(Pos == "D" & Pos.Rank < 120,"2P D",
ifelse(Pos == "D" & Pos.Rank < 180,"3P D",
ifelse(Pos == "D","Other D",
ifelse(Pos != "D" & Pos.Rank < 90,"1L Fwd",
ifelse(Pos != "D" & Pos.Rank < 180,"2L Fwd",
ifelse(Pos != "D" & Pos.Rank < 270,"3L Fwd",
ifelse(Pos != "D" & Pos.Rank < 360,"4L Fwd",
ifelse(Pos != "D","Other Fwd","Other"))))))))))
Create a function that creates a plot for all player seasons estimated ability and player, labelling select players. I chose to display:
Now you can add any player name to the plot_players_cluster_function() function and plot their cluster and ability by season.
plot_players_cluster_function <- function(player_list, seasons=c("20072008","20082009","20092010","20102011","20112012","20122013","20132014","20142015","20152016","20162017")) {
plot <- player_season_clusters %>%
## Limit to season in function list
filter(season %in% seasons) %>%
## Display only players in player list
mutate(PlayerListed = ifelse(Player %in% player_list,paste0(Player,substr(season,7,8)),NA)) %>%
## Plot
ggplot(aes(y=Predicted.CS,x=reorder(ClusterName,-Predicted.CS), color=as.factor(DepthChart), shape=as.factor(Pos),
label=PlayerListed, color=as.factor(ClusterName)), size=20) +
geom_point(size=5, alpha=0.5) +
geom_label(color = 'grey50') +
guides(colour = guide_legend(override.aes = list(size=8))) +
labs(title = paste0(paste(player_list, sep=",", collapse=", "),"\nClusters and Estimated Ability by Season, data from www.crowdscoutsports.com"), y="Predicted CrowdScout Score (0-100)", shape="Position",color="Depth Chart",x="") +
theme_standard() +
scale_color_discrete()
return(plot)
}
plot_players_cluster_function(c("ALEX OVECHKIN"), c("20142015","20152016","20162017"))
plot_players_cluster_function(c("MARK SCHEIFELE"))
plot_players_cluster_function(c("MIKAEL BACKLUND"))
Clustering is fun but doesn’t really reveal too much on its own. However, this is a precursor to a Player Similarity Analysis I am completing (and will publish in a similar way, producing a Jupyter Notebook with commentary, instead of a traditional blog post). Knowing how a player is deployed and the type of results they generally create of course helps this exercise. This has more interesting applications like projecting future performance based on comparable players and their future performance.
Additional research can also use these player archetypes predict shift, game, or season-level results.