Skater Clustering Analysis

Hockey analytics has created a large array of statistics to measure player performance. How to properly weigh each of these to create a comprehensive overview of each player is not as widely discussed. Some, including myself, have tried to smartly roll them up into a single metric. However, possibly more appropriately, they can also be used to perform a clustering analysis, identifying player types.

To do this, it helps to first reduce the number of features we have for each player. This is best done with Principal Components Analysis (PCA), which boils down all metrics into a few by creating a ‘Principal Component’ based on finding the maximum variability in the dataset. Then finding the second dimension of maximum variability, and so on.

PCA aims captures as much variation in data as possible

It also discards useless information. Imagine our dataset had a metric that was ‘Good in the Room / 60,’ but since we didn’t know anything about what happens in the room, every player is rated as a 10. This metric would offer zero variation and would be discarded. Scoring rates, on the other hand, vary greatly between players and would help make up the first principal component, calculated to have the largest possible variance.

The smaller dataset can then be used to perform K-means clustering, identifying ‘k’ different player types based on the principal components for each player-season.

Note: This analysis is meant to combine some working knowledge of R (dplyr, subsetting), statistical tools (scaling), and hockey analytics. I try to comment code and explain as best I can, but am happy to handle any additional questions at cole92anderson@gmail.com or @crowdscoutsprts on twitter.

1. Data Preparation

To re-create this analysis, download the data here.

### Load Data from downloaded folder
crowdscout_data_pred <- read.csv("~/Downloads/crowdscout_data_download.csv", header = TRUE)

### Look at column names (ignore that this analysis has already been done)
colnames(crowdscout_data_pred)

##   [1] "Player"                     "shooterID"                 
##   [3] "season"                     "Cluster"                   
##   [5] "ClusterName"                "Pos.Rank"                  
##   [7] "DepthChart"                 "Predicted.CS"              
##   [9] "Predicted.CS.RF.Scaled"     "Predicted.CS.LM.Scaled"    
##  [11] "Pos"                        "Games.Played_EV"           
##  [13] "Games.Played_PP"            "Games.Played_SH"           
##  [15] "TOI_EV"                     "TOI_PP"                    
##  [17] "TOI_SH"                     "Total.Shifts_EV"           
##  [19] "Total.Shifts_PP"            "Total.Shifts_SH"           
##  [21] "OTF.Shift.Share_EV"         "OTF.Shift.Share_PP"        
##  [23] "OTF.Shift.Share_SH"         "Off.FO.Shift.Share_EV"     
##  [25] "Off.FO.Shift.Share_PP"      "Off.FO.Shift.Share_SH"     
##  [27] "Def.FO.Shift.Share_EV"      "Def.FO.Shift.Share_PP"     
##  [29] "Def.FO.Shift.Share_SH"      "Neu.FO.Shift.Share_EV"     
##  [31] "Neu.FO.Shift.Share_PP"      "Neu.FO.Shift.Share_SH"     
##  [33] "ixG60_EV"                   "ixG60_PP"                  
##  [35] "ixG60_SH"                   "G60_EV"                    
##  [37] "G60_PP"                     "G60_SH"                    
##  [39] "A160_EV"                    "A160_PP"                   
##  [41] "A160_SH"                    "A260_EV"                   
##  [43] "A260_PP"                    "A260_SH"                   
##  [45] "xGF60_EV"                   "xGF60_PP"                  
##  [47] "xGF60_SH"                   "xGA60_EV"                  
##  [49] "xGA60_PP"                   "xGA60_SH"                  
##  [51] "team.time.wo_EV"            "team.time.wo_PP"           
##  [53] "team.time.wo_SH"            "xGF60_teamWO_EV"           
##  [55] "xGF60_teamWO_PP"            "xGF60_teamWO_SH"           
##  [57] "xGA60_teamWO_EV"            "xGA60_teamWO_PP"           
##  [59] "xGA60_teamWO_SH"            "G60_Away_EV"               
##  [61] "G60_Away_PP"                "G60_Away_SH"               
##  [63] "G60_Home_EV"                "G60_Home_PP"               
##  [65] "G60_Home_SH"                "ixG60_Away_EV"             
##  [67] "ixG60_Away_PP"              "ixG60_Away_SH"             
##  [69] "ixG60_Home_EV"              "ixG60_Home_PP"             
##  [71] "ixG60_Home_SH"              "P160_Away_EV"              
##  [73] "P160_Away_PP"               "P160_Away_SH"              
##  [75] "P160_Home_EV"               "P160_Home_PP"              
##  [77] "P160_Home_SH"               "P260_Away_EV"              
##  [79] "P260_Away_PP"               "P260_Away_SH"              
##  [81] "P260_Home_EV"               "P260_Home_PP"              
##  [83] "P260_Home_SH"               "xGA60_Away_EV"             
##  [85] "xGA60_Away_PP"              "xGA60_Away_SH"             
##  [87] "xGA60_Home_EV"              "xGA60_Home_PP"             
##  [89] "xGA60_Home_SH"              "xGF60_Away_EV"             
##  [91] "xGF60_Away_PP"              "xGF60_Away_SH"             
##  [93] "xGF60_Home_EV"              "xGF60_Home_PP"             
##  [95] "xGF60_Home_SH"              "Player_Competition_EV"     
##  [97] "Player_Competition_PP"      "Player_Competition_SH"     
##  [99] "Player_Teammates_EV"        "Player_Teammates_PP"       
## [101] "Player_Teammates_SH"        "TeamWO_Competition_EV"     
## [103] "TeamWO_Competition_PP"      "TeamWO_Competition_SH"     
## [105] "TeamWO_Teammates_EV"        "TeamWO_Teammates_PP"       
## [107] "TeamWO_Teammates_SH"        "Player_Competition_Away_EV"
## [109] "Player_Competition_Away_PP" "Player_Competition_Away_SH"
## [111] "Player_Competition_Home_EV" "Player_Competition_Home_PP"
## [113] "Player_Competition_Home_SH" "Player_Teammates_Away_EV"  
## [115] "Player_Teammates_Away_PP"   "Player_Teammates_Away_SH"  
## [117] "Player_Teammates_Home_EV"   "Player_Teammates_Home_PP"  
## [119] "Player_Teammates_Home_SH"   "TeamWO_Competition_Away_EV"
## [121] "TeamWO_Competition_Away_PP" "TeamWO_Competition_Away_SH"
## [123] "TeamWO_Competition_Home_EV" "TeamWO_Competition_Home_PP"
## [125] "TeamWO_Competition_Home_SH" "TeamWO_Teammates_Away_EV"  
## [127] "TeamWO_Teammates_Away_PP"   "TeamWO_Teammates_Away_SH"  
## [129] "TeamWO_Teammates_Home_EV"   "TeamWO_Teammates_Home_PP"  
## [131] "TeamWO_Teammates_Home_SH"   "Share.of.Ice_EV"           
## [133] "Share.of.Ice_PP"            "Share.of.Ice_SH"           
## [135] "xGF60_Rel_EV"               "xGF60_Rel_PP"              
## [137] "xGF60_Rel_SH"               "xGA60_Rel_EV"              
## [139] "xGA60_Rel_PP"               "xGA60_Rel_SH"              
## [141] "xGD60_EV"                   "xGD60_PP"                  
## [143] "xGD60_SH"                   "xGD60_teamWO_EV"           
## [145] "xGD60_teamWO_PP"            "xGD60_teamWO_SH"           
## [147] "P160_EV"                    "P160_PP"                   
## [149] "P160_SH"                    "P60_EV"                    
## [151] "P60_PP"                     "P60_SH"                    
## [153] "Teammates_Diff_EV"          "Teammates_Diff_PP"         
## [155] "max.RF"                     "min.RF"                    
## [157] "max.LM"                     "min.LM"                    
## [159] "Player.Position"            "Shoots"                    
## [161] "shooterDOB"                 "TOI"                       
## [163] "G60"                        "P60"

### Create position variables and 
crowdscout_data_pred_pca <- crowdscout_data_pred %>% 
        mutate(D = ifelse(Player.Position == "D",1,0),
               C = ifelse(Player.Position == "C",1,0),
               W = ifelse(!Player.Position %in% c("C","D"),1,0),
               Shooting_PP = G60_PP - ixG60_PP,
               Shooting_EV = G60_EV - ixG60_EV,
               Shooting_SH = G60_SH - ixG60_SH)

### Identify variables to scale and performance PCA on.
### Feel free to add other metrics based on columns available, or remove based on preference
pca_vars <- c("Total.Shifts_EV","Total.Shifts_PP","Total.Shifts_SH","OTF.Shift.Share_EV",
              "OTF.Shift.Share_PP","OTF.Shift.Share_SH","Off.FO.Shift.Share_EV",
              "Off.FO.Shift.Share_PP","Off.FO.Shift.Share_SH",
              "Def.FO.Shift.Share_EV","Def.FO.Shift.Share_PP",
              "Def.FO.Shift.Share_SH","ixG60_EV","ixG60_PP","ixG60_SH","G60_EV","G60_PP",
              "G60_SH","A160_EV","A160_PP","A160_SH","xGF60_EV","xGF60_PP",
              "xGF60_SH","xGA60_EV","xGA60_PP","xGA60_SH","Player_Competition_EV",
              "Player_Teammates_EV","Player_Teammates_PP","Share.of.Ice_EV",
              "Share.of.Ice_PP","Share.of.Ice_SH","xGF60_Rel_EV","xGF60_Rel_PP",
              "xGF60_Rel_SH","xGA60_Rel_EV","xGA60_Rel_PP","xGA60_Rel_SH","P60_EV","P60_PP",
              "P60_SH","Teammates_Diff_EV","Teammates_Diff_PP","D","C","W",
              "Shooting_PP","Shooting_EV","Shooting_SH")

2a. Create function to scale and center data by season

To factor in changes in scoring or any other metric over the course of the last decade, each season will be scaled separately.

# Create Function to Scale Each Season
scale_season_function <- function(year) {
  
  ### Filter data to year
  season_data <- crowdscout_data_pred_pca %>% 
                      filter(season == year)
  
  ### Scale variables to be used in analysis for that season 
  scaled_season <- scale(season_data[,pca_vars])
  
  ### Place scaled player level metrics next to player identifiers
  season_scaled <- as.data.frame(cbind(season_data[,c("Player","shooterID","season","Pos","Predicted.CS")],scaled_season))
  
  return(season_scaled)
    
}

2a. Scale Each Season

## Find Unique Seasons in data
seasons <- unique(crowdscout_data_pred_pca$season)

## Scale Each Season Separately, Stack Output into master PCA dataset
pca_data <- plyr::rbind.fill(lapply(FUN=scale_season_function,seasons))

## Replace NAs with 0
pca_data[is.na(pca_data[,pca_vars]),pca_vars] <- 0

## Create unique player season identifier
id_vector <- as.vector(paste0(pca_data$Player,"-",substr(pca_data$season,7,8),"-",pca_data$shooterID))

## Label rows as player ID
rownames(pca_data) <- id_vector

## Verify variance is uniform
plot(sapply(pca_data[,pca_vars], var))

3. Find Principal Components using Scree Plot

PCA Analysis is normally performed to reduce the size of data without losing information. We can reduce player metrics down in dimension by using the nFactors package. Ideally, we would try to explain a lot more of the variance, but there often trade offs involved in choosing the number of principal components (PCs) to keep. There are graphical and non-graphical ways to determine this (http://www.empowerstats.com/manuals/paper/scree.pdf#1) but general rules of thumbs are to have the eigenvalue greater than 1, and/or find an ‘elbow’ in the plot - a point where including another PC doesn’t explain variance as much as the last PC.

To determine the number of PCs to keep, we will also look at a Parallel Analysis of the dataset. Parallel analysis works by creating a random dataset with the same numbers of observations and variables as the original data, we have set the parallel() function below to run 100 times. A correlation matrix is computed from the randomly generated dataset and then eigenvalues of the correlation matrix are computed. When the eigenvalues from the random data are larger than the eigenvalues from the PCA or factor analysis it signals that the components or factors are mostly random noise, thus not to be included.

Based on the Scree Plot and Parallel Analysis below, we want to retain 16 PCs - where PC eigenvalues are greater than the Parallel Analysis random eigenvalues averaged over 100 replications.

# Determine Number of Factors to Extract
library(nFactors)

# Get eigenvalues
ev <- eigen(cor(pca_data[,pca_vars])) 

# Run parallel analysis 100 times, get output
ap <- parallel(subject=nrow(pca_data[,pca_vars]),var=ncol(pca_data[,pca_vars]),
               rep=100,cent=.05)

# Combine eigenvalues values and parallel analysis values and plot
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

4. Limit dataset to most useful Principal Components & determine individual variables that make up each PC

These 16 vectors contain scores for each player-season. We can pull the factor loadings from the ‘pca’ object and visualize which of our original on-ice metrics make up each component. We can also take a look at what these components look like with player-seasons attached to them as an example. Note PC1 seems to be made up of metrics that puts some notable scorers near the top.

## Principal Components on data
pca <- prcomp(pca_data[,pca_vars])

## Create matrix of original variables by PCs
loadings <- as.matrix(pca$rotation[,1:16])

## Create dataframe of player-season level components
comp <- data.frame(pca$x[,1:16])

## Visualize component loadings, blue denotes metric is highly correlated with the PC. The component loadings are correlation coefficients between the variables (rows) and component (columns). Analogous to Pearson's r, the squared component loading is the percent of variance in that variable explained by the factor. (ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf)
d3heatmap::d3heatmap(loadings, dendrogram = "none")

## Look at a sample of players and their PCs
player_components <- cbind(pca_data[,c("Player","shooterID","season")], comp)

## Players with highest PC1
player_components %>% arrange(-PC1) %>% head() %>% print()

##          Player shooterID   season       PC1       PC2        PC3
## 1    BEN WALTER   8471372 20072008 10.933258  9.006279 33.5120646
## 2  NICK JOHNSON   8471280 20122013  8.740949 10.651292 25.0315108
## 3 EVGENI MALKIN   8471215 20112012  7.885107  4.656312 -0.4182223
## 4    JAMES NEAL   8471707 20112012  7.817560  3.149731  1.0976165
## 5 ALEX OVECHKIN   8471214 20092010  7.717287  6.113872 -3.8732694
## 6    JAMIE BENN   8473994 20152016  7.615739  1.527613  0.8322005
##            PC4         PC5         PC6        PC7         PC8         PC9
## 1 -10.76394188 -31.3095006 -17.8543277 -9.1504618  6.60902841 10.67556150
## 2  -7.43354747 -21.0690673 -13.0372167 -7.0665458  7.20082925  8.64285147
## 3  -1.41171254  -0.4549992   1.2227250  0.3312946  0.78568735  0.26857619
## 4   0.50383731   3.8917850  -5.1585164 -0.3326766 -1.12957623  0.05299163
## 5  -1.19790940  -1.1744967   1.5181517  0.6269304  0.29067378 -0.59278354
## 6   0.08739307   0.1737666  -0.1526732  0.1232433  0.06669861 -0.08398483
##         PC10        PC11        PC12       PC13       PC14        PC15
## 1 20.3258104  -6.7423909 -2.16999676  1.0159797 11.1848612 -13.2848456
## 2 18.8443771 -11.6250087  0.01123968  4.2851560  9.4538159 -11.1071870
## 3  0.1255588   0.4513855 -2.88780985  0.5211703  0.1168389   0.2538285
## 4  1.4939522   1.0814434  0.94512417 -1.2964290  1.2406700  -0.3744424
## 5 -0.5390334  -0.8470502 -0.92352589 -0.3373556  0.7619723  -0.1860636
## 6 -0.1404038   0.3135776  1.43546347 -0.8585913  0.8294723  -0.2260359
##         PC16
## 1 -2.2520485
## 2 -1.9847774
## 3  1.3667700
## 4 -1.1080658
## 5 -0.7677967
## 6  0.3280049

5. Find optimal number of player clusters

After reducing the number of features using the PCA, we can run the k-means algorithm. K-means is an unsupervised (since we don’t know what the cluster of each player is before hand) algorithm, clustering each player to the nearest mean or mathematical center of a cluster. To find the optimal number of clusters, we iteratively test different numbers of clusters and record their respective Within Group Sum of Squares (WSS). By definition adding another cluster will lower WSS, so we look for the ‘elbow’ again, where adding another cluster doesn’t do much for us.

wss <- c()
set.seed(1234)
## Test kmeans with 1 to 25 centers, keeping the WSS from each test
for (i in 1:25) {
  wss[i] <- sum(kmeans(pca_data[,pca_vars], centers=i, nstart = 25, iter.max = 1000)$withinss) 
}

## Plot number of clusters and WSS, looking for 'elbow'
plot(1:25, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

6. Find optimal number of player clusters

While there is no extremely-sharp ‘elbow’ in the data, signifying a clear number of player-types in the last decade, there seems to be a slight bend at k=12 (sometimes data science is data art). This gives us a good number of clusters - certainly more interesting than trying to fit everyone into 3 clusters or coming up with names for 20 groups. Feel free to play around with this yourself if you like.

Before and after clustering in 2D separating into 3 clusters (now imagine 16D separating into 12 clusters) After clustering each player-season into 12 clusters, we can plot where each cluster falls within each of the PCs. It helps to visualize how clusters form, finding similar player-seasons in a multi-dimension space. Each color below represents a cluster.

# From scree plot elbow occurs at k = 13
# Apply k-means with k=13
kmeans_object <- kmeans(comp, 12, nstart=25, iter.max=1000)

# Plot cluster by PC
plot(comp, col=as.factor(kmeans_object$clust), pch=16)

7. Label each cluster

Finally, we want to label each cluster to make them a little more accessible. I did this using a combination of methods.

Looked at the players in each group, one has Crosby, another a fringe NHLer, etc
Looked at share of forwards in each cluster and mean CrowdScout Score (proxy for talent)
Looked at the PCs that make up the cluster center. Each of these PCs are made up of original metrics as we saw in the component loadings above.

So, for example, we see cluster 12 maps to Ovechkin, Kane, and Malkin. We also see in the ‘cluster_xwalk’ below it is relatively overweight looking at PC1 (+4). Looking at the component loadings above, we see PC1 is positively correlated with scoring and PP time, so this cluster likely has many offensive weapons.

In order to generalize, I ranked talent levels by position, noticing that clusters that are high scoring or skilled cohorts are rated highly (my predicted CrowdScout Score is built on user scouting inputs, and even the best fanalysts are more likely to observe offense than defense). The best clusters were made up of productive ‘Skilled’ players that were seemingly ability to ‘Drive’ team xG metrics and played in ‘All-Situations.’ The 2nd and 3rd highest skilled groups were termed ‘Favorable-Situation’ because they generally had a higher share of powerplay time and results and a lower share of shorthanded time and results. Further down, xG relative to team and usage often signalled sheltering or ‘Depth’-type players.

This is to say, you may want to read these cluster labels not as an air tight law of nature, rather what they represent relative to the other clusters.

# Vector of player clusters from 'kmeans_object' object
Cluster = kmeans_object$clust

# Print PCs (13 columns) making up each cluster (13 rows)
cluster_center = as.data.frame(kmeans_object$centers)
print(cluster_center)

##           PC1        PC2           PC3        PC4          PC5         PC6
## 1  -4.9697572  0.4267106  -0.067949427 -0.6595092   0.19375676  0.13791757
## 2   0.8844018  2.7771605   9.512039877 -4.5811274  -7.73365148 -5.61258895
## 3  -0.4345101 -0.7531353   0.005337511  2.1227437  -1.10054853 -0.36237986
## 4  -1.0779345 -1.3799075  -0.182288914 -1.4171342   0.68750969 -0.35649757
## 5  -2.1610421  0.5074045  -0.852269081  0.3583863   3.37497852 -1.86620396
## 6  -3.0014025  1.2254760   0.822700595  0.8684013  -0.42108460  4.21414829
## 7  -3.0108648  1.0611701   0.204298321  0.2863108  -0.04860887  0.09934151
## 8   4.0321379  1.2079719   0.318906322  0.3505113   0.40179186  0.05271376
## 9   2.5637316 12.2655756 -23.346267853 -5.7188135 -13.15480668 -6.18208013
## 10  0.4944777  1.8472207   0.573996176  0.4105548   0.43268505  0.40879548
## 11  1.9795006 -3.2009686  -0.362383392 -0.5364423  -0.23752568  0.24170253
## 12  3.0526901  4.1992570  -3.946319415 -1.3333983  -1.95031422 -0.41565085
##             PC7         PC8         PC9         PC10        PC11
## 1    1.09459669 -0.28377259  1.28268524  0.127942843  1.04077200
## 2   -1.21103142 -1.26475192 -0.33320148 -5.199340101 -2.07244019
## 3    0.11272742  0.17068638 -0.03146655 -0.122495200  0.50134770
## 4   -0.25458239 -0.27430725 -0.20549964  0.090775986 -0.63182910
## 5   -6.48229292 11.38020701  0.76745596 -3.274964218  1.20434936
## 6  -11.55311876 -6.19670982 -2.09956499  1.240517941  2.85246325
## 7    0.11006599  0.10895127 -0.48133202  0.003914345  0.04973259
## 8    0.30974182 -0.10754436  0.10958536  0.037754132  0.59370883
## 9   -2.94637587 -1.24749117 -0.56041481 -0.556506523 -1.19528646
## 10  -0.05863843  0.16531885  0.05537615  0.108629926 -0.62331483
## 11  -0.06100447  0.01702405 -0.01595944  0.010223351 -0.35880183
## 12  -0.17735624  0.06752644  0.10568371 -0.025613739 -0.14809753
##           PC12        PC13        PC14          PC15        PC16
## 1  -0.56927633 -1.53261138  1.29298687  0.1209554922 -0.02207076
## 2  -1.17214247 -0.60685478  1.41327521  1.4936869898 -0.13914990
## 3  -0.04444384  0.48989548 -0.02645625  0.0564728470  0.04396016
## 4  -0.38637368  0.41792135 -0.23706326 -0.1557679297 -0.09271773
## 5  -1.27294557 -0.87308062  1.09501621  0.1312868356  0.20609165
## 6  -0.43742846 -1.45890791  1.08868974 -0.1932377816 -0.39072043
## 7   0.58830355  0.36759768 -0.62142641  0.0055161186  0.05782381
## 8   0.05497915 -0.13353263  0.04973056 -0.0004424651  0.13429013
## 9  -3.20144659 -0.85277045  0.47101946 -1.5184502439  7.31356433
## 10  0.26812125  0.18447536  0.04554889 -0.0094881337  0.09228903
## 11 -0.06952163 -0.38019112  0.06229165 -0.0134620970  0.03328358
## 12 -0.07606657 -0.03062333  0.03084002  0.3595985351 -2.07749520

# Join clusters to player season data
player_season_clusters <- data.frame(cbind(Cluster,pca_data[,c("Player","Pos","shooterID", "season", "Predicted.CS")])) 

# Print top 3 examples of each cluster
player_season_clusters %>% arrange(-Cluster) %>% group_by(Cluster) %>% top_n(3) %>% print()

## # A tibble: 37 x 6
## # Groups:   Cluster [12]
##    Cluster           Player    Pos shooterID   season Predicted.CS
##      <int>           <fctr> <fctr>     <int>    <int>        <dbl>
##  1      12    ALEX OVECHKIN      F   8471214 20142015     94.45475
##  2      12     PATRICK KANE      F   8474141 20152016     93.76518
##  3      12    EVGENI MALKIN      F   8471215 20132014     94.41843
##  4      11       RYAN SUTER      D   8470600 20142015     98.00066
##  5      11      KRIS LETANG      D   8471724 20152016     95.72291
##  6      11      KRIS LETANG      D   8471724 20162017     96.71833
##  7      10    DANIEL BRIERE      F   8464975 20082009     76.82660
##  8      10      PAUL KARIYA      F   8459426 20082009     77.81999
##  9      10       ANDERS LEE      F   8475314 20132014     75.25851
## 10       9 ANDRE BURAKOVSKY      F   8477444 20162017     43.27903
## # ... with 27 more rows

## Create cluster name / number crosswalk
cluster_xwalk <- player_season_clusters %>%
          mutate(Fwd = ifelse(Pos == "D",0,1)) %>% 
          # Summarise data to cluster level, calculating mean talent, share of forwards, and sample size
          group_by(Cluster) %>% 
          summarise(Mean.Predicted.CS = mean(Predicted.CS), 
                    SD.Predicted.CS = sd(Predicted.CS), 
                    Fwd.Share = mean(Fwd), Count = n()) %>%
          # Determine positional makeup of each cluster  
          mutate(Pos = ifelse(Fwd.Share < 0.15,"D",
                       ifelse(Fwd.Share > 0.85,"F","Both"))) %>%
          # Join cluster centers to data
          arrange(Cluster) %>%
          cbind(cluster_center) %>%
          # Rank cluster talent by position, and create names based on rules
          group_by(Pos) %>%
          # I developed these rules as cluster numbers may change, based on sample players and how the prominent PCs in each cluster map the eigenvalues back to the original metrics
          mutate(ClusterRank = rank(-Mean.Predicted.CS),
                 ClusterName = 
                     ifelse(Pos == "F" & ClusterRank == 1,
                            "Skilled All-Situation Offensive Driver",
                     ifelse(Pos == "D" & ClusterRank == 1,
                            "Skilled All-Situation Defensive Driver",
                            
                     ifelse(Pos == "F" & ClusterRank == 2,
                            "Shooting Favorable-Situation Offensive Driver",
                     ifelse(Pos == "D" & ClusterRank == 2,
                            "Skilled Favorable-Situation Defensive Player",
                            
                     ifelse(Pos == "F" & ClusterRank == 3,
                            "Skilled Favorable-Situation Offensive Driver",
                     ifelse(Pos == "D" & ClusterRank == 3,
                            "All-Situation Defensive Player",
                            
                     ifelse(Pos == "F" & ClusterRank == 4,
                            "All-Situation Defensive Forward",
                     ifelse(Pos == "D" & ClusterRank == 4,
                            "Favorable-Situation Defensive Depth",
                            
                     ifelse(Pos == "F" & ClusterRank == 5,
                            "Favorable-Situation Offensive Depth",
                     ifelse(Pos == "F" & ClusterRank == 6,
                            "Favorable-Situation Offensive Depth",
                     ifelse(Pos == "F" & ClusterRank == 7,
                            "Utility Depth",
                            "Utility Depth"))))))))))))
                                              

# Join crosswalk to data and rank skill by position, creating depth chart ranking and cluster
player_season_clusters <- pca_data[,c("Player","shooterID","season","Pos","Predicted.CS")] %>%
          # Join cluster and cluster names
          cbind(Cluster) %>%
          left_join(cluster_xwalk[, c("Cluster","ClusterName")], by = "Cluster") %>%
          mutate(Name = paste0(sapply(strsplit(as.character(Player), ' '), function(x) x[length(x)]))) %>%
          # Rank players by position and create league ranking
            group_by(season, Pos) %>%
            mutate(Pos.Rank = rank(-Predicted.CS),
                   DepthChart = ifelse(Pos == "D" & Pos.Rank < 60,"1P D",
                           ifelse(Pos == "D" & Pos.Rank < 120,"2P D",
                           ifelse(Pos == "D" & Pos.Rank < 180,"3P D",
                           ifelse(Pos == "D","Other D",
                           ifelse(Pos != "D" & Pos.Rank < 90,"1L Fwd",
                           ifelse(Pos != "D" & Pos.Rank < 180,"2L Fwd",
                           ifelse(Pos != "D" & Pos.Rank < 270,"3L Fwd",
                           ifelse(Pos != "D" & Pos.Rank < 360,"4L Fwd",
                           ifelse(Pos != "D","Other Fwd","Other"))))))))))

8. Create function to plot player by cluster, season, ability, and positional ranking

Create a function that creates a plot for all player seasons estimated ability and player, labelling select players. I chose to display:

Alex Ovechkin because he’s awesome
Mark Scheifele because a) he’s had one of the more interesting ascents in the dataset, jumping around clusters as he matured and b) I need practice spelling his name
Mikael Backlund because he also had an interesting development arc.

Now you can add any player name to the plot_players_cluster_function() function and plot their cluster and ability by season.

plot_players_cluster_function <- function(player_list, seasons=c("20072008","20082009","20092010","20102011","20112012","20122013","20132014","20142015","20152016","20162017")) {

plot <- player_season_clusters %>%
    ## Limit to season in function list
    filter(season %in% seasons) %>%
    ## Display only players in player list
    mutate(PlayerListed = ifelse(Player %in% player_list,paste0(Player,substr(season,7,8)),NA)) %>%
    ## Plot
    ggplot(aes(y=Predicted.CS,x=reorder(ClusterName,-Predicted.CS), color=as.factor(DepthChart), shape=as.factor(Pos), 
               label=PlayerListed, color=as.factor(ClusterName)), size=20) +
    geom_point(size=5, alpha=0.5) +
    geom_label(color = 'grey50') +
    guides(colour = guide_legend(override.aes = list(size=8))) +
    labs(title = paste0(paste(player_list, sep=",", collapse=", "),"\nClusters and Estimated Ability by Season, data from www.crowdscoutsports.com"), y="Predicted CrowdScout Score (0-100)", shape="Position",color="Depth Chart",x="") +
    theme_standard() + 
    scale_color_discrete()

return(plot)
}

plot_players_cluster_function(c("ALEX OVECHKIN"), c("20142015","20152016","20162017"))

plot_players_cluster_function(c("MARK SCHEIFELE"))

plot_players_cluster_function(c("MIKAEL BACKLUND"))

9. Next steps

Clustering is fun but doesn’t really reveal too much on its own. However, this is a precursor to a Player Similarity Analysis I am completing (and will publish in a similar way, producing a Jupyter Notebook with commentary, instead of a traditional blog post). Knowing how a player is deployed and the type of results they generally create of course helps this exercise. This has more interesting applications like projecting future performance based on comparable players and their future performance.

Additional research can also use these player archetypes predict shift, game, or season-level results.