Clustering - Hierarchical and K-Means

1 Hierarchical Clustering of 82 American Baseball Players

1.1 Introduction

This study adopts Hierarchical and K-Means approaches to cluster 82 American baseball players who have been elected to the Hall of Fame. Because of the election process, Hall of Fame players may not always be chosen on the basis of their performance on specific metrics. A clustering examination may identify a common pattern among those elected, suggesting a standard practice, or may fail to identify one, suggesting a preferential or non-standard basis of election.

The Hierarchical Clustering approach is the first undertaken and uses five numerical variables: Career Hits (‘Hits’); Career Runs (‘Runs’); Career Home Runs (‘Home_runs’); Runs batted in (‘Rbi’); Career Stolen Bases (‘Stolen_bases’). The K-Means approach is undertaken in a later part of the study.

1.2 Scaling the Data

Because the variables differ markedly in their means and standard deviations, they must be converted to z-scores to place them on a comparable scale. With scaled z-scores, high and low performance can be compared across variables within a cluster and between clusters.
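The scaling step can be sketched as follows. This is a minimal illustration in Python, not the study's own code; the career totals are hypothetical values and the helper name `zscores` is an assumption. The sample standard deviation (n - 1 denominator) is used, which is also an assumption about how the study scaled its data.

```python
from statistics import mean, stdev

def zscores(values):
    """Scale a list of numbers to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator); assumed, not confirmed by the study
    return [(v - m) / s for v in values]

# Hypothetical career totals, not taken from the study's dataset.
hits = [3000.0, 2500.0, 2000.0, 1500.0]
home_runs = [500.0, 100.0, 300.0, 300.0]

scaled_hits = zscores(hits)
scaled_home_runs = zscores(home_runs)
```

After scaling, both variables have mean 0 and standard deviation 1, so a value of +1 means "one standard deviation above the average player" on either variable.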

The Categorical variable, ‘PlayerID’, was removed from the data. The Euclidean distance between each pair of baseball players was calculated.
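As an illustration of the pairwise distance step, the sketch below builds a symmetric Euclidean distance matrix in Python. The three "players" are hypothetical two-variable points chosen so the distances are easy to check by hand; the function name `distance_matrix` is an assumption.

```python
import math

def distance_matrix(rows):
    """Return the symmetric matrix of Euclidean distances between all pairs of rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # math.dist computes the Euclidean distance between two points
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d

# Hypothetical scaled scores for three players on two variables.
players = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]
dist = distance_matrix(players)
```

Here the first and second players are 5 units apart and the first and third are 10 units apart. With 82 players the same procedure yields an 82 x 82 matrix.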

1.3 Hierarchical Clustering - Dendrogram

1.4 Hierarchical Clustering - Heat Map

As shown in Diagram 1, there are light-coloured blocks around the diagonal descending in diminishing size from top left to bottom right. Consequently, we conjecture that there may be three or four weak clusters in the dataset.

Diagram 1. Hierarchical Clustering Heat Map

1.5 Four Cluster Solution and its Quality

Silhouette of 82 units in 4 clusters from silhouette.default(x = clustersb, dist = db) :
 Cluster sizes and average silhouette widths:
   17    25    30    10 
0.320 0.208 0.296 0.433 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.177   0.210   0.326   0.291   0.422   0.573 

As the output above shows, the clusters range in size from 10 to 30 players.

The Average Silhouette Width Score (-1.0 to +1.0) measures how closely individual players resemble the other members of their own cluster compared with the members of the nearest other cluster.
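For readers unfamiliar with the measure, the sketch below computes individual silhouette widths using the standard definition s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. It is a Python illustration on hypothetical points, not the routine that produced the output above; the function name `silhouette_widths` is an assumption.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of every point, given its cluster label."""
    widths = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster (or only one cluster): width taken as 0
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths

# Hypothetical points forming two tight, well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [1, 1, 2, 2]
widths = silhouette_widths(points, labels)
```

Widths near +1 indicate a point that sits well inside its own cluster, widths near 0 a point on the border between two clusters, and negative widths a point that is closer on average to another cluster than to its own.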

The clusters’ Average Silhouette Width Scores are 0.433 (10 players), 0.320 (17 players), 0.296 (30 players) and 0.208 (25 players). This suggests that three clusters have a ‘weak’ or possibly ‘artificial’ structure and one (Cluster 2, with 25 players) has ‘no substantial’ structure. There is no evidence of a ‘reasonable’ clustering structure (0.51 - 0.7) in any individual cluster.

The overall Mean Silhouette Width of 0.291 suggests that the overall clustering is ‘weak’ and possibly ‘artificial’.
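The interpretation applied above can be written as a small lookup. The text states only the ‘reasonable’ band (0.51 - 0.7); the other cut-offs used here (0.25 and 0.70) are an assumption based on the commonly quoted rule of thumb for silhouette widths, and the function name `interpret` is hypothetical. The widths are those reported in the output above.

```python
def interpret(width):
    """Map an average silhouette width to a verbal label (cut-offs partly assumed)."""
    if width > 0.70:
        return "strong"
    if width > 0.50:
        return "reasonable"
    if width > 0.25:
        return "weak, possibly artificial"
    return "no substantial structure"

# Average silhouette widths per hierarchical cluster, as reported above.
hierarchical = {1: 0.320, 2: 0.208, 3: 0.296, 4: 0.433}
verdicts = {cluster: interpret(w) for cluster, w in hierarchical.items()}
overall = interpret(0.291)
```

Under these assumed cut-offs, Clusters 1, 3 and 4 and the overall solution fall in the ‘weak’ band, and Cluster 2 falls in the ‘no substantial structure’ band, which matches the reading given in the text.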

1.6 Common Properties of Clusters

To identify properties of the clusters that may distinguish one from another, the mean of each scaled variable (z-score) is calculated for every cluster, as shown in Table 1.

Table 1. Clusters by Mean (z-Scores) of their Variables.

Cluster     hits     runs  home_runs      rbi  stolen_bases
C1         0.804    0.973      0.984    1.260        -0.329
C2         0.617    0.430     -0.963   -0.662         1.116
C3        -0.427   -0.396      0.456    0.222        -0.532
C4        -1.629   -1.541     -0.634   -1.153        -0.634
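A table of this kind is obtained by averaging each scaled variable within each cluster. The sketch below shows the calculation in Python on hypothetical z-scores for four players and two variables; it is not the study's code, and the function name `cluster_means` is an assumption.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(rows, labels):
    """Mean of every variable within each cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    # zip(*members) transposes the member rows into per-variable columns
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}

# Hypothetical z-scores (two variables) and cluster labels for four players.
rows = [[1.0, -1.0], [0.5, -0.5], [-1.0, 1.0], [-0.5, 0.5]]
labels = [1, 1, 2, 2]
means = cluster_means(rows, labels)
```

In this toy case cluster 1 averages 0.75 on the first variable and -0.75 on the second, and cluster 2 is the mirror image.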

Three clusters (1, 3 and 4) have variable means (as z-scores) that are relatively well separated, particularly for ‘hits’, ‘runs’, ‘home_runs’ and ‘rbi’, but not for ‘stolen_bases’, where they lie close to -0.5 (-0.329, -0.532 and -0.634), with two slightly below -0.5. Cluster 2, however, scores 1.116 on ‘stolen_bases’. In addition, the signs of the z-scores are not consistently positive or negative. This is more clearly seen in the line graph in Diagram 2.

Diagram 2. Baseball Player Clusters by Attributes

Cluster 1 performs best on all metrics except ‘Stolen Bases’, where it is second best. Cluster 4 performs weakest on all metrics except ‘Home Runs’, where it is second weakest. Cluster 3 is second weakest on ‘Hits’ and ‘Runs’ and second best on ‘Home Runs’ and ‘Runs Batted In’ (Rbi). Clusters 1, 3 and 4 display a similar pattern of rising scores from ‘Hits’ through ‘Runs’ to ‘Home Runs’. Cluster 1 continues to rise on ‘Rbi’, while the other two decline. On ‘Stolen Bases’, Cluster 4 rises again, while Clusters 1 and 3 fall to their lowest scores on any metric.

Cluster 2 is at odds with the patterns of Clusters 1, 3 and 4. While it has the second-highest scores on ‘Hits’ and ‘Runs’, it has the weakest score on ‘Home Runs’ and the second-weakest score on ‘Rbi’. Yet it has the highest score of all on ‘Stolen Bases’.
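The rankings described above can be checked directly from Table 1. The sketch below, in Python and for illustration only, orders the clusters from best to weakest on each variable; the variable and function names are assumptions, while the values are those of Table 1.

```python
# Cluster means (z-scores) from Table 1.
variables = ["hits", "runs", "home_runs", "rbi", "stolen_bases"]
table1 = {
    "C1": [0.804, 0.973, 0.984, 1.260, -0.329],
    "C2": [0.617, 0.430, -0.963, -0.662, 1.116],
    "C3": [-0.427, -0.396, 0.456, 0.222, -0.532],
    "C4": [-1.629, -1.541, -0.634, -1.153, -0.634],
}

def ranking(table, column):
    """Cluster names ordered from highest to lowest mean on one variable."""
    return sorted(table, key=lambda c: table[c][column], reverse=True)

ranks = {name: ranking(table1, i) for i, name in enumerate(variables)}
```

The resulting orderings agree with the text: Cluster 1 leads on every variable except ‘stolen_bases’, where Cluster 2 is first, and Cluster 2 is last on ‘home_runs’.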

It is important to note that the clusters in the legend are ordered 1, 2, 3, 4 and coloured Red, Blue, Green and Purple respectively, because the same colours denote different clusters in the K-Means clustering approach.

1.7 Conclusion

While the quality of clustering, assessed by silhouette means, suggested ‘weak’ and ‘possibly artificial’ clustering, the line graph suggests some common patterns of scores on ‘Hits’, ‘Runs’ and ‘Home_Runs’, with a convergence towards a score of about -0.5 on ‘Stolen Bases’ among Clusters 1, 3 and 4. These three clusters are well separated and well discriminated, as their lines do not cross. However, Cluster 2 differs from these three in both its z-scores and its pattern, as it swings from worst to best on certain variables and crosses the other three clusters.

2 K-Means Clustering of 82 American Baseball Players

2.1 Introduction

This study also adopts a K-Means approach to clustering the 82 American baseball players drawing on the same variables: Career Hits (‘Hits’); Career Runs (‘Runs’); Career Home Runs (‘Home_runs’); Runs batted in (‘Rbi’); Career Stolen Bases (‘Stolen_bases’).

2.2 Scaling the Data

Because the variables differ markedly in their means and standard deviations, they must be placed on a comparable scale. Consequently, the data was scaled to z-scores and ‘PlayerID’ was removed. The Euclidean distance between each pair of baseball players was calculated.

2.3 Four Clusters, Quality and Profile

For reproducibility, set.seed(101) was run first. Four clusters were then extracted using k-means on the scaled data in order to derive the variable means (z-scores) per cluster. The quality of the clusters and of the overall clustering structure is shown below.
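To make the procedure concrete, the sketch below implements a basic k-means iteration (assign each point to its nearest centroid, then recompute the centroids) with a seeded random start. It is a Python illustration on hypothetical points, and it is an assumption that this simple variant resembles the routine the study used; the same seed in a different language or algorithm will not reproduce the clusters reported here. The function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=101, iterations=100):
    """Basic k-means: returns (labels, centroids) for a list of points."""
    rng = random.Random(seed)          # seeded generator makes the run repeatable
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, 2)
```

Because the starting centroids are chosen at random, fixing the seed is what allows the same cluster labels to be obtained on every run. Cluster numbers themselves are arbitrary, which is why the numbering differs between the two analyses in this study.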

Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeansb$cluster, dist = dkmeans) :
 Cluster sizes and average silhouette widths:
   19    22    13    28 
0.269 0.218 0.342 0.332 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.063   0.184   0.290   0.288   0.419   0.550 

Clusters range in size from 28 to 13 players.

The quality of the individual clusters is reflected in their average silhouette widths: 0.342 (13 players), 0.332 (28 players), 0.269 (19 players) and 0.218 (22 players). Cluster 2, with 22 players, has the weakest Silhouette Width Score. The cluster silhouette widths suggest that three clusters have a ‘weak’ or ‘artificial’ structure and one (Cluster 2) has ‘no substantial’ structure. There is no evidence of a ‘reasonable’ clustering structure (0.51 - 0.7).

The overall Mean Silhouette Width of 0.288 suggests that the overall clustering structure is weak and possibly artificial.

The number of baseball players in each cluster and the cluster centroids per variable, are presented in Table 2.

Table 2. Clusters by Mean (z-Scores) of their Variables.

Cluster  Size     Hits     Runs  Home_Runs      Rbi  Stolen_Bases
C1         28   -0.204   -0.162      0.566    0.380        -0.485
C2         22    0.752    0.554     -0.980   -0.581         1.210
C3         19   -1.151   -1.201     -0.570   -0.873        -0.465
C4         13    0.849    1.168      1.273    1.441        -0.324

As noted, the clusters range in size from 28 to 13 players.

Clusters 1 and 2 differ appreciably on all metrics, with Cluster 2 showing a very high result (mean z-score = 1.210) on ‘Stolen_Bases’, which is even higher than that noted in Table 1 (mean z-score = 1.116). Clusters 1 and 3 differ appreciably on ‘Hits’, ‘Runs’, ‘Home Runs’ and ‘Rbi’ but are similar on ‘Stolen Bases’. Clusters 1 and 4 differ appreciably on all metrics except ‘Stolen Bases’, where they are quite similar. Clusters 2 and 3 differ appreciably on all metrics. Clusters 2 and 4 differ appreciably on all metrics except ‘Hits’. Clusters 3 and 4 differ appreciably on all metrics except ‘Stolen Bases’.

As in the Hierarchical clustering, three clusters converge on a similar score for ‘Stolen Bases’, although here all three lie above -0.5 (-0.485, -0.465 and -0.324). Cluster 2 shows a different pattern from the other three, as was also noted in the Hierarchical study. These patterns are more clearly seen in Diagram 3.

Diagram 3. Baseball Player Clusters by Attributes

2.4 Conclusion

In the Hierarchical approach, Clusters 1, 2, 3 and 4 were represented by Red, Blue, Green and Purple. Matching clusters by their profiles, the corresponding clusters in the K-Means analysis are Clusters 4, 2, 1 and 3 respectively.
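One way to check this correspondence numerically is to pair each Hierarchical cluster with the K-Means cluster whose centroid in Table 2 lies closest to its means in Table 1. Nearest-centroid matching is an illustrative technique introduced here, not the method used in the study, and the Python names below are assumptions; the values are those of Tables 1 and 2.

```python
import math

# Cluster means (z-scores): Table 1 (Hierarchical) and Table 2 (K-Means).
hierarchical = {
    "C1": [0.804, 0.973, 0.984, 1.260, -0.329],
    "C2": [0.617, 0.430, -0.963, -0.662, 1.116],
    "C3": [-0.427, -0.396, 0.456, 0.222, -0.532],
    "C4": [-1.629, -1.541, -0.634, -1.153, -0.634],
}
k_means = {
    "C1": [-0.204, -0.162, 0.566, 0.380, -0.485],
    "C2": [0.752, 0.554, -0.980, -0.581, 1.210],
    "C3": [-1.151, -1.201, -0.570, -0.873, -0.465],
    "C4": [0.849, 1.168, 1.273, 1.441, -0.324],
}

def nearest(profile, candidates):
    """Name of the candidate cluster whose centroid is closest to the profile."""
    return min(candidates, key=lambda name: math.dist(profile, candidates[name]))

# Hierarchical cluster -> closest K-Means cluster
mapping = {name: nearest(profile, k_means) for name, profile in hierarchical.items()}
```

The resulting pairs are Hierarchical 1 with K-Means 4, 2 with 2, 3 with 1 and 4 with 3, which agrees with the correspondence stated in the text.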

As in the Hierarchical analysis, the quality of clustering, assessed by silhouette means, suggested ‘weak’ and ‘possibly artificial’ clustering, while the line graph suggested some common patterns of scores on ‘Hits’, ‘Runs’ and ‘Home_Runs’, with a convergence towards a score of about -0.5 on ‘Stolen Bases’ among Clusters 4, 1 and 3. These three clusters are well separated and well discriminated, as their lines do not cross. However, Cluster 2 differs from these three in both its z-scores and its pattern, as it swings from worst to best on certain variables and crosses the other three clusters.

2.5 Overall Conclusion

Any comparison of individual clusters needs to recognise that the two analyses number and colour the clusters differently. Consequently, Hierarchical Clusters 1, 2, 3 and 4 need to be compared with K-Means Clusters 4, 2, 1 and 3 respectively. Only Cluster 2 has the same number in both analyses.

Compared with the K-Means Clustering algorithm, the Hierarchical Clustering algorithm had a slightly larger mean individual silhouette width (0.291 compared with 0.288), suggesting slightly better clustering quality, albeit still ‘weak’ if not ‘artificial’.
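The two overall means can be recovered, to rounding, from the cluster sizes and average widths in the silhouette outputs, since the overall mean is the size-weighted average of the cluster averages. The Python sketch below is a consistency check only; the function name `overall_width` is an assumption.

```python
def overall_width(sizes, widths):
    """Size-weighted mean of the per-cluster average silhouette widths."""
    return sum(n * w for n, w in zip(sizes, widths)) / sum(sizes)

# Sizes and average widths as listed in the two silhouette outputs.
hierarchical = overall_width([17, 25, 30, 10], [0.320, 0.208, 0.296, 0.433])
k_means = overall_width([19, 22, 13, 28], [0.269, 0.218, 0.342, 0.332])
difference = hierarchical - k_means
```

Rounded to three decimals, the results are 0.291 and 0.288, matching the reported means. The gap between the two algorithms is therefore about 0.003 on a scale that runs from -1 to +1.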

However, in terms of the profile of the clusters produced, three K-Means clusters, namely Cluster 4 (0.342, 13 players), Cluster 2 (0.218, 22 players) and Cluster 1 (0.332, 28 players), had larger Average Silhouette Widths than their comparators in the Hierarchical analysis, namely Cluster 1 (0.320, 17 players), Cluster 2 (0.208, 25 players) and Cluster 3 (0.296, 30 players). Only K-Means Cluster 3 (0.269, 19 players) had a smaller Average Silhouette Width than its Hierarchical comparator, Cluster 4 (0.433, 10 players).
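The paired comparison can be laid out as a short calculation. The Python sketch below uses the widths and the cluster pairing given in the text; the variable names are assumptions.

```python
# (Hierarchical cluster, width) paired with (K-Means cluster, width), as in the text.
pairs = [
    (("H1", 0.320), ("K4", 0.342)),
    (("H2", 0.208), ("K2", 0.218)),
    (("H3", 0.296), ("K1", 0.332)),
    (("H4", 0.433), ("K3", 0.269)),
]

# K-Means width minus Hierarchical width for each pair, rounded to 3 decimals.
differences = {h: round(kw - hw, 3) for (h, hw), (k, kw) in pairs}
k_means_better = [h for h, d in differences.items() if d > 0]
```

K-Means is ahead in three pairs by small margins (0.010 to 0.036), while Hierarchical is ahead in the fourth pair by a much larger margin (0.164), which is why the evidence is described as mixed.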

There is mixed evidence as to which algorithm provided the better quality of clustering, and in both cases the evidence points to a ‘weak’ if not ‘artificial’ structure. It is also noteworthy that the comparator clusters differ in size. With the K-Means cluster reported first, the pairs contain 13 and 17 players, 22 and 25 players, 28 and 30 players, and 19 and 10 players.

It is also possible that Cluster 2, which contrasted with all other clusters in both analyses, represents players who catch the eye of electors because of their performance on ‘stolen bases’, while displaying no other ‘stand-out’ features. Such players capture the public eye because they ‘steal a base’, to which they are perceived to be ‘un-entitled’ and which results from ‘taking a gamble’ as they exploit the pitcher’s or catcher’s slow or inaccurate response. Spectators and commentators are generally loud in their approval of such a ‘steal’ and such a ‘gamble’. This approval may influence electors.

Further study could examine both the quality of clustering when such players are removed from the sample and the persistence of players in their comparator clusters. Both questions are beyond the scope of the current study.