Part B – Clustering Soccer Players

Author

Dylan Lynch

1. Importing the Dataset

The FIFA dataset contains information on 1,000 soccer players, including personal details such as name, age, nationality, club, value and wage, along with 34 performance attributes scored from 1 to 100.

For this analysis, we’re focusing on six specific attributes that capture a player’s speed and technical ability:

  • Acceleration
  • Ball control
  • Dribbling
  • Shot power
  • Short passing
  • Sprint speed

2. Should the Data Be Scaled?

Before running any clustering algorithm, it’s worth considering whether the data needs to be standardised.

Summary Statistics for the Six Attributes
Statistic  acceleration  ball_control  dribbling  shot_power  short_passing  sprint_speed
Min.              26.00         12.00      10.00       12.00          15.00         28.00
1st Qu.           62.00         69.00      61.00       65.00          70.00         63.75
Median            72.00         78.00      75.00       75.00          76.00         72.00
Mean              69.39         71.47      67.17       68.62          71.82         69.91
3rd Qu.           79.00         82.00      81.00       80.00          80.00         79.00
Max.              96.00         95.00      97.00       94.00          92.00         96.00

Looking at the summary statistics, all six variables are measured on the same 1-100 scale, which might suggest scaling isn’t necessary. However, scaling is still a good idea for a couple of reasons.

Firstly, hierarchical clustering relies on calculating distances between players. Even though the variables share the same scale, any differences in how spread out the values are (their variance) can cause some attributes to have more influence on the distance calculations than others. A variable with a wider spread will naturally dominate.

Secondly, K-means clustering works by minimising the variance within clusters. Again, variables with larger variance will carry more weight in determining cluster membership.

By scaling the data (converting each variable to have a mean of 0 and standard deviation of 1), we ensure that all six attributes contribute equally to the analysis. This gives us a fairer and more balanced clustering result.
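As a rough illustration of what scaling does, here is a minimal Python sketch. It is not the code used for this report; the function name standardise is an assumption, and the quartile values from the summary table are treated as five hypothetical players.

```python
from statistics import mean, stdev

def standardise(column):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample SD (n - 1 denominator)
    return [(x - m) / s for x in column]

# Min, quartiles and max from the summary table, used as stand-in players.
acceleration = [26, 62, 72, 79, 96]
dribbling = [10, 61, 75, 81, 97]

# Before scaling, dribbling is more spread out than acceleration.
print(round(stdev(acceleration), 1), round(stdev(dribbling), 1))

z_acc = standardise(acceleration)
z_dri = standardise(dribbling)

# After scaling, both columns have mean 0 and standard deviation 1.
print(round(mean(z_acc), 6), round(stdev(z_acc), 6))
print(round(mean(z_dri), 6), round(stdev(z_dri), 6))
```

After standardising, a one-unit difference means "one standard deviation" for every attribute, so no single attribute dominates the distance calculations.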


3. Hierarchical Clustering

3a. Creating the Distance Matrix

The first step in hierarchical clustering is to calculate how similar (or different) each player is to every other player. We do this by computing a distance matrix using Euclidean distance, which essentially measures the straight-line distance between players in our six-dimensional attribute space.
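To make the idea concrete, the sketch below computes a small Euclidean distance matrix in Python. It is only an illustration under assumed inputs: the three players and their scaled scores are hypothetical, and the helper names are not from the report.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two players' attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(rows):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical scaled scores on the six attributes for three players.
players = [
    [1.2, 0.8, 1.0, 0.5, 0.4, 1.1],
    [1.0, 0.9, 1.1, 0.6, 0.5, 1.0],
    [-1.5, -2.4, -2.6, -2.2, -2.0, -1.4],
]

d = distance_matrix(players)
for row in d:
    print([round(x, 2) for x in row])
```

The first two players have similar profiles, so their distance is small; the third is far from both.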

3b. Performing the Clustering

With the distance matrix ready, we can now run the hierarchical clustering algorithm. I’ve used Ward’s method here, which tends to create compact, similarly-sized clusters by minimising the total within-cluster variance at each step.
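The quantity Ward’s method minimises at each step is the increase in total within-cluster sum of squares caused by a merge. The Python sketch below is an illustration with made-up two-dimensional points, not the report’s code; the function names are assumptions.

```python
def centroid(points):
    """Mean position of a group of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def within_ss(points):
    """Sum of squared distances from each point to the group centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

# Hypothetical clusters.
a = [[0.0, 0.0], [1.0, 0.0]]
b = [[4.0, 0.0], [5.0, 0.0]]
c = [[1.0, 1.0]]

# The shortcut formula agrees with the direct before/after comparison.
print(ward_cost(a, b), within_ss(a + b) - within_ss(a) - within_ss(b))

# Ward's method would merge a with c before a with b, because it costs less.
print(ward_cost(a, c) < ward_cost(a, b))
```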

3c. Visualising the Results

Dendrogram

A dendrogram is essentially a tree diagram that shows how players are progressively merged into clusters. The height at which branches join indicates how different those groups are from each other.

The coloured rectangles show where we’ve cut the tree to create four clusters. You can see there are clear groupings, with some large clusters and some smaller, more distinct ones.

Heatmap

A heatmap provides another way to visualise the clustering structure. Each row represents a player, each column represents an attribute, and the colours show whether scores are high (warmer colours) or low (cooler colours).

Does the heatmap provide evidence of clustering structure?

Yes, quite clearly. Looking at the heatmap, you can see distinct horizontal bands of colour running across the rows. This tells us that groups of players share similar patterns across all six attributes. For instance, there’s a band of players who score highly on everything (lots of red/orange), another group who are more moderate across the board, and some players who sit at the lower end of the scale. These visible patterns confirm that there’s genuine structure in the data that clustering can capture.

3d. Creating a 4-Cluster Solution

Now let’s formally create four clusters by cutting the dendrogram at the appropriate height.
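Cutting the tree into k clusters gives the same partition as stopping the merging process once k clusters remain. The Python sketch below shows this with a deliberately naive agglomerative loop using Ward’s merge cost. It is an illustration only: the function names and the six toy points are assumptions, and a real analysis would use a library routine rather than this loop.

```python
def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if groups a and b merge."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def agglomerate(points, k):
    """Merge the cheapest pair of clusters repeatedly until k remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(
            [points[m] for m in clusters[ij[0]]],
            [points[m] for m in clusters[ij[1]]],
        ))
        merged = clusters[i] + clusters[j]
        del clusters[j]          # j > i, so index i is unaffected
        clusters[i] = merged
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for m in members:
            labels[m] = label
    return labels

# Hypothetical points forming three tight pairs.
points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels = agglomerate(points, 3)
print(labels)
```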

Cluster Sizes:

Cluster  Number of Players
1        446
2        222
3        107
4        225

Assessing Cluster Quality

To evaluate how well-defined our clusters are, we can calculate silhouette scores. The silhouette score for each player measures how similar they are to others in their own cluster compared to players in the nearest neighbouring cluster. Scores range from -1 to 1, where:

  • Values close to 1 mean the player fits their cluster well
  • Values around 0 suggest the player sits on the boundary between clusters
  • Negative values indicate the player might be in the wrong cluster

Average Silhouette Width: 0.295

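As a rough guide, an average silhouette width above 0.5 indicates good clustering structure, 0.25-0.5 suggests moderate structure, and below 0.25 means the clusters may be somewhat artificial. At 0.295, the four-cluster hierarchical solution sits towards the lower end of the moderate range: the clusters are meaningful, but a fair number of players lie close to the boundary between groups.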
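The silhouette calculation described above can be sketched as follows. This is a minimal Python illustration, not the report’s code: the function name silhouette_widths and the four toy points are assumptions, and it expects at least two clusters and no duplicate points.

```python
import math

def silhouette_widths(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b)."""
    widths = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:  # singleton cluster: width is defined as 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(v) / len(v)  # mean distance to nearest other cluster
                for k, v in dists.items() if k != labels[i])
        widths.append((b - a) / max(a, b))
    return widths

# Hypothetical points in two well-separated groups.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]

good = silhouette_widths(points, [1, 1, 2, 2])
bad = silhouette_widths(points, [1, 2, 2, 2])  # second point mislabelled

print(round(sum(good) / len(good), 3))  # average silhouette width
print([round(s, 3) for s in bad])       # the mislabelled point goes negative
```

Averaging the per-player widths gives the single summary figure reported for each clustering solution.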

3e. Profiling the Clusters

Now comes the interesting part — understanding what makes each cluster different.

i. How Do Clusters Differ on the Six Performance Attributes?

Average Attribute Scores by Cluster (Hierarchical)
hc_cluster  Acceleration  Ball_Control  Dribbling  Shot_Power  Short_Passing  Sprint_Speed  Count
1                   81.4          80.6       80.1        77.1           77.1          81.1    446
2                   67.1          80.8       76.5        77.4           81.4          65.8    222
3                   48.5          23.7       16.1        25.1           33.0          49.2    107
4                   57.9          66.9       56.6        63.8           70.4          61.6    225

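Looking at these averages, we can start to characterise each cluster:

  • Cluster 1 (446 players) is quick and technically strong, with the highest acceleration (81.4), sprint speed (81.1) and dribbling (80.1) and high scores on everything else. These are likely the top-tier attacking players in the dataset.
  • Cluster 2 (222 players) matches Cluster 1 on ball control (80.8) and has the best short passing (81.4), but is noticeably slower (acceleration 67.1, sprint speed 65.8). These are technically strong players who rely less on pace.
  • Cluster 4 (225 players) is moderate throughout, with scores mostly between 57 and 70. These players are competent passers (70.4) but weaker on dribbling (56.6) and acceleration (57.9).
  • Cluster 3 (107 players) scores very low on the technical attributes (dribbling 16.1, ball control 23.7, shot power 25.1) and below average on pace. This profile may correspond to goalkeepers, although playing position was not checked here.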
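A profile table of this kind is simply the mean of each attribute within each cluster, computed on the original (unscaled) scores. The Python sketch below illustrates the idea with hypothetical ratings; the function name cluster_means and the data are assumptions, not the report’s code.

```python
from collections import defaultdict

def cluster_means(rows, labels):
    """Mean of every column within each cluster, rounded to 1 decimal."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [round(sum(col) / len(col), 1) for col in zip(*members)]
        for label, members in sorted(groups.items())
    }

# Hypothetical unscaled ratings: (acceleration, ball control) for four players.
rows = [[80, 82], [84, 78], [50, 24], [46, 22]]
labels = [1, 1, 3, 3]

profile = cluster_means(rows, labels)
print(profile)
```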

ii. How Do Clusters Differ on Age, Value, and Wage?

Age, Value and Wage by Cluster (Hierarchical)
hc_cluster  Mean_Age  Mean_Value  Mean_Wage
1               26.3    20885202    77580.7
2               27.8    18587838    73563.1
3               29.1    14350935    51766.4
4               28.0    13142222    57151.1

These results largely make intuitive sense. The two strongest clusters (1 and 2) have the highest mean market values (about 20.9 million and 18.6 million) and wages (about 77,600 and 73,600), which is what we’d expect: better players command higher prices. The ordering is not perfectly consistent, though. Cluster 3, with the lowest attribute scores, has a higher mean value than Cluster 4 (14.4 million against 13.1 million) but a lower mean wage (about 51,800 against 57,200). Age moves in the opposite direction to performance: Cluster 1 is the youngest on average (26.3 years) and Cluster 3 the oldest (29.1 years).


4. K-means Clustering

4a. Performing K-means Clustering

K-means takes a different approach to clustering. Rather than building a hierarchy, it starts by randomly placing four cluster centres in the data, then iteratively assigns each player to the nearest centre and updates the centres based on the players assigned to them. This process repeats until the clusters stabilise.

We use set.seed(101) to ensure the results are reproducible, since K-means involves random initialisation.
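The assign-and-update loop described above can be sketched in a few lines of Python. This is an illustration of the basic procedure only, not the report’s code: library implementations often use refinements of it, the function name kmeans and the toy points are assumptions, and seeding Python’s generator with 101 mirrors the idea behind set.seed(101) without reproducing the same random draws.

```python
import math
import random

def kmeans(points, k, seed=101, max_iter=100):
    """Basic K-means: random starting centres, then assign and update."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:                   # nothing moved: converged
            break
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                            # keep old centre if empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Hypothetical points forming two obvious groups.
points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels, centres = kmeans(points, k=2)
print(labels)
```

Running the function twice with the same seed gives identical labels, which is the point of fixing the seed.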

Cluster Sizes:

Cluster  Number of Players
1        127
2        381
3        384
4        108

4b. Assessing Cluster Quality

Average Silhouette Width: 0.331

This score tells us how cohesive and well-separated the K-means clusters are. At 0.331, it falls in the moderate range (0.25-0.5) and is slightly higher than the 0.295 obtained from hierarchical clustering.

4c. Profiling the K-means Clusters

i. How Do Clusters Differ on the Six Performance Attributes?

Average Attribute Scores by Cluster (K-means)
km_cluster  Acceleration  Ball_Control  Dribbling  Shot_Power  Short_Passing  Sprint_Speed  Count
1                   49.6          66.9       55.0        63.3           71.0          52.3    127
2                   68.2          76.5       71.3        73.9           77.1          69.0    381
3                   82.9          81.4       81.3        77.3           77.7          82.4    384
4                   48.7          23.9       16.3        25.3           33.1          49.4    108

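As with the hierarchical results, the clusters are clearly differentiated. Cluster 3 (384 players) has the highest pace and dribbling scores, Cluster 2 (381) is solid across all six attributes, Cluster 1 (127) combines slow pace (acceleration 49.6) with reasonable short passing (71.0), and Cluster 4 (108) has very low technical scores that are almost identical to those of hierarchical Cluster 3.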

ii. How Do Clusters Differ on Age, Value, and Wage?

Age, Value and Wage by Cluster (K-means)
km_cluster  Mean_Age  Mean_Value  Mean_Wage
1               29.1    10957480    49448.8
2               27.4    16885564    69007.9
3               26.1    22302865    81153.6
4               29.1    14301389    51805.6

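Again, the quickest and most technical group (Cluster 3) has the highest mean value and wage, followed by Cluster 2. The one exception is at the bottom: Cluster 4, the weakest on the six attributes, has a higher mean value than Cluster 1 (14.3 million against 11.0 million). Both clustering methods show the same broad pattern, which gives us some confidence that we’re capturing real structure in the data.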


5. Comparing Hierarchical and K-means Clustering

5a. Which Algorithm Produced Higher Quality Clusters?

Comparison of Clustering Quality
Method                 Average_Silhouette
Hierarchical (Ward’s)  0.295
K-means                0.331

Based on the silhouette scores, K-means clustering produced slightly better-defined clusters (0.331 against 0.295). The higher silhouette width indicates that players within each cluster are more similar to each other and more distinct from players in other clusters.

That said, the difference of 0.036 is small, and both values fall in the moderate range. Both approaches are capturing broadly the same underlying structure in the data.

5b. Do Both Algorithms Produce Similar Cluster Profiles?

To understand how the two methods compare in terms of which players they group together, we can look at a cross-tabulation:

Cross-tabulation of Cluster Assignments
hc_cluster \ km_cluster    1    2    3    4
1                          0   77  369    0
2                         12  195   15    0
3                          0    0    0  107
4                        115  109    0    1

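This table shows how players assigned to each hierarchical cluster (rows) are distributed across the K-means clusters (columns). Because cluster numbers are arbitrary labels, agreement shows up as one dominant cell per row rather than as large numbers on the diagonal. Hierarchical Cluster 1 maps mainly to K-means Cluster 3 (369 of 446 players), hierarchical Cluster 2 to K-means Cluster 2 (195 of 222), and hierarchical Cluster 3 entirely to K-means Cluster 4 (107 of 107). Hierarchical Cluster 4 is the exception: it is split almost evenly between K-means Clusters 1 (115) and 2 (109).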
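A cross-tabulation is just a count of how often each pair of labels occurs together. The Python sketch below builds one from two hypothetical label vectors, then uses the counts reported in the table above to work out what share of each hierarchical cluster falls into a single K-means cluster. The function name crosstab and the short label vectors are assumptions for illustration.

```python
from collections import Counter

def crosstab(row_labels, col_labels):
    """Count co-occurrences of row and column labels as a nested list."""
    counts = Counter(zip(row_labels, col_labels))
    rows = sorted(set(row_labels))
    cols = sorted(set(col_labels))
    return [[counts[(r, c)] for c in cols] for r in rows]

# Hypothetical cluster labels for eight players.
hc = [1, 1, 1, 2, 2, 3, 4, 4]
km = [3, 3, 2, 2, 2, 4, 1, 2]
table = crosstab(hc, km)
print(table)

# Counts copied from the cross-tabulation above
# (rows: hierarchical clusters 1-4, columns: K-means clusters 1-4).
reported = [[0, 77, 369, 0], [12, 195, 15, 0], [0, 0, 0, 107], [115, 109, 0, 1]]

# Share of each hierarchical cluster that lands in its largest K-means cell.
dominant_share = [max(row) / sum(row) for row in reported]
print([round(s, 3) for s in dominant_share])
```

A share close to 1 means the two methods agree on that group; a share near 0.5 means the hierarchical cluster is split between two K-means clusters.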

Key Observations:

  1. Similar overall patterns: Both methods successfully identify the same general player types — there are elite players who score highly across all attributes, average players in the middle, and weaker players at the lower end. This consistency is reassuring and suggests the clustering structure is genuine rather than an artefact of the algorithm.

  2. Differences in boundaries: While the broad groups are similar, the exact boundaries between clusters differ. This is expected because the algorithms work in fundamentally different ways. Hierarchical clustering builds a tree by progressively merging the most similar players, whereas K-means tries to find spherical clusters that minimise within-cluster variance.

  3. Cluster sizes: The cluster sizes differ between methods. Hierarchical clustering gives one large cluster of 446 players and three smaller ones (222, 107 and 225), whereas K-means gives two large clusters (381 and 384) and two small ones (127 and 108). Neither solution is evenly sized, and both methods isolate essentially the same low-scoring group of about 107 players.

  4. Practical implications: For most practical purposes, either method would give useful results. The choice between them often comes down to whether you need a fixed number of clusters (K-means) or want to explore the hierarchical structure of your data (hierarchical clustering).

In summary, both algorithms tell a similar story about the FIFA players in this dataset: there are distinct groups based on their speed and technical abilities, and these groups correspond meaningfully to player value and wages in the real world.