| acceleration | ball_control | dribbling | shot_power | short_passing | sprint_speed |
|---|---|---|---|---|---|
| Min. :26.00 | Min. :12.00 | Min. :10.00 | Min. :12.00 | Min. :15.00 | Min. :28.00 |
| 1st Qu.:62.00 | 1st Qu.:69.00 | 1st Qu.:61.00 | 1st Qu.:65.00 | 1st Qu.:70.00 | 1st Qu.:63.75 |
| Median :72.00 | Median :78.00 | Median :75.00 | Median :75.00 | Median :76.00 | Median :72.00 |
| Mean :69.39 | Mean :71.47 | Mean :67.17 | Mean :68.62 | Mean :71.82 | Mean :69.91 |
| 3rd Qu.:79.00 | 3rd Qu.:82.00 | 3rd Qu.:81.00 | 3rd Qu.:80.00 | 3rd Qu.:80.00 | 3rd Qu.:79.00 |
| Max. :96.00 | Max. :95.00 | Max. :97.00 | Max. :94.00 | Max. :92.00 | Max. :96.00 |
Part B – Clustering Soccer Players
1. Importing the Dataset
The FIFA dataset contains information on 1,000 soccer players, including personal details such as name, age, nationality and club, financial details (market value and wage), and 34 performance attributes scored from 1 to 100.
For this analysis, we’re focusing on six specific attributes that capture a player’s speed and technical ability:
- Acceleration
- Ball control
- Dribbling
- Shot power
- Short passing
- Sprint speed
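The code below is a minimal sketch of this step; the file name (`fifa.csv`) is an assumption, while the attribute column names follow the summary table above:

```r
# Load the player data (file name is an assumption)
fifa <- read.csv("fifa.csv")

# Keep the six speed/technical attributes used in this analysis
attrs <- c("acceleration", "ball_control", "dribbling",
           "shot_power", "short_passing", "sprint_speed")
fifa_attrs <- fifa[, attrs]

summary(fifa_attrs)
```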
2. Should the Data Be Scaled?
Before running any clustering algorithm, it’s worth considering whether the data needs to be standardised.
Looking at the summary statistics, all six variables are measured on the same 1-100 scale, which might suggest scaling isn’t necessary. However, scaling is still a good idea for a couple of reasons.
Firstly, hierarchical clustering relies on calculating distances between players. Even though the variables share the same scale, any differences in how spread out the values are (their variance) can cause some attributes to have more influence on the distance calculations than others. A variable with a wider spread will naturally dominate.
Secondly, K-means clustering works by minimising the variance within clusters. Again, variables with larger variance will carry more weight in determining cluster membership.
By scaling the data (converting each variable to have a mean of 0 and standard deviation of 1), we ensure that all six attributes contribute equally to the analysis. This gives us a fairer and more balanced clustering result.
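In R, `scale()` performs exactly this transformation; a quick sketch using the `fifa_attrs` object from the import step:

```r
# Standardise each attribute to mean 0 and standard deviation 1
fifa_scaled <- scale(fifa_attrs)

# Sanity check: column means should be ~0 and standard deviations 1
round(colMeans(fifa_scaled), 3)
round(apply(fifa_scaled, 2, sd), 3)
```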
3. Hierarchical Clustering
3a. Creating the Distance Matrix
The first step in hierarchical clustering is to calculate how similar (or different) each player is to every other player. We do this by computing a distance matrix using Euclidean distance, which essentially measures the straight-line distance between players in our six-dimensional attribute space.
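With the scaled data in hand, `dist()` computes all pairwise distances:

```r
# Pairwise Euclidean distances between the 1,000 players
# in the six-dimensional (scaled) attribute space
d <- dist(fifa_scaled, method = "euclidean")
```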
3b. Performing the Clustering
With the distance matrix ready, we can now run the hierarchical clustering algorithm. I’ve used Ward’s method here, which tends to create compact, similarly-sized clusters by minimising the total within-cluster variance at each step.
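A sketch of the call, assuming the `ward.D2` variant (the form of Ward's method intended for unsquared Euclidean distances):

```r
# Agglomerative clustering with Ward's method: at each step,
# merge the two clusters whose union gives the smallest
# increase in total within-cluster variance
hc <- hclust(d, method = "ward.D2")
```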
3c. Visualising the Results
Dendrogram
A dendrogram is essentially a tree diagram that shows how players are progressively merged into clusters. The height at which branches join indicates how different those groups are from each other.
The coloured rectangles show where we’ve cut the tree to create four clusters. You can see there are clear groupings, with some large clusters and some smaller, more distinct ones.
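Roughly how the figure can be produced in base R (the border colours are an assumption):

```r
# Dendrogram of the Ward's clustering, with the 4-cluster
# cut highlighted by coloured rectangles
plot(hc, labels = FALSE, hang = -1, main = "Ward's method dendrogram")
rect.hclust(hc, k = 4, border = 2:5)
```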
Heatmap
A heatmap provides another way to visualise the clustering structure. Each row represents a player, each column represents an attribute, and the colours show whether scores are high (warmer colours) or low (cooler colours).
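One way to draw such a heatmap in base R, reusing the dendrogram so that rows are ordered by cluster; because the data are already standardised, the function's own row-scaling is turned off:

```r
# Players (rows) ordered by the hierarchical clustering;
# columns kept in their original order for easier comparison
heatmap(fifa_scaled,
        Rowv = as.dendrogram(hc), Colv = NA,
        scale = "none", labRow = NA)
```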
Does the heatmap provide evidence of clustering structure?
Yes, quite clearly. Looking at the heatmap, you can see distinct horizontal bands of colour running across the rows. This tells us that groups of players share similar patterns across all six attributes. For instance, there’s a band of players who score highly on everything (lots of red/orange), another group who are more moderate across the board, and some players who sit at the lower end of the scale. These visible patterns confirm that there’s genuine structure in the data that clustering can capture.
3d. Creating a 4-Cluster Solution
Now let’s formally create four clusters by cutting the dendrogram at the appropriate height.
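This is `cutree()` in R, followed by a count of the resulting group sizes:

```r
# Cut the dendrogram into 4 clusters and tabulate the sizes
hc_cluster <- cutree(hc, k = 4)
table(hc_cluster)
```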
Cluster Sizes:
| Cluster | Number of Players |
|---|---|
| 1 | 446 |
| 2 | 222 |
| 3 | 107 |
| 4 | 225 |
Assessing Cluster Quality
To evaluate how well-defined our clusters are, we can calculate silhouette scores. The silhouette score for each player measures how similar they are to others in their own cluster compared to players in the nearest neighbouring cluster. Scores range from -1 to 1, where:
- Values close to 1 mean the player fits their cluster well
- Values around 0 suggest the player sits on the boundary between clusters
- Negative values indicate the player might be in the wrong cluster
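A sketch of the calculation using the `cluster` package, reusing the distance matrix `d` and cluster vector from earlier:

```r
library(cluster)

# Silhouette width for every player in the 4-cluster solution
sil <- silhouette(hc_cluster, d)

# Average silhouette width across all 1,000 players
mean(sil[, "sil_width"])
```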
Average Silhouette Width: 0.295
As a rough guide, an average silhouette width above 0.5 indicates good clustering structure, 0.25-0.5 suggests moderate structure, and below 0.25 means the clusters may be somewhat artificial. Our score of 0.295 therefore indicates moderate, but not strong, structure.
3e. Profiling the Clusters
Now comes the interesting part — understanding what makes each cluster different.
i. How Do Clusters Differ on the Six Performance Attributes?
| hc_cluster | Acceleration | Ball_Control | Dribbling | Shot_Power | Short_Passing | Sprint_Speed | Count |
|---|---|---|---|---|---|---|---|
| 1 | 81.4 | 80.6 | 80.1 | 77.1 | 77.1 | 81.1 | 446 |
| 2 | 67.1 | 80.8 | 76.5 | 77.4 | 81.4 | 65.8 | 222 |
| 3 | 48.5 | 23.7 | 16.1 | 25.1 | 33.0 | 49.2 | 107 |
| 4 | 57.9 | 66.9 | 56.6 | 63.8 | 70.4 | 61.6 | 225 |
Looking at these averages, we can characterise each cluster:
- Cluster 1 (446 players) has the highest scores across the board, especially for pace (acceleration 81.4, sprint speed 81.1). These are likely the top-tier players in the dataset.
- Cluster 2 (222 players) is almost as strong technically (ball control 80.8, short passing 81.4) but noticeably slower, suggesting skilful players who rely less on pace.
- Cluster 4 (225 players) sits in the middle on every attribute: average or still-developing players with room for improvement.
- Cluster 3 (107 players) scores very low on the technical attributes (dribbling 16.1, ball control 23.7) while retaining moderate pace, a pattern consistent with goalkeepers, whose outfield skills are rated low.
ii. How Do Clusters Differ on Age, Value, and Wage?
| hc_cluster | Mean_Age | Mean_Value | Mean_Wage |
|---|---|---|---|
| 1 | 26.3 | 20,885,202 | 77,580.7 |
| 2 | 27.8 | 18,587,838 | 73,563.1 |
| 3 | 29.1 | 14,350,935 | 51,766.4 |
| 4 | 28.0 | 13,142,222 | 57,151.1 |
These results make intuitive sense. The clusters with higher performance scores tend to have higher market values and wages, which is exactly what we'd expect: better players command higher prices. Age runs in the opposite direction here: the elite cluster is the youngest on average (26.3) and the weakest-scoring cluster the oldest (29.1), plausibly because the pace-heavy attributes measured here decline with age.
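Both profile tables can be produced with a grouped mean; a base-R sketch in which the age/value/wage column names are assumptions about the raw dataset:

```r
# Mean attribute scores within each hierarchical cluster
aggregate(fifa_attrs, by = list(hc_cluster = hc_cluster), FUN = mean)

# Mean age, market value and wage per cluster
# (column names "age", "value", "wage" are assumptions)
aggregate(fifa[, c("age", "value", "wage")],
          by = list(hc_cluster = hc_cluster), FUN = mean)
```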
4. K-means Clustering
4a. Performing K-means Clustering
K-means takes a different approach to clustering. Rather than building a hierarchy, it starts by randomly placing four cluster centres in the data, then iteratively assigns each player to the nearest centre and updates the centres based on the players assigned to them. This process repeats until the clusters stabilise.
We use set.seed(101) to ensure the results are reproducible, since K-means involves random initialisation.
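A sketch of the call; `nstart = 25` is an assumption (re-running the algorithm from several random starts guards against a poor initialisation):

```r
set.seed(101)  # K-means initialisation is random

# 4 clusters on the scaled attributes, best of 25 random starts
km <- kmeans(fifa_scaled, centers = 4, nstart = 25)
table(km$cluster)
```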
Cluster Sizes:
| Cluster | Number of Players |
|---|---|
| 1 | 127 |
| 2 | 381 |
| 3 | 384 |
| 4 | 108 |
4b. Assessing Cluster Quality
Average Silhouette Width: 0.331
This score tells us how cohesive and well-separated the K-means clusters are. We can compare this directly to the hierarchical clustering result to see which method performed better.
4c. Profiling the K-means Clusters
i. How Do Clusters Differ on the Six Performance Attributes?
| km_cluster | Acceleration | Ball_Control | Dribbling | Shot_Power | Short_Passing | Sprint_Speed | Count |
|---|---|---|---|---|---|---|---|
| 1 | 49.6 | 66.9 | 55.0 | 63.3 | 71.0 | 52.3 | 127 |
| 2 | 68.2 | 76.5 | 71.3 | 73.9 | 77.1 | 69.0 | 381 |
| 3 | 82.9 | 81.4 | 81.3 | 77.3 | 77.7 | 82.4 | 384 |
| 4 | 48.7 | 23.9 | 16.3 | 25.3 | 33.1 | 49.4 | 108 |
Similar to the hierarchical results, there is clear differentiation between clusters: Cluster 3 contains the elite, pacy players, Cluster 2 the good all-rounders, Cluster 1 the slower, more moderate players, and Cluster 4 mirrors the low-technical (likely goalkeeper) group found by the hierarchical method.
ii. How Do Clusters Differ on Age, Value, and Wage?
| km_cluster | Mean_Age | Mean_Value | Mean_Wage |
|---|---|---|---|
| 1 | 29.1 | 10,957,480 | 49,448.8 |
| 2 | 27.4 | 16,885,564 | 69,007.9 |
| 3 | 26.1 | 22,302,865 | 81,153.6 |
| 4 | 29.1 | 14,301,389 | 51,805.6 |
Again, we see the expected pattern where higher-performing clusters are associated with higher market values and wages. This consistency across both clustering methods gives us confidence that we’re capturing real structure in the data.
5. Comparing Hierarchical and K-means Clustering
5a. Which Algorithm Produced Higher Quality Clusters?
| Method | Average_Silhouette |
|---|---|
| Hierarchical (Ward’s) | 0.295 |
| K-means | 0.331 |
Based on the silhouette scores, K-means clustering produced slightly better-defined clusters. The higher silhouette width indicates that players within each cluster are more similar to each other and more distinct from players in other clusters.
That said, the difference here is relatively small (0.331 versus 0.295). Both approaches are capturing the same underlying structure in the data.
5b. Do Both Algorithms Produce Similar Cluster Profiles?
To understand how the two methods compare in terms of which players they group together, we can look at a cross-tabulation:
| hc_cluster \ km_cluster | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 0 | 77 | 369 | 0 |
| 2 | 12 | 195 | 15 | 0 |
| 3 | 0 | 0 | 0 | 107 |
| 4 | 115 | 109 | 0 | 1 |
This table shows how players in each hierarchical cluster (rows) are distributed across the K-means clusters (columns). Since cluster labels are arbitrary, agreement between the methods shows up not on the diagonal but as one dominant count in each row, and that is largely what we see: hierarchical cluster 1 maps mainly to K-means cluster 3 (369 of 446 players), cluster 2 to K-means cluster 2 (195 of 222), and cluster 3 corresponds exactly to K-means cluster 4 (all 107 players). Hierarchical cluster 4 is the least stable, splitting mostly between K-means clusters 1 and 2.
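The cross-tabulation itself is one line in R:

```r
# Rows: hierarchical clusters; columns: K-means clusters
table(hc = hc_cluster, km = km$cluster)
```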
Key Observations:
Similar overall patterns: Both methods successfully identify the same general player types — there are elite players who score highly across all attributes, average players in the middle, and weaker players at the lower end. This consistency is reassuring and suggests the clustering structure is genuine rather than an artefact of the algorithm.
Differences in boundaries: While the broad groups are similar, the exact boundaries between clusters differ. This is expected because the algorithms work in fundamentally different ways. Hierarchical clustering builds a tree by progressively merging the most similar players, whereas K-means tries to find spherical clusters that minimise within-cluster variance.
Cluster sizes: You may notice that the cluster sizes differ between methods. K-means tends to produce more evenly-sized clusters because it optimises for compact, spherical groups. Hierarchical clustering can produce more uneven sizes depending on the natural structure of the data.
Practical implications: For most practical purposes, either method would give useful results. The choice between them often comes down to whether you need a fixed number of clusters (K-means) or want to explore the hierarchical structure of your data (hierarchical clustering).
In summary, both algorithms tell a similar story about the FIFA players in this dataset: there are distinct groups based on their speed and technical abilities, and these groups correspond meaningfully to player value and wages in the real world.