Clustering FIFA Soccer Players

Sports Analytics & Insight — Classification & Clustering Assignment, Part B

Author

Luka Kelentric

Published

April 28, 2026

1 Introduction

Part B of the assignment uses clustering methods to separate 1000 FIFA soccer players into different groups based on six performance attributes: acceleration, ball_control, dribbling, shot_power, short_passing, and sprint_speed. Two clustering algorithms are applied and compared:

Hierarchical clustering using hclust() with Ward’s method.
K-means clustering using kmeans() with set.seed(101).

Each four-cluster solution is assessed using silhouette scores, profiled on the six attributes plus age, value and wage, and the two methods are then compared.

Code

library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(gt)
library(scales)
library(janitor)
library(cluster)

2 Question 1: Importing the Data

Code

#importing the dataset
fifa <- read.csv("fifa_dataset.csv")

#creating an additional data set with only the variables that will be used for this analysis
fifa_attrs <- select(fifa, acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed)

The full FIFA dataset is loaded and an additional data set (fifa_attrs) containing only the six attributes used for clustering is created. The full dataset is retained because age, value, and wage will be needed later to profile the clusters.

3 Question 2: Does the Data Need to be Scaled?

Code

summary(fifa_attrs)

  acceleration    ball_control     dribbling       shot_power   
 Min.   :26.00   Min.   :12.00   Min.   :10.00   Min.   :12.00  
 1st Qu.:62.00   1st Qu.:69.00   1st Qu.:61.00   1st Qu.:65.00  
 Median :72.00   Median :78.00   Median :75.00   Median :75.00  
 Mean   :69.39   Mean   :71.47   Mean   :67.17   Mean   :68.62  
 3rd Qu.:79.00   3rd Qu.:82.00   3rd Qu.:81.00   3rd Qu.:80.00  
 Max.   :96.00   Max.   :95.00   Max.   :97.00   Max.   :94.00  
 short_passing    sprint_speed  
 Min.   :15.00   Min.   :28.00  
 1st Qu.:70.00   1st Qu.:63.75  
 Median :76.00   Median :72.00  
 Mean   :71.82   Mean   :69.91  
 3rd Qu.:80.00   3rd Qu.:79.00  
 Max.   :92.00   Max.   :96.00

The data does not need to be scaled. All six attributes are measured on the same 1–100 FIFA rating scale, and the summary() output above confirms that every observed value lies between 10 and 97. Because they share a common scale, no attribute will dominate the Euclidean distance calculation simply because of its measurement units, which is the problem scaling exists to prevent.
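Sharing a rating scale does not by itself guarantee a similar spread, so a quick check of the per-attribute standard deviations can support the decision. The sketch below uses a small made-up data frame (toy_attrs and its values are hypothetical, not taken from fifa_dataset.csv) and also shows what scale() would do if standardising were needed.

```r
# Toy ratings (hypothetical values) standing in for fifa_attrs
toy_attrs <- data.frame(
  acceleration = c(80, 65, 48, 57, 90),
  ball_control = c(81, 79, 24, 66, 85),
  dribbling    = c(80, 74, 16, 56, 88)
)

# Range and standard deviation of each attribute
spread <- data.frame(
  min = sapply(toy_attrs, min),
  max = sapply(toy_attrs, max),
  sd  = sapply(toy_attrs, sd)
)
print(spread)

# Ratio of the largest to the smallest standard deviation;
# a value far above 1 would suggest one attribute carries more weight
sd_ratio <- max(spread$sd) / min(spread$sd)
print(sd_ratio)

# For comparison: scale() standardises each column to mean 0 and SD 1
toy_scaled <- scale(toy_attrs)
print(round(apply(toy_scaled, 2, sd), 3))
```

On the real data the same two sapply() calls could be run on fifa_attrs directly.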

4 Question 3: Hierarchical Clustering

4.1 Q3a — Distance matrix

Code

#compute the Euclidean distance between every pair of players
d_fifa <- dist(fifa_attrs)

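dist() returns a "dist" object holding the Euclidean distance between every pair of players, measured across the six clustering attributes. It stores only the lower triangle (1000 × 999 / 2 = 499,500 distances) and can be expanded to the full symmetric 1000 × 1000 matrix with as.matrix().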
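As a small illustration of what dist() stores, the sketch below uses four made-up players rated on two attributes (the toy data frame and its values are hypothetical) and then counts the pairs for the 1000 players in this report.

```r
# Four made-up players rated on two attributes (hypothetical values)
toy <- data.frame(acceleration = c(80, 82, 48, 50),
                  ball_control = c(81, 79, 24, 30))

d_toy <- dist(toy)          # Euclidean distance is the default method
length(d_toy)               # one distance per pair: 4 * 3 / 2

m_toy <- as.matrix(d_toy)   # expand to the full symmetric matrix
dim(m_toy)
round(m_toy, 2)             # zero diagonal, mirrored upper and lower triangles

# The same pair count for the 1000 players in this report
n <- 1000
n_pairs <- n * (n - 1) / 2
n_pairs
```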

4.2 Q3b — Hierarchical clustering with `hclust()`

Code

#Hierarchical clustering - Ward method
h_fifa <- hclust(d_fifa, method = 'ward.D')

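Ward’s method is used because it tends to produce compact, similarly sized clusters that are easier to interpret when profiling, which is the next step in the analysis. Note that in R, 'ward.D' applied to unsquared distances does not implement Ward’s (1963) minimum-variance criterion exactly; 'ward.D2' does, so the two options can give different trees.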
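hclust() offers two Ward variants, 'ward.D' (used above) and 'ward.D2'. The sketch below compares them on random toy data, not on the FIFA data; all object names are illustrative. It relies on the documented relationship that 'ward.D2' squares the dissimilarities before updating, so running 'ward.D' on squared distances should reproduce the 'ward.D2' tree with heights on the squared scale.

```r
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)   # 20 random points, for illustration only
d_toy <- dist(toy)

h_d   <- hclust(d_toy,   method = "ward.D")    # option used in this report
h_d2  <- hclust(d_toy,   method = "ward.D2")   # Ward's criterion on Euclidean input
h_dsq <- hclust(d_toy^2, method = "ward.D")    # ward.D fed squared distances

# ward.D on squared distances should match ward.D2:
# same merge order, heights equal after taking square roots
same_merges  <- isTRUE(all.equal(h_dsq$merge, h_d2$merge))
same_heights <- isTRUE(all.equal(sqrt(h_dsq$height), h_d2$height))
print(c(same_merges = same_merges, same_heights = same_heights))

# ward.D on unsquared distances may or may not cut into the same groups
print(table(ward.D  = cutree(h_d,  k = 3),
            ward.D2 = cutree(h_d2, k = 3)))
```

Running the same cross-tabulation on d_fifa would show how much the choice of option matters for this analysis.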

4.3 Q3c — Visualisation: dendrogram and heatmap

Code

plot(h_fifa, hang = -1, labels = FALSE,
     main = "Hierarchical Clustering Dendrogram (Ward's Method)")

Dendrogram of the hierarchical clustering using Ward’s method.

Code

heatmap(as.matrix(d_fifa),
        Rowv = as.dendrogram(h_fifa),
        Colv = 'Rowv',
        labRow = FALSE, labCol = FALSE,
        main = "Heatmap of Distances")

Heatmap of the pairwise distance matrix, ordered by the dendrogram.

4.3.1 Q3c.i — Does the heatmap show clustering structure?

The heatmap shows clear evidence of clustering structure. Distinct square blocks of low-distance (dark) colour are visible along the diagonal, separated by lighter regions of higher distance. Each diagonal block represents a group of players who are all close to one another in the six-attribute space, while lighter off-diagonal regions represent the larger distances between players in different blocks.

In particular, one block stands out as especially dark and well-separated from the rest — this corresponds to the very tightly defined goalkeeper / non-outfield cluster identified later (silhouette score ~0.69). The remaining blocks are slightly less distinct but still clearly visible, indicating moderate clustering structure overall.

4.4 Q3d — Four-cluster solution and quality assessment

Code

#Four cluster solution
clusters_h <- cutree(h_fifa, k = 4)

#Quality assessment through silhouette scores
sil_h <- silhouette(clusters_h, d_fifa)
summary(sil_h)

Silhouette of 1000 units in 4 clusters from silhouette.default(x = clusters_h, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
       492        193        107        208 
0.30592133 0.28830631 0.69464229 0.08230691 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3653  0.1461  0.3160  0.2976  0.4576  0.7958

The four-cluster hierarchical solution has an overall mean silhouette score of approximately 0.30, which falls in the “moderate structure” range (0.25–0.50). The per-cluster silhouettes are uneven. One cluster (107 players) has an excellent score of approximately 0.69: these players are tightly grouped and well separated from everyone else. The largest cluster (492 players) scores around 0.31 and another (193 players) scores 0.29. The remaining cluster of 208 players scores only 0.08, meaning those players sit very close to the boundary of another cluster and are only weakly assigned.

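The minimum individual silhouette is −0.37. A negative value means a player is, on average, closer to the members of a neighbouring cluster than to the members of their own, so a small number of players would actually be a better fit in a different cluster. Overall, the solution captures real but not highly distinct structure in the data.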
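To make the silhouette scores easier to interpret, the sketch below computes the silhouette width by hand for six made-up players (the toy data, the groups vector and the helper sil_by_hand() are hypothetical) and compares the result with cluster::silhouette().

```r
library(cluster)

# Six made-up players in two obvious groups (hypothetical values)
toy <- data.frame(acceleration = c(80, 82, 78, 48, 50, 46),
                  ball_control = c(81, 79, 83, 24, 30, 22))
groups <- c(1L, 1L, 1L, 2L, 2L, 2L)
m <- as.matrix(dist(toy))

# Silhouette width for observation i:
#   a = mean distance to the other members of its own cluster
#   b = smallest mean distance to the members of any other cluster
#   s = (b - a) / max(a, b)
sil_by_hand <- function(i) {
  own <- groups == groups[i]
  a <- mean(m[i, own & seq_along(groups) != i])
  others <- setdiff(unique(groups), groups[i])
  b <- min(sapply(others, function(g) mean(m[i, groups == g])))
  (b - a) / max(a, b)
}

manual  <- sapply(seq_along(groups), sil_by_hand)
package <- as.numeric(silhouette(groups, dist(toy))[, "sil_width"])

print(round(cbind(manual, package), 4))
```

A width near 1 means a is much smaller than b; a width near 0 means the two are about equal; a negative width means a exceeds b.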

4.5 Q3e — Profiling of the hierarchical clusters

Code

#attach the cluster IDs to the whole data set and label them accordingly
fifa_clus_h <- cbind(fifa, clusters_h)
fifa_clus_h <- mutate(fifa_clus_h, Cluster = case_when(clusters_h == 1 ~ 'C1',
                                                       clusters_h == 2 ~ 'C2',
                                                       clusters_h == 3 ~ 'C3',
                                                       clusters_h == 4 ~ 'C4'))

4.5.1 Q3e.i — How do clusters differ on the six attributes?

Code

#average of each of the 6 attributes by cluster
fifa_clus_h_means <- fifa_clus_h %>%
  group_by(Cluster) %>%
  summarise(acceleration  = mean(acceleration),
            ball_control  = mean(ball_control),
            dribbling     = mean(dribbling),
            shot_power    = mean(shot_power),
            short_passing = mean(short_passing),
            sprint_speed  = mean(sprint_speed))

fifa_clus_h_means

# A tibble: 4 × 7
  Cluster acceleration ball_control dribbling shot_power short_passing
  <chr>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 C1              80.8         81.1      80.4       77.0          77.9
2 C2              65.2         79.0      73.9       77.2          79.4
3 C3              48.5         23.7      16.1       25.1          33.0
4 C4              57.0         66.3      55.8       63.3          70.3
# ℹ 1 more variable: sprint_speed <dbl>

Code

#tidy + line graph
fifa_clus_h_tidy <- fifa_clus_h_means %>%
  pivot_longer(cols = c(acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed),
               names_to = "Attribute", values_to = "Average_Value")

#reorder attributes for a more digestible line graph
fifa_clus_h_tidy$Attribute <- factor(fifa_clus_h_tidy$Attribute,
                                     levels = c("acceleration", "sprint_speed", "ball_control", "dribbling", "short_passing", "shot_power"))

ggplot(fifa_clus_h_tidy, aes(x = Attribute, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Score (1-100)") +
  ggtitle("Hierarchical Clusters - Mean Score per Attribute")

The four hierarchical clusters describe four clearly different player profiles:

Cluster C1: Elite all-rounders (492 players). Score highly across every attribute (means in the high 70s to low 80s), with acceleration = 81 and ball_control = 81. These are the fastest, most technically gifted players in the dataset.
Cluster C2: Technical playmakers (193 players). High on the skill-based attributes (ball_control = 79, short_passing = 79, shot_power = 77) but markedly lower on the speed-based attributes (acceleration and sprint_speed both ≈ 65). These are skilled but slower players.
Cluster C3: Goalkeepers / non-outfield players (107 players). Very low on every outfield attribute (ball_control = 24, dribbling = 16, shot_power = 25), with only moderate acceleration and sprint_speed (~49). This profile fits goalkeepers, whose ratings on outfield skills are typically very low.
Cluster C4: Average squad players (208 players). Mid-range across all attributes (50s–70s), with no standout strengths or weaknesses. These are typical squad-rotation players.

4.5.2 Q3e.ii — How do clusters differ on age, club value, and wage?

Code

#profile by age, value, wage
fifa_clus_h_demo <- fifa_clus_h %>%
  group_by(Cluster) %>%
  summarise(mean_age   = mean(age),
            mean_value = mean(value),
            mean_wage  = mean(wage))

fifa_clus_h_demo

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <chr>      <dbl>      <dbl>     <dbl>
1 C1          26.3  21355081.    79226.
2 C2          28.1  16098446.    65772.
3 C3          29.1  14350935.    51766.
4 C4          28.0  13387500     58260.

Code

#bar charts for age, value and wage
fifa_clus_h_demo_tidy <- fifa_clus_h_demo %>%
  pivot_longer(cols = c(mean_age, mean_value, mean_wage), names_to = "Variable", values_to = "Mean")

ggplot(fifa_clus_h_demo_tidy, aes(x = Cluster, y = Mean, fill = Cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Variable, scales = "free_y") +
  ggtitle("Hierarchical Clusters - Age, Value, Wage")

The demographic profiles align closely with the attribute profiles:

C1 (Elite all-rounders) are the youngest (mean age ≈ 26.3) and command the highest market value (~€21.4 M) and highest wage (~€79 K/week). This matches the commercial reality that elite young players are the most marketable and best-paid assets.
C2 (Technical playmakers) are slightly older (~28.1 years) with mid-range value (~€16.1 M) and wage (~€66 K/week).
C3 (Goalkeepers) are the oldest (~29.1 years) and earn the lowest mean wage (~€52 K/week). This is consistent with the reality that goalkeepers tend to peak later in their careers and earn less than equivalent outfielders.
C4 (Average squad players) have the lowest market value (~€13.4 M) despite mid-range age (~28 years), reflecting their squad-rotation rather than star status.

5 Question 4: K-means Clustering

5.1 Q4a — Carry out K-means with four clusters

Code

set.seed(101)
kmeans_fifa <- kmeans(fifa_attrs, centers = 4)

#inspection of the result
kmeans_fifa$cluster   # cluster assignment for each player

   [1] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 4 3 4 2 2 4 2 3 2 4 2 2 2 2 3 2 2 3 2 2
  [38] 3 3 2 1 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 2 2 3 2 3 1 2 2 3 2 2 2 1 4 2 1 2
  [75] 2 2 2 2 3 1 4 1 1 3 1 2 3 4 2 2 2 3 2 2 1 2 2 2 2 2 2 2 1 1 2 4 2 2 2 2 3
 [112] 2 2 1 1 2 2 2 1 2 3 2 2 2 3 3 2 2 4 3 1 3 3 2 1 4 3 3 2 2 2 2 2 2 3 3 3 2
 [149] 3 3 3 2 3 1 2 2 3 3 4 3 4 2 2 1 2 2 3 3 4 2 3 3 3 3 3 2 4 4 4 2 2 2 2 3 2
 [186] 4 4 2 2 3 2 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 2 2 3 3 1 2 2 3 2 2 3 3 2 2 2
 [223] 1 2 4 2 3 2 3 2 2 4 2 3 2 2 3 3 2 2 3 2 2 4 2 2 2 2 3 3 1 1 1 4 3 3 2 2 3
 [260] 2 2 2 3 1 2 2 3 3 3 2 2 2 3 2 2 2 2 3 1 1 3 3 1 3 2 2 2 3 3 2 2 3 2 2 3 3
 [297] 4 1 2 2 2 1 2 2 3 4 2 2 2 3 2 2 2 2 2 2 2 4 4 1 3 2 4 2 3 3 2 2 1 3 3 2 2
 [334] 2 2 2 3 4 3 3 2 2 3 2 2 1 1 3 2 2 3 2 2 2 1 2 2 2 3 3 2 2 1 3 3 2 2 1 3 1
 [371] 3 2 3 2 2 1 2 2 3 2 2 2 3 1 3 2 2 1 2 3 2 3 3 3 2 2 2 2 3 1 3 2 2 2 2 2 2
 [408] 2 2 3 2 2 1 2 1 3 2 3 4 2 3 2 2 2 3 2 2 2 2 2 2 4 1 1 4 3 3 2 3 3 3 2 2 2
 [445] 3 1 1 3 2 2 2 2 4 2 2 2 2 3 1 2 3 2 3 1 1 1 2 2 3 2 2 2 2 2 3 3 2 2 4 4 4
 [482] 2 2 3 4 4 2 3 2 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 2 2 2 3 3 3 3 3 2 1 4 4
 [519] 4 3 3 3 2 2 3 1 3 1 2 2 2 1 1 2 3 1 1 1 1 2 2 2 2 3 2 2 2 2 3 3 3 2 2 1 3
 [556] 2 2 4 2 2 2 2 2 3 2 2 2 2 1 2 1 2 2 3 3 2 2 3 2 2 3 3 2 2 2 3 4 4 3 1 3 3
 [593] 4 2 2 2 3 2 1 3 3 1 1 2 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 3 1 1 2 2 4 1 2 2
 [630] 3 3 1 2 3 2 2 1 3 3 2 2 2 2 2 1 1 3 1 2 2 2 1 4 1 1 1 2 3 3 2 3 3 1 3 3 3
 [667] 1 1 1 1 2 2 2 1 1 3 4 2 1 2 3 4 4 4 3 3 3 4 4 2 2 2 2 3 1 1 2 3 1 1 2 2 2
 [704] 2 2 4 4 3 3 3 2 2 2 3 1 2 2 2 2 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 2 3 2 2 2 1
 [741] 1 3 1 3 1 1 3 1 1 4 3 1 1 1 4 4 4 2 3 1 2 1 1 1 3 1 1 4 4 2 2 2 2 2 2 2 2
 [778] 2 2 3 2 3 1 1 1 3 3 3 3 3 1 1 2 1 3 1 3 1 1 4 4 3 3 3 3 3 3 1 2 3 1 4 4 2
 [815] 2 3 3 3 2 1 1 2 2 1 2 4 1 4 3 4 3 2 2 4 4 2 2 2 3 3 1 1 2 3 1 1 1 3 1 1 1
 [852] 2 4 1 4 3 4 4 4 4 4 2 2 3 3 3 1 3 1 2 2 4 4 2 2 3 4 4 2 2 2 2 3 3 3 1 1 4
 [889] 4 4 4 4 3 3 1 2 3 2 1 3 1 3 1 1 4 3 3 1 3 1 1 1 3 2 3 4 4 3 4 4 4 4 3 3 1
 [926] 1 4 4 4 4 3 3 1 1 3 1 3 2 2 1 3 1 1 1 3 4 3 1 1 1 1 3 3 2 3 4 3 1 3 4 2 3
 [963] 3 3 3 2 2 3 1 3 4 3 4 1 3 3 1 1 1 3 4 1 3 4 1 1 1 1 1 4 4 4 1 2 3 3 3 3 2
[1000] 1

Code

kmeans_fifa$centers   # the 4 centroids (mean of each attribute per cluster)

  acceleration ball_control dribbling shot_power short_passing sprint_speed
1     58.16667     64.53448  53.31034   60.32759      68.72989     62.16092
2     82.05568     81.30858  81.06497   77.12297      77.85151     81.35731
3     65.00346     78.59862  73.58478   76.90657      78.97924     65.09689
4     48.31132     23.44340  15.88679   25.06604      32.84906     49.17925

Code

kmeans_fifa$size      # number of players in each cluster

[1] 174 431 289 106

Code

kmeans_fifa$iter      # iterations until the algorithm converged

[1] 4

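set.seed(101) is required because kmeans() chooses its initial centroids at random; fixing the seed makes the cluster assignments reproducible. K-means is fed the raw attribute data (not the distance matrix) and asked to find four cluster centroids.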
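The role of the seed can be illustrated on toy data. The sketch below uses random points rather than the FIFA data, and all object names are illustrative. It also shows the nstart argument of kmeans(), which is not used in this report: it runs several random starts and keeps the solution with the lowest total within-cluster sum of squares.

```r
# Two well-separated blobs of random points (illustration only)
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))

# Same seed, same starting centroids, same result
set.seed(101); k_a <- kmeans(toy, centers = 2)
set.seed(101); k_b <- kmeans(toy, centers = 2)
print(identical(k_a$cluster, k_b$cluster))

# nstart = 25 tries 25 random starts and keeps the best one
set.seed(101)
k_multi <- kmeans(toy, centers = 2, nstart = 25)
print(c(single = k_a$tot.withinss, multi = k_multi$tot.withinss))
```

With a single start, a different seed can land in a different local optimum; multiple starts reduce that dependence at the cost of extra computation.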

5.2 Q4b — Assessing the quality of the K-means solution

Code

sil_k <- silhouette(kmeans_fifa$cluster, d_fifa)
summary(sil_k)

Silhouette of 1000 units in 4 clusters from silhouette.default(x = kmeans_fifa$cluster, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
      174       431       289       106 
0.1777123 0.3802751 0.2177741 0.6883372 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1060  0.1807  0.3313  0.3307  0.4788  0.7898

The K-means solution achieves an overall mean silhouette of approximately 0.33, again falling in the “moderate structure” range. Per-cluster silhouettes are 0.18, 0.38, 0.22, and 0.69. The cluster with silhouette ~0.69 (106 players) is essentially the same goalkeeper-style cluster identified by hierarchical clustering: it is extremely tight and well separated. The remaining three clusters score between 0.18 and 0.38. Two of them (0.18 and 0.22) fall below the 0.25 threshold for moderate structure, but none is as weak as the hierarchical solution’s 0.08 cluster.

Cluster sizes (174, 431, 289, 106) are also more evenly balanced than the hierarchical solution (107, 193, 208, 492). Both observations suggest K-means has produced a slightly cleaner partition.

5.3 Q4c — Profile the K-means clusters

Code

#attaching cluster IDs
fifa_clus_k <- fifa %>%
  mutate(clusters_k = kmeans_fifa$cluster) %>%
  mutate(Cluster = case_when(clusters_k == 1 ~ 'C1',
                             clusters_k == 2 ~ 'C2',
                             clusters_k == 3 ~ 'C3',
                             clusters_k == 4 ~ 'C4'))

5.3.1 Q4c.i — How do clusters differ on the six attributes?

Code

#profile on the 6 attributes (same method as hierarchical)
fifa_clus_k_means <- fifa_clus_k %>%
  group_by(Cluster) %>%
  summarise(acceleration  = mean(acceleration),
            ball_control  = mean(ball_control),
            dribbling     = mean(dribbling),
            shot_power    = mean(shot_power),
            short_passing = mean(short_passing),
            sprint_speed  = mean(sprint_speed))

fifa_clus_k_means

# A tibble: 4 × 7
  Cluster acceleration ball_control dribbling shot_power short_passing
  <chr>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 C1              58.2         64.5      53.3       60.3          68.7
2 C2              82.1         81.3      81.1       77.1          77.9
3 C3              65.0         78.6      73.6       76.9          79.0
4 C4              48.3         23.4      15.9       25.1          32.8
# ℹ 1 more variable: sprint_speed <dbl>

Code

fifa_clus_k_tidy <- fifa_clus_k_means %>%
  pivot_longer(cols = c(acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed),
               names_to = "Attribute", values_to = "Average_Value")

fifa_clus_k_tidy$Attribute <- factor(fifa_clus_k_tidy$Attribute,
                                     levels = c("acceleration", "sprint_speed", "ball_control", "dribbling", "short_passing", "shot_power"))

ggplot(fifa_clus_k_tidy, aes(x = Attribute, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Score (1-100)") +
  ggtitle("K-means Clusters - Mean Score per Attribute")

K-means produces four clusters with profiles almost identical to those from hierarchical clustering, although the cluster numbering (C1–C4) is shuffled because the algorithms assign IDs differently:

K-means C2 — Elite all-rounders (431 players). Mean attributes ~77–82 across the board. Equivalent to hierarchical C1.
K-means C3 — Technical playmakers (289 players). Skill scores in the high 70s with speed scores ~65. Equivalent to hierarchical C2.
K-means C1 — Average squad players (174 players). Mid-range scores ~53–69. Equivalent to hierarchical C4.
K-means C4 — Goalkeepers / non-outfield (106 players). Very low outfield skill scores (16–33). Equivalent to hierarchical C3.

5.3.2 Q4c.ii — How do clusters differ on age, club value, and wage?

Code

#profile by age, value, wage
fifa_clus_k_demo <- fifa_clus_k %>%
  group_by(Cluster) %>%
  summarise(mean_age   = mean(age),
            mean_value = mean(value),
            mean_wage  = mean(wage))

fifa_clus_k_demo

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <chr>      <dbl>      <dbl>     <dbl>
1 C1          27.7  13317241.    58540.
2 C2          26.2  21964037.    80708.
3 C3          28.1  15971626.    65225.
4 C4          29.0  14475000     51972.

Code

fifa_clus_k_demo_tidy <- fifa_clus_k_demo %>%
  pivot_longer(cols = c(mean_age, mean_value, mean_wage), names_to = "Variable", values_to = "Mean")

ggplot(fifa_clus_k_demo_tidy, aes(x = Cluster, y = Mean, fill = Cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Variable, scales = "free_y") +
  ggtitle("K-means Clusters - Age, Value, Wage")

The K-means demographic profiles also closely mirror the hierarchical results:

C2 (Elite all-rounders) — youngest (~26.2 years), most valuable (~€22.0 M), highest wage (~€81 K/week).
C3 (Technical playmakers) — ~28.1 years, ~€16.0 M, ~€65 K/week.
C1 (Average squad players) — ~27.7 years, ~€13.3 M, ~€59 K/week.
C4 (Goalkeepers) — oldest (~29 years), lowest wage (~€52 K/week).

The matched clusters from the two methods agree on every demographic to within a few percent.

6 Question 5: Comparing Hierarchical vs K-means

Code

#comparing overall silhouette scores
mean_sil_h <- mean(sil_h[, "sil_width"])      # hierarchical average silhouette
mean_sil_k <- mean(sil_k[, "sil_width"])      # k-means average silhouette

mean_sil_h

[1] 0.297603

Code

mean_sil_k

[1] 0.330721

Code

#cross-tabulate the two cluster assignments
table(Hierarchical = clusters_h, KMeans = kmeans_fifa$cluster)

            KMeans
Hierarchical   1   2   3   4
           1   3 428  61   0
           2   1   3 189   0
           3   1   0   0 106
           4 169   0  39   0

6.1 Q5a — Which algorithm produced the highest quality clusters?

K-means produced the higher-quality clustering solution. Its overall mean silhouette score was 0.331, compared with 0.298 for the hierarchical (Ward’s) solution, a difference of about 0.033. K-means also achieved somewhat more balanced cluster sizes (range 106–431) than hierarchical (range 107–492) and avoided the very weak cluster in the hierarchical solution (per-cluster silhouette of only 0.08, indicating those players are barely closer to their assigned cluster than to the nearest other cluster). Both scores fall in the “moderate structure” range, meaning that genuine but not strongly distinct groupings exist in the data.
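The overall scores quoted above can be cross-checked from the per-cluster figures, since the overall mean silhouette is the size-weighted mean of the per-cluster averages. The sketch below copies the cluster sizes and average widths from the two summary() outputs; the vector names are illustrative.

```r
# Cluster sizes and average silhouette widths copied from the summary() outputs
sizes_h          <- c(492, 193, 107, 208)
sil_h_by_cluster <- c(0.30592133, 0.28830631, 0.69464229, 0.08230691)

sizes_k          <- c(174, 431, 289, 106)
sil_k_by_cluster <- c(0.1777123, 0.3802751, 0.2177741, 0.6883372)

# Overall mean silhouette = size-weighted mean of the per-cluster averages
overall_h <- weighted.mean(sil_h_by_cluster, sizes_h)
overall_k <- weighted.mean(sil_k_by_cluster, sizes_k)

print(round(c(hierarchical = overall_h,
              kmeans       = overall_k,
              difference   = overall_k - overall_h), 3))
```

The two weighted means reproduce mean_sil_h and mean_sil_k, which confirms that the large, weakly assigned 208-player cluster is what pulls the hierarchical average down.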

6.2 Q5b — Did both algorithms produce clusters with similar profiles?

Yes — despite the difference in algorithm, both methods produced four clusters with strikingly similar profiles. Each method identified the same four player archetypes:

Elite all-rounders: high on every attribute (77–82), youngest, most valuable, highest paid.
Technical playmakers: high skill (74–79), lower speed (65), mid-tier value and wage.
Average squad players: mid-range scores (53–70), lowest market value.
Goalkeepers / non-outfield players: very low outfield-skill scores (16–33), oldest, lowest wage.

The matched clusters’ attribute means agreed to within about 3 points on every variable shown, and demographic profiles agreed to within a few percent. The cross-tabulation confirms this: most players are concentrated along a “diagonal” mapping between the two methods (with the row-column labels permuted), meaning both algorithms put the same players in the same conceptual groups.
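The label permutation and the level of agreement can be read off the cross-tabulation programmatically. The sketch below re-enters the counts from the table above as a matrix (xtab, best_match and agreement are illustrative names) and matches each hierarchical cluster to the K-means cluster holding most of its players.

```r
# Counts copied from the cross-tabulation (rows: hierarchical, columns: K-means)
xtab <- matrix(c(  3, 428,  61,   0,
                   1,   3, 189,   0,
                   1,   0,   0, 106,
                 169,   0,  39,   0),
               nrow = 4, byrow = TRUE,
               dimnames = list(Hierarchical = 1:4, KMeans = 1:4))

# For each hierarchical cluster, the K-means cluster with the most shared players
best_match <- apply(xtab, 1, which.max)
print(best_match)

# Share of players that fall in a matched pair of clusters
agreement <- sum(apply(xtab, 1, max)) / sum(xtab)
print(agreement)
```

Under this matching, hierarchical C1, C2, C3 and C4 pair with K-means C2, C3, C4 and C1 respectively, and 892 of the 1000 players (89.2%) land in matched clusters.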

The most noticeable difference is in the boundary between the elite and technical clusters: hierarchical placed 492 players in the elite cluster vs K-means’ 431. The cross-tabulation shows that 61 of hierarchical’s elite players are assigned to the technical cluster by K-means, which, together with 39 players from hierarchical’s average cluster, explains the larger K-means technical cluster (289 vs 193). K-means is more conservative about who qualifies as elite, which may be part of what gives it the silhouette advantage. The goalkeeper cluster, in contrast, is effectively identical across methods (107 vs 106 players, near-identical attribute means and a per-cluster silhouette of 0.69 in both cases); this is the clearest group in the dataset.
