Sports Analytics & Insight: Classification & Clustering Assignment, Part B
Part B of the assignment uses clustering methods to separate 1000 FIFA soccer players into different groups based on six performance attributes: acceleration, ball_control, dribbling, shot_power, short_passing, and sprint_speed. Two clustering algorithms are applied and compared:
- Hierarchical clustering: hclust() with Ward’s method.
- K-means clustering: kmeans() with set.seed(101).

Each four-cluster solution is assessed using silhouette scores, profiled on the six attributes plus age, value and wage, and the two methods are then compared.
library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(gt)
library(scales)
library(janitor)
library(cluster)

#importing the dataset
fifa <- read.csv("fifa_dataset.csv")
#creating an additional data set with only the variables that will be used for this analysis
fifa_attrs <- select(fifa, acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed)

The full FIFA dataset is loaded and an additional data set (fifa_attrs) containing only the six attributes used for clustering is created. The full dataset is retained because age, value, and wage will be needed later to profile the clusters.

summary(fifa_attrs)
  acceleration    ball_control     dribbling       shot_power
 Min.   :26.00   Min.   :12.00   Min.   :10.00   Min.   :12.00
 1st Qu.:62.00   1st Qu.:69.00   1st Qu.:61.00   1st Qu.:65.00
 Median :72.00   Median :78.00   Median :75.00   Median :75.00
 Mean   :69.39   Mean   :71.47   Mean   :67.17   Mean   :68.62
 3rd Qu.:79.00   3rd Qu.:82.00   3rd Qu.:81.00   3rd Qu.:80.00
 Max.   :96.00   Max.   :95.00   Max.   :97.00   Max.   :94.00
 short_passing    sprint_speed
 Min.   :15.00   Min.   :28.00
 1st Qu.:70.00   1st Qu.:63.75
 Median :76.00   Median :72.00
 Mean   :71.82   Mean   :69.91
 3rd Qu.:80.00   3rd Qu.:79.00
 Max.   :92.00   Max.   :96.00
The data does not need to be scaled. All six attributes are measured on the same 1-100 FIFA rating scale, as confirmed by the summary() output above. Because they share a common scale, no attribute will dominate the Euclidean distance calculation simply because of its measurement units (which is what scaling exists to prevent).
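One way to sanity-check a no-scaling decision is to compare the spread of each attribute and to see how closely the unscaled distances track the standardised ones. The sketch below is illustrative only: it uses simulated ratings, not the FIFA data, and the data frame `ratings` and its column names are made up for the example.

```r
# Simulated ratings on a shared 1-100 scale (illustrative only, not the FIFA data)
set.seed(1)
ratings <- data.frame(
  pace    = round(runif(200, 30, 95)),
  control = round(runif(200, 15, 95)),
  passing = round(runif(200, 20, 90))
)

# Standard deviation of each column: similar values mean no column dominates
sds <- sapply(ratings, sd)
print(round(sds, 1))

# Ratio of the largest to the smallest spread; values near 1 suggest scaling is optional
sd_ratio <- max(sds) / min(sds)
print(sd_ratio)

# Correlation between distances computed on raw and on standardised data
d_raw    <- dist(ratings)
d_scaled <- dist(scale(ratings))
dist_cor <- cor(as.vector(d_raw), as.vector(d_scaled))
print(dist_cor)
```

A spread ratio close to 1 and a distance correlation close to 1 would both support clustering on the raw ratings; a large ratio would be a reason to revisit the decision even though the attributes share a nominal scale.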
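#compute the Euclidean distance between every pair of players
d_fifa <- dist(fifa_attrs)

dist() returns a dist object holding the Euclidean distance between every pair of players, measured across the six clustering attributes. It stores one value per pair (1000 × 999 / 2 = 499,500 distances), which is the lower triangle of the full 1000 × 1000 distance matrix; as.matrix(d_fifa) expands it to that full matrix.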
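To make the structure of a distance object concrete, here is a minimal sketch on five made-up players with two attributes. The data frame `toy` and its values are hypothetical and only serve to show how the stored distances relate to the full matrix.

```r
# Five made-up players rated on two attributes (illustrative values only)
toy <- data.frame(speed = c(80, 82, 45, 60, 61),
                  skill = c(78, 80, 20, 70, 72))

d_toy <- dist(toy)               # Euclidean distance is the default method
n <- nrow(toy)

length(d_toy)                    # number of stored pairwise distances
n * (n - 1) / 2                  # same count: one entry per unordered pair

m_toy <- as.matrix(d_toy)        # expand to the full symmetric n x n matrix
dim(m_toy)
m_toy[1, 2]                      # distance between players 1 and 2
sqrt((80 - 82)^2 + (78 - 80)^2)  # the same value computed by hand
```

With 1000 players the same arithmetic gives 1000 × 999 / 2 = 499,500 stored distances.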
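Hierarchical clustering with hclust()

#Hierarchical clustering - Ward method
h_fifa <- hclust(d_fifa, method = 'ward.D')

Ward’s method ('ward.D') is used because it tends to produce compact, similarly sized clusters that are easier to interpret when profiling, which is the next step in the analysis. Note that, according to the hclust() documentation, 'ward.D' only implements Ward’s minimum-variance criterion when it is given squared distances; on the unsquared Euclidean distances in d_fifa, 'ward.D2' is the option that matches that criterion. All results below come from 'ward.D' as run here.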
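The hclust() documentation distinguishes two Ward variants, 'ward.D' and 'ward.D2'. The following sketch, run on simulated data rather than the FIFA attributes, illustrates how they relate: 'ward.D2' on Euclidean distances should give the same partition as 'ward.D' on squared distances. The object names are assumptions made for the example.

```r
set.seed(42)
# Simulated data: 60 points in three loose groups (illustrative only)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
d <- dist(x)

h_d    <- hclust(d,   method = "ward.D")   # the option used in this report
h_d2   <- hclust(d,   method = "ward.D2")  # Ward's criterion on Euclidean distances
h_d_sq <- hclust(d^2, method = "ward.D")   # "ward.D" applied to squared distances

# "ward.D" on squared distances reproduces the "ward.D2" partition
same_as_d2 <- identical(cutree(h_d_sq, k = 3), cutree(h_d2, k = 3))
print(same_as_d2)

# Cross-tabulate "ward.D" against "ward.D2" to see whether they differ on this data
print(table(ward.D = cutree(h_d, k = 3), ward.D2 = cutree(h_d2, k = 3)))
```

On well-separated data the two options often agree; running the same cross-tabulation on d_fifa would show how much the choice matters for the four-cluster solution here.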
plot(h_fifa, hang = -1, labels = FALSE,
     main = "Hierarchical Clustering Dendrogram (Ward's Method)")

heatmap(as.matrix(d_fifa),
        Rowv = as.dendrogram(h_fifa),
        Colv = 'Rowv',
        labRow = FALSE, labCol = FALSE,
        main = "Heatmap of Distances")

The heatmap shows clear evidence of clustering structure. Distinct square blocks of low-distance (dark) colour are visible along the diagonal, separated by lighter regions of higher distance. Each diagonal block represents a group of players who are all close to one another in the six-attribute space, while lighter off-diagonal regions represent the larger distances between players in different blocks.
In particular, one block stands out as especially dark and well separated from the rest. It corresponds to the tightly defined goalkeeper cluster identified later (silhouette score of about 0.69). The remaining blocks are less distinct but still visible, indicating moderate clustering structure overall.
#Four cluster solution
clusters_h <- cutree(h_fifa, k = 4)
#Quality assessment through silhouette scores
sil_h <- silhouette(clusters_h, d_fifa)
summary(sil_h)

Silhouette of 1000 units in 4 clusters from silhouette.default(x = clusters_h, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
       492        193        107        208
0.30592133 0.28830631 0.69464229 0.08230691
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.3653  0.1461  0.3160  0.2976  0.4576  0.7958
The four-cluster hierarchical solution has an overall mean silhouette score of approximately 0.30, which falls in the “moderate structure” range (0.25–0.50). The per-cluster silhouettes are uneven. One cluster (107 players) has an excellent score of approximately 0.69: these players are tightly grouped and well separated from everyone else. The largest cluster (492 players) scores around 0.31 and the 193-player cluster scores 0.29. The remaining cluster of 208 players scores only 0.08, meaning those players sit very close to the boundary of another cluster and are only weakly assigned.
The minimum individual silhouette is −0.37, indicating a small number of players who would actually be a better fit in a different cluster. Overall, the solution captures real but not highly distinct structure in the data.
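The silhouette width behind these numbers is s(i) = (b − a) / max(a, b), where a is the mean distance from player i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The sketch below recomputes it by hand on simulated data and compares the result with cluster::silhouette(); the data and object names are assumptions for illustration, not the FIFA analysis itself.

```r
library(cluster)

set.seed(7)
# Simulated two-attribute data in three groups (illustrative only)
x  <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
            matrix(rnorm(60, mean = 5),  ncol = 2),
            matrix(rnorm(60, mean = 10), ncol = 2))
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)
dm <- as.matrix(dist(x))

# s(i) = (b - a) / max(a, b), where
#   a = mean distance from i to the other members of its own cluster
#   b = smallest mean distance from i to the members of any other cluster
sil_by_hand <- sapply(seq_len(nrow(x)), function(i) {
  own    <- cl == cl[i]
  own[i] <- FALSE                       # exclude the point itself
  a <- mean(dm[i, own])
  other <- cl != cl[i]
  b <- min(tapply(dm[i, other], cl[other], mean))
  (b - a) / max(a, b)
})

sil_pkg <- silhouette(cl, dist(x))[, "sil_width"]
print(all.equal(unname(sil_by_hand), unname(sil_pkg)))
print(mean(sil_by_hand))                # overall mean silhouette width
```

A negative value arises whenever a exceeds b, that is, when a point is on average closer to a neighbouring cluster than to its own.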
#attach the cluster IDs to the whole data set and label them accordingly
fifa_clus_h <- cbind(fifa, clusters_h)
fifa_clus_h <- mutate(fifa_clus_h, Cluster = case_when(clusters_h == 1 ~ 'C1',
                                                        clusters_h == 2 ~ 'C2',
                                                        clusters_h == 3 ~ 'C3',
                                                        clusters_h == 4 ~ 'C4'))

#average of each of the 6 attributes by cluster
fifa_clus_h_means <- fifa_clus_h %>%
group_by(Cluster) %>%
summarise(acceleration = mean(acceleration),
ball_control = mean(ball_control),
dribbling = mean(dribbling),
shot_power = mean(shot_power),
short_passing = mean(short_passing),
sprint_speed = mean(sprint_speed))
fifa_clus_h_means

# A tibble: 4 × 7
  Cluster acceleration ball_control dribbling shot_power short_passing
  <chr>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 C1              80.8         81.1      80.4       77.0          77.9
2 C2              65.2         79.0      73.9       77.2          79.4
3 C3              48.5         23.7      16.1       25.1          33.0
4 C4              57.0         66.3      55.8       63.3          70.3
# ℹ 1 more variable: sprint_speed <dbl>
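The per-cluster means can also be computed without repeating mean() for every column. The following is a minimal sketch using dplyr's across() on a toy data frame; `toy` and its values are made up and stand in for the clustered FIFA data.

```r
library(dplyr)

# Toy stand-in for the clustered data (made-up values, not the FIFA data)
toy <- data.frame(
  Cluster      = c("C1", "C1", "C2", "C2"),
  acceleration = c(80, 84, 50, 46),
  dribbling    = c(78, 82, 20, 14)
)

# across() applies mean() to every listed column in a single call
toy_means <- toy %>%
  group_by(Cluster) %>%
  summarise(across(c(acceleration, dribbling), mean))

print(toy_means)
```

The result has one row per cluster and one column per attribute, the same shape as the table above, so it can be passed to pivot_longer() in the same way.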
#tidy + line graph
fifa_clus_h_tidy <- fifa_clus_h_means %>%
pivot_longer(cols = c(acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed),
names_to = "Attribute", values_to = "Average_Value")
#reorder attributes for a more digestible line graph
fifa_clus_h_tidy$Attribute <- factor(fifa_clus_h_tidy$Attribute,
levels = c("acceleration", "sprint_speed", "ball_control", "dribbling", "short_passing", "shot_power"))
ggplot(fifa_clus_h_tidy, aes(x = Attribute, y = Average_Value, group = Cluster, colour = Cluster)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
ylab("Mean Score (1-100)") +
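  ggtitle("Hierarchical Clusters - Mean Score per Attribute")

The four hierarchical clusters describe four clearly different player profiles:
- C1 (492 players): high on every attribute, with acceleration = 81 and ball_control = 81. These are the fastest, most technically gifted players in the dataset.
- C2 (193 players): high on the technical attributes (ball_control = 79, short_passing = 79, shot_power = 77) but markedly lower on the speed-based attributes (acceleration and sprint_speed both ≈ 65). These are skilled but slower players.
- C3 (107 players): very low on the outfield skills (ball_control = 24, dribbling = 16, shot_power = 25), with only moderate acceleration and sprint_speed (~49). This profile fits goalkeepers, who are not measured on outfield skills.
- C4 (208 players): below C1 and C2 on each of the five attributes shown in the table (acceleration = 57, ball_control = 66, dribbling = 56, shot_power = 63, short_passing = 70), but well above the goalkeeper cluster on all of them.

#profile by age, value, wage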
fifa_clus_h_demo <- fifa_clus_h %>%
group_by(Cluster) %>%
summarise(mean_age = mean(age),
mean_value = mean(value),
mean_wage = mean(wage))
fifa_clus_h_demo

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <chr>      <dbl>      <dbl>     <dbl>
1 C1          26.3  21355081.    79226.
2 C2          28.1  16098446.    65772.
3 C3          29.1  14350935.    51766.
4 C4          28.0  13387500     58260.
#bar charts for age, value and wage
fifa_clus_h_demo_tidy <- fifa_clus_h_demo %>%
pivot_longer(cols = c(mean_age, mean_value, mean_wage), names_to = "Variable", values_to = "Mean")
ggplot(fifa_clus_h_demo_tidy, aes(x = Cluster, y = Mean, fill = Cluster)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Variable, scales = "free_y") +
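  ggtitle("Hierarchical Clusters - Age, Value, Wage")

The demographic profiles align closely with the attribute profiles. C1, the strongest cluster on the six attributes, is the youngest (mean age 26.3) and has the highest mean value (about 21.4 million) and mean wage (about 79,200). C3, the goalkeeper cluster, is the oldest (29.1) and has the lowest mean wage (about 51,800), while C4 has the lowest mean value (about 13.4 million).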
set.seed(101)
kmeans_fifa <- kmeans(fifa_attrs, centers = 4)
#inspection of the result
kmeans_fifa$cluster # cluster assignment for each player
[1] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 4 3 4 2 2 4 2 3 2 4 2 2 2 2 3 2 2 3 2 2
[38] 3 3 2 1 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 2 2 3 2 3 1 2 2 3 2 2 2 1 4 2 1 2
[75] 2 2 2 2 3 1 4 1 1 3 1 2 3 4 2 2 2 3 2 2 1 2 2 2 2 2 2 2 1 1 2 4 2 2 2 2 3
[112] 2 2 1 1 2 2 2 1 2 3 2 2 2 3 3 2 2 4 3 1 3 3 2 1 4 3 3 2 2 2 2 2 2 3 3 3 2
[149] 3 3 3 2 3 1 2 2 3 3 4 3 4 2 2 1 2 2 3 3 4 2 3 3 3 3 3 2 4 4 4 2 2 2 2 3 2
[186] 4 4 2 2 3 2 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 2 2 3 3 1 2 2 3 2 2 3 3 2 2 2
[223] 1 2 4 2 3 2 3 2 2 4 2 3 2 2 3 3 2 2 3 2 2 4 2 2 2 2 3 3 1 1 1 4 3 3 2 2 3
[260] 2 2 2 3 1 2 2 3 3 3 2 2 2 3 2 2 2 2 3 1 1 3 3 1 3 2 2 2 3 3 2 2 3 2 2 3 3
[297] 4 1 2 2 2 1 2 2 3 4 2 2 2 3 2 2 2 2 2 2 2 4 4 1 3 2 4 2 3 3 2 2 1 3 3 2 2
[334] 2 2 2 3 4 3 3 2 2 3 2 2 1 1 3 2 2 3 2 2 2 1 2 2 2 3 3 2 2 1 3 3 2 2 1 3 1
[371] 3 2 3 2 2 1 2 2 3 2 2 2 3 1 3 2 2 1 2 3 2 3 3 3 2 2 2 2 3 1 3 2 2 2 2 2 2
[408] 2 2 3 2 2 1 2 1 3 2 3 4 2 3 2 2 2 3 2 2 2 2 2 2 4 1 1 4 3 3 2 3 3 3 2 2 2
[445] 3 1 1 3 2 2 2 2 4 2 2 2 2 3 1 2 3 2 3 1 1 1 2 2 3 2 2 2 2 2 3 3 2 2 4 4 4
[482] 2 2 3 4 4 2 3 2 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 2 2 2 3 3 3 3 3 2 1 4 4
[519] 4 3 3 3 2 2 3 1 3 1 2 2 2 1 1 2 3 1 1 1 1 2 2 2 2 3 2 2 2 2 3 3 3 2 2 1 3
[556] 2 2 4 2 2 2 2 2 3 2 2 2 2 1 2 1 2 2 3 3 2 2 3 2 2 3 3 2 2 2 3 4 4 3 1 3 3
[593] 4 2 2 2 3 2 1 3 3 1 1 2 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 3 1 1 2 2 4 1 2 2
[630] 3 3 1 2 3 2 2 1 3 3 2 2 2 2 2 1 1 3 1 2 2 2 1 4 1 1 1 2 3 3 2 3 3 1 3 3 3
[667] 1 1 1 1 2 2 2 1 1 3 4 2 1 2 3 4 4 4 3 3 3 4 4 2 2 2 2 3 1 1 2 3 1 1 2 2 2
[704] 2 2 4 4 3 3 3 2 2 2 3 1 2 2 2 2 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 2 3 2 2 2 1
[741] 1 3 1 3 1 1 3 1 1 4 3 1 1 1 4 4 4 2 3 1 2 1 1 1 3 1 1 4 4 2 2 2 2 2 2 2 2
[778] 2 2 3 2 3 1 1 1 3 3 3 3 3 1 1 2 1 3 1 3 1 1 4 4 3 3 3 3 3 3 1 2 3 1 4 4 2
[815] 2 3 3 3 2 1 1 2 2 1 2 4 1 4 3 4 3 2 2 4 4 2 2 2 3 3 1 1 2 3 1 1 1 3 1 1 1
[852] 2 4 1 4 3 4 4 4 4 4 2 2 3 3 3 1 3 1 2 2 4 4 2 2 3 4 4 2 2 2 2 3 3 3 1 1 4
[889] 4 4 4 4 3 3 1 2 3 2 1 3 1 3 1 1 4 3 3 1 3 1 1 1 3 2 3 4 4 3 4 4 4 4 3 3 1
[926] 1 4 4 4 4 3 3 1 1 3 1 3 2 2 1 3 1 1 1 3 4 3 1 1 1 1 3 3 2 3 4 3 1 3 4 2 3
[963] 3 3 3 2 2 3 1 3 4 3 4 1 3 3 1 1 1 3 4 1 3 4 1 1 1 1 1 4 4 4 1 2 3 3 3 3 2
[1000] 1
kmeans_fifa$centers # the 4 centroids (mean of each attribute per cluster)
  acceleration ball_control dribbling shot_power short_passing sprint_speed
1     58.16667     64.53448  53.31034   60.32759      68.72989     62.16092
2     82.05568     81.30858  81.06497   77.12297      77.85151     81.35731
3     65.00346     78.59862  73.58478   76.90657      78.97924     65.09689
4     48.31132     23.44340  15.88679   25.06604      32.84906     49.17925
kmeans_fifa$size # number of players in each cluster
[1] 174 431 289 106
kmeans_fifa$iter # iterations until the algorithm converged
[1] 4
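set.seed(101) is required because kmeans() chooses its starting centroids at random, so fixing the seed makes the cluster assignments reproducible. K-means is fed the raw attribute data (not the distance matrix) and asked to find four cluster centroids.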
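The effect of the seed can be seen in a small sketch on simulated data (not the FIFA attributes; the objects `x`, `km_a`, `km_b` and `km_multi` are made up for the example). It also shows the nstart argument of kmeans(), which this report does not use but which is a common way to reduce dependence on a single random start.

```r
# Simulated data: 150 points around three centres (illustrative only)
set.seed(3)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Same seed, same starting centroids, same result
set.seed(101); km_a <- kmeans(x, centers = 3)
set.seed(101); km_b <- kmeans(x, centers = 3)
print(identical(km_a$cluster, km_b$cluster))

# nstart = 25 tries 25 random starts and keeps the solution with the
# lowest total within-cluster sum of squares
set.seed(101)
km_multi <- kmeans(x, centers = 3, nstart = 25)
print(c(single_start = km_a$tot.withinss, multi_start = km_multi$tot.withinss))
```

Using several starts can only match or lower the total within-cluster sum of squares relative to the first start, at the cost of extra computation.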
sil_k <- silhouette(kmeans_fifa$cluster, d_fifa)
summary(sil_k)

Silhouette of 1000 units in 4 clusters from silhouette.default(x = kmeans_fifa$cluster, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
      174       431       289       106
0.1777123 0.3802751 0.2177741 0.6883372
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.1060  0.1807  0.3313  0.3307  0.4788  0.7898
The K-means solution achieves an overall mean silhouette of approximately 0.33, again in the “moderate structure” range. Per-cluster silhouettes are 0.18, 0.38, 0.22, and 0.69. The cluster with a silhouette of about 0.69 (106 players) is the same goalkeeper-style cluster identified by hierarchical clustering, and it is again tight and well separated. The remaining three clusters score between 0.18 and 0.38; two of them (0.18 and 0.22) fall below the 0.25 threshold for moderate structure, but none is as weak as the hierarchical solution’s 0.08 cluster.
Cluster sizes (174, 431, 289, 106) are also more evenly balanced than the hierarchical solution (107, 193, 208, 492). Both observations suggest K-means has produced a slightly cleaner partition.
#attaching cluster IDs
fifa_clus_k <- fifa %>%
mutate(clusters_k = kmeans_fifa$cluster) %>%
mutate(Cluster = case_when(clusters_k == 1 ~ 'C1',
clusters_k == 2 ~ 'C2',
clusters_k == 3 ~ 'C3',
                             clusters_k == 4 ~ 'C4'))

#profile on the 6 attributes (same method as hierarchical)
fifa_clus_k_means <- fifa_clus_k %>%
group_by(Cluster) %>%
summarise(acceleration = mean(acceleration),
ball_control = mean(ball_control),
dribbling = mean(dribbling),
shot_power = mean(shot_power),
short_passing = mean(short_passing),
sprint_speed = mean(sprint_speed))
fifa_clus_k_means

# A tibble: 4 × 7
  Cluster acceleration ball_control dribbling shot_power short_passing
  <chr>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 C1              58.2         64.5      53.3       60.3          68.7
2 C2              82.1         81.3      81.1       77.1          77.9
3 C3              65.0         78.6      73.6       76.9          79.0
4 C4              48.3         23.4      15.9       25.1          32.8
# ℹ 1 more variable: sprint_speed <dbl>
fifa_clus_k_tidy <- fifa_clus_k_means %>%
pivot_longer(cols = c(acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed),
names_to = "Attribute", values_to = "Average_Value")
fifa_clus_k_tidy$Attribute <- factor(fifa_clus_k_tidy$Attribute,
levels = c("acceleration", "sprint_speed", "ball_control", "dribbling", "short_passing", "shot_power"))
ggplot(fifa_clus_k_tidy, aes(x = Attribute, y = Average_Value, group = Cluster, colour = Cluster)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
ylab("Mean Score (1-100)") +
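  ggtitle("K-means Clusters - Mean Score per Attribute")

K-means produces four clusters with profiles almost identical to those from hierarchical clustering, although the cluster numbering (C1–C4) is shuffled because the algorithms assign IDs differently. Matching the attribute means gives: K-means C2 corresponds to hierarchical C1 (fast and technically strong), K-means C3 to hierarchical C2 (skilled but slower), K-means C4 to hierarchical C3 (goalkeepers), and K-means C1 to hierarchical C4 (the lower-rated outfield cluster).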
#profile by age, value, wage
fifa_clus_k_demo <- fifa_clus_k %>%
group_by(Cluster) %>%
summarise(mean_age = mean(age),
mean_value = mean(value),
mean_wage = mean(wage))
fifa_clus_k_demo

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <chr>      <dbl>      <dbl>     <dbl>
1 C1          27.7  13317241.    58540.
2 C2          26.2  21964037.    80708.
3 C3          28.1  15971626.    65225.
4 C4          29.0  14475000     51972.
fifa_clus_k_demo_tidy <- fifa_clus_k_demo %>%
pivot_longer(cols = c(mean_age, mean_value, mean_wage), names_to = "Variable", values_to = "Mean")
ggplot(fifa_clus_k_demo_tidy, aes(x = Cluster, y = Mean, fill = Cluster)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Variable, scales = "free_y") +
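  ggtitle("K-means Clusters - Age, Value, Wage")

The K-means demographic profiles also closely mirror the hierarchical results. Once the cluster labels are matched, the two methods agree on mean age, value and wage to within a few percent. For example, the youngest and most valuable cluster has a mean value of about 22.0 million under K-means and 21.4 million under hierarchical clustering.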
#comparing overall silhouette scores
mean_sil_h <- mean(sil_h[, 3]) # hierarchical average silhouette
mean_sil_k <- mean(sil_k[, 3]) # k-means average silhouette
mean_sil_h
[1] 0.297603
mean_sil_k
[1] 0.330721
#cross-tabulate the two cluster assignments
table(Hierarchical = clusters_h, KMeans = kmeans_fifa$cluster)

            KMeans
Hierarchical   1   2   3   4
           1   3 428  61   0
           2   1   3 189   0
           3   1   0   0 106
           4 169   0  39   0
K-means produced the higher-quality clustering solution. Its overall mean silhouette score was 0.331, compared with 0.298 for the hierarchical (Ward’s) solution, a difference of about 0.033 (roughly 11% higher in relative terms). K-means also achieved more balanced cluster sizes (range 106-431) than hierarchical clustering (range 107-492) and avoided the very weak cluster in the hierarchical solution (per-cluster silhouette of only 0.08, indicating those players are barely closer to their assigned cluster than to others). Both scores fall in the “moderate structure” range, meaning that genuine but not strongly distinct groupings exist in the data.
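The two methods also produced similar clusters. Despite the difference in algorithm, both produced four clusters with strikingly similar profiles, and each identified the same four player archetypes:
- Elite: fast and technically strong (hierarchical C1, K-means C2).
- Technical: skilled but slower (hierarchical C2, K-means C3).
- Goalkeepers: very low outfield ratings (hierarchical C3, K-means C4).
- A lower-rated outfield group (hierarchical C4, K-means C1).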
For three of the four matched pairs, the attribute means shown in the profile tables agree to within about 1 point; the lower-rated outfield pair differs by up to about 3 points (shot_power 63.3 vs 60.3). Demographic profiles agree to within a few percent. The cross-tabulation confirms this: 892 of the 1000 players (428 + 189 + 106 + 169) fall along a “diagonal” mapping between the two methods (with the row-column labels permuted), meaning both algorithms put the same players in the same conceptual groups.
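The agreement between the two assignments can be quantified directly from the cross-tabulation. The sketch below re-enters the counts from the table() output above and matches each hierarchical cluster to the K-means cluster that holds most of its players. The object names are assumptions, and this simple row-wise matching is only a heuristic: it is not guaranteed to give a one-to-one mapping in general, although it does for these counts.

```r
# Cross-tabulation copied from the table() output
# (rows: hierarchical clusters, columns: K-means clusters)
tab <- matrix(c(  3, 428,  61,   0,
                  1,   3, 189,   0,
                  1,   0,   0, 106,
                169,   0,  39,   0),
              nrow = 4, byrow = TRUE,
              dimnames = list(Hierarchical = as.character(1:4),
                              KMeans = as.character(1:4)))

# For each hierarchical cluster, the K-means cluster holding most of its players
best_match <- apply(tab, 1, which.max)
print(best_match)

# Players that fall in a matched pair, as a count and as a share of all players
matched <- sum(tab[cbind(1:4, best_match)])
print(matched)              # 892
print(matched / sum(tab))   # 0.892
```

The remaining 108 players are those that one method places in a different archetype from the other, mostly at the boundary of the technical cluster.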
The most noticeable difference is at the boundary between the elite and technical clusters: hierarchical clustering placed 492 players in the elite cluster against 431 for K-means. The cross-tabulation shows that 61 of the hierarchical elite players were assigned to the K-means technical cluster, which, together with 39 players from the hierarchical C4 cluster, explains why the technical cluster grows from 193 to 289. K-means is more conservative about who qualifies as elite, which may partly explain its silhouette advantage. The goalkeeper cluster, in contrast, is effectively identical across methods (107 vs 106 players, near-identical attribute means and a per-cluster silhouette of 0.69 in both cases); it is the clearest group in the dataset.