Football leagues vary widely in their economic resources, organizational structures, and competitive environments. While some leagues are characterized by high market values and strong internationalization, others operate with more limited financial capacity and a predominantly domestic focus. These differences make direct comparisons challenging when relying solely on rankings or isolated performance indicators.
This paper applies clustering analysis to identify groups of football leagues based on structural and economic characteristics. Using standardized indicators such as league size, player composition, goal-scoring intensity, and market value, the study employs unsupervised learning methods to uncover patterns of similarity among leagues without imposing predefined classifications. Multiple clustering techniques are used to assess the robustness of the results and to capture league heterogeneity at different levels of aggregation.
By providing an empirical classification of football leagues, this analysis contributes to a clearer understanding of structural differences across leagues and offers a data-driven framework for comparative research in professional football.
The dataset used in this study comprises information on 187 football leagues worldwide. League-level data were collected from the Transfermarkt website, a widely used source for football-related economic and structural information. In several countries, the dataset includes more than one league level, reflecting the structure of Transfermarkt’s league classifications. Additionally, in some countries the competitive season is divided into Apertura and Clausura tournaments; for the purposes of this analysis, these were combined into a single full-season observation.
Data were collected in December 2025, at a point when leagues operating on a calendar-year schedule (spring–autumn) had completed their seasons, while leagues following a winter–spring schedule were approximately midway through their respective competitions. This timing ensures broad coverage across leagues while acknowledging differences in seasonal structures.
The dataset includes four key features used for the clustering analysis:
library(sf)
library(dplyr)
library(ggplot2)
library(cluster)
library(clustertend)
library(hopkins)
library(factoextra)
library(fpc)
library(gridExtra)
library(dendextend)
library(tidyverse)
leagues <- read.csv('Scrapping/leagues_normalized.csv')
head(leagues)
## Country League_name League_tier Goals_per_match
## 1 Albania Kategoria Superiore 1 2.020
## 2 Albania Kategoria e Parë 2 2.250
## 3 Algeria Ligue 1 1 2.060
## 4 Andorra Primera Divisió 1 2.850
## 5 Andorra Segona Divisió 2 4.920
## 6 Argentina Torneo Apertura 1 1.965
## Foreign_Player_Share Players_per_club Value_per_club
## 1 0.4888060 26.80000 4841000
## 2 0.3645485 24.91667 1211667
## 3 0.1007026 26.68750 6056250
## 4 0.8785047 21.40000 2617000
## 5 0.4024390 30.75000 163750
## 6 0.1589189 30.83333 32348333
summary(leagues)
## Country League_name League_tier Goals_per_match
## Length:187 Length:187 Min. :1.000 Min. :1.500
## Class :character Class :character 1st Qu.:1.000 1st Qu.:2.455
## Mode :character Mode :character Median :1.000 Median :2.690
## Mean :1.374 Mean :2.783
## 3rd Qu.:2.000 3rd Qu.:3.140
## Max. :2.000 Max. :5.340
## Foreign_Player_Share Players_per_club Value_per_club
## Min. :0.0000 Min. : 7.571 Min. : 714
## 1st Qu.:0.1429 1st Qu.:26.408 1st Qu.: 1589688
## Median :0.2449 Median :28.125 Median : 3900000
## Mean :0.2792 Mean :28.258 Mean : 18380874
## 3rd Qu.:0.3767 3rd Qu.:29.800 3rd Qu.: 10275179
## Max. :0.8785 Max. :45.350 Max. :626500000
l_data <- leagues[, c(4:7)]
l_data_z <- as.data.frame(lapply(l_data, scale))
head(l_data_z)
## Goals_per_match Foreign_Player_Share Players_per_club Value_per_club
## 1 -1.3807193 1.1899693 -0.3824057 -0.2274248
## 2 -0.9646218 0.4844137 -0.8764967 -0.2883855
## 3 -1.3083545 -1.0137492 -0.4119200 -0.2070126
## 4 0.1208498 3.4027462 -1.7990914 -0.2647806
## 5 3.8657271 0.6995628 0.6538737 -0.3059870
## 6 -1.4802209 -0.6831867 0.6757361 0.2346068
l_data_matrix <- as.matrix(l_data_z)
set.seed(3)
hopkins_vals <- replicate(50, round(hopkins(l_data_matrix), digits=2))
mean(hopkins_vals)
## [1] 0.9888
The Hopkins statistic equals 0.9888, indicating a strong tendency toward cluster formation and suggesting that the data are suitable for clustering analysis.
The first algorithm applied in this study is K-means clustering, whose objective is to minimize the within-cluster sum of squared distances between observations and their corresponding cluster centroids.
As a first step in the clustering process, the optimal number of clusters is determined using the elbow method and the silhouette statistic.
plot_wws_kmeans <- fviz_nbclust(l_data_z, kmeans, method = "wss")
plot_silhouette_kmeans <- fviz_nbclust(l_data_z, kmeans, method = "silhouette")
grid.arrange(plot_wws_kmeans, plot_silhouette_kmeans, ncol=2)
Both methods suggest that four clusters represent the optimal solution. However, the average silhouette width is relatively low, indicating that the difference between three and four clusters may be small. To ensure robustness, a three-cluster solution is also evaluated using the average silhouette width and the Calinski–Harabasz index to determine which number of clusters provides a better fit to the data.
set.seed(3)
kmeans4 <- kmeans(l_data_z, 4, 25)
sil_kmeans4 <- silhouette(kmeans4$cluster, dist(l_data_z))
fviz_silhouette(sil_kmeans4)
## cluster size ave.sil.width
## 1 1 42 0.23
## 2 2 71 0.25
## 3 3 5 0.36
## 4 4 69 0.36
round(calinhara(l_data_z, kmeans4$cluster), digits = 2)
## [1] 68
The four-cluster solution achieves an average silhouette width of 0.29 and a Calinski–Harabasz index of 68, indicating a moderate clustering structure.
set.seed(3)
kmeans3 <- kmeans(l_data_z, 3)
sil_kmeans3 <- silhouette(kmeans3$cluster, dist(l_data_z))
fviz_silhouette(sil_kmeans3)
## cluster size ave.sil.width
## 1 1 115 0.19
## 2 2 67 0.37
## 3 3 5 0.40
round(calinhara(l_data_z, kmeans3$cluster), digits = 2)
## [1] 63.12
The three-cluster solution achieves an average silhouette width of 0.26 and a Calinski–Harabasz index of 63.12, both of which are lower than the corresponding values for the four-cluster solution. Therefore, four clusters are selected as the final solution for the K-means algorithm.
kmeans_clusters <- fviz_cluster(kmeans4, data = l_data_z, frame = FALSE, geom = "point", main = "K-means (k=4)")
kmeans_clusters
Cluster labels are assigned to the dataset, and information on the silhouette width and Calinski–Harabasz index is stored for subsequent analysis.
leagues$kmeans_clusters <- kmeans4$cluster
kmeans_desc <- c("K-means", round(mean(sil_kmeans4[,3]),2), round(calinhara(l_data_z, kmeans4$cluster), digits = 2))
The next algorithm applied is Partitioning Around Medoids (PAM), whose objective is to minimize the sum of dissimilarities between observations and their assigned medoids.
As in the previous analysis, the clustering process begins with determining the optimal number of clusters using the elbow method and the silhouette statistic.
plot_wws_pam <- fviz_nbclust(l_data_z, pam, method = "wss")
plot_silhouette_pam <- fviz_nbclust(l_data_z, pam, method = "silhouette")
grid.arrange(plot_wws_pam, plot_silhouette_pam, ncol=2)
The elbow plot indicates a clear reduction in within-cluster variance up to five clusters, after which improvements become marginal. The silhouette analysis reaches its maximum at eight clusters; however, the increase relative to lower values of k is small, suggesting a plateau rather than a distinct optimum. To ensure robustness, both five- and eight-cluster solutions are evaluated using the average silhouette width and the Calinski–Harabasz index to determine which number of clusters provides a better fit to the data.
set.seed(3)
pam8 <- pam(l_data_z, 8)
sil_pam8 <- silhouette(pam8$cluster, dist(l_data_z))
fviz_silhouette(sil_pam8)
## cluster size ave.sil.width
## 1 1 45 0.22
## 2 2 31 0.19
## 3 3 42 0.36
## 4 4 4 0.38
## 5 5 11 0.40
## 6 6 36 0.24
## 7 7 13 0.12
## 8 8 5 0.33
round(calinhara(l_data_z, pam8$cluster), digits = 2)
## [1] 60.64
The eight-cluster solution achieves an average silhouette width of 0.26 and a Calinski–Harabasz index of 60.64.
set.seed(3)
pam5 <- pam(l_data_z, 5)
sil_pam5 <- silhouette(pam5$cluster, dist(l_data_z))
fviz_silhouette(sil_pam5)
## cluster size ave.sil.width
## 1 1 49 0.27
## 2 2 47 0.35
## 3 3 43 0.13
## 4 4 43 0.24
## 5 5 5 0.33
round(calinhara(l_data_z, pam5$cluster), digits = 2)
## [1] 68.74
The five-cluster solution achieves an average silhouette width of 0.26 and a Calinski–Harabasz index of 68.74. The Calinski-Harabasz index is slightly higher than in the eight-cluster solution. Therefore, the five-cluster solution is selected as the final model.
pam_clusters <- fviz_cluster(pam5, data = l_data_z, frame = FALSE, geom = "point", main = "PAM (k=5)")
pam_clusters
Cluster labels are assigned to the dataset, and information on the silhouette width and Calinski–Harabasz index is stored for subsequent analysis.
leagues$pam_clusters <- pam5$cluster
pam_desc <- c("PAM", round(mean(sil_pam5[,3]),2), round(calinhara(l_data_z, pam5$cluster), digits = 2))
The final algorithm applied in this study is hierarchical clustering, implemented using both agglomerative and divisive approaches. Given that the K-means and PAM analyses identified four and five clusters, respectively, both cluster numbers are subsequently examined within the hierarchical clustering framework for comparison.
As an initial step, the most appropriate linkage method is selected for agglomerative hierarchical clustering.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(l_data_z, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.9452725 0.9209757 0.9498708 0.9675469
The Ward linkage method is identified as the most suitable approach, achieving the highest agglomerative coefficient, with a value close to 0.968.
aggl <- agnes(l_data_z, method = "ward")
round(aggl$ac, digits=2)
## [1] 0.97
pltree(aggl, cex = 0.6, hang = -1, main = "Dendrogram - agglomerative approach")
set.seed(3)
hc_agg5 <- cutree(aggl, k = 5)
sil_agg5 <- silhouette(hc_agg5, dist(l_data_z))
fviz_silhouette(sil_agg5)
## cluster size ave.sil.width
## 1 1 40 0.34
## 2 2 116 0.21
## 3 3 4 0.51
## 4 4 22 0.42
## 5 5 5 0.33
round(calinhara(l_data_z, hc_agg5), digits = 2)
## [1] 61.45
The hierarchical clustering solution achieves an average silhouette width of 0.27 and a Calinski–Harabasz index of 61.45, indicating a moderate clustering structure.
set.seed(3)
hc_agg4 <- cutree(aggl, k = 4)
sil_agg4 <- silhouette(hc_agg4, dist(l_data_z))
fviz_silhouette(sil_agg4)
## cluster size ave.sil.width
## 1 1 40 0.36
## 2 2 120 0.16
## 3 3 22 0.45
## 4 4 5 0.33
round(calinhara(l_data_z, hc_agg4), digits = 2)
## [1] 58.39
The hierarchical clustering solution with four clusters achieves lower values for both the average silhouette width and the Calinski–Harabasz index. Therefore, the five-cluster solution is selected for the final visualization.
hc_agg_clusters <- fviz_cluster(list(data = l_data_z, cluster = hc_agg5), frame = FALSE, geom = "point", main = "Hierarchical Agglomerative (k=5)")
hc_agg_clusters
Cluster labels are assigned to the dataset, and information on the silhouette width and Calinski–Harabasz index is stored for subsequent analysis.
leagues$agg_clusters <- hc_agg5
hc_agg_desc <- c("Agglomerative hierarichical", round(mean(sil_agg5[,3]),2), round(calinhara(l_data_z, hc_agg5), digits = 2))
div <- diana(l_data_z)
round(div$dc, digits=2)
## [1] 0.94
The divisive coefficient equals 0.94, indicating that the divisive hierarchical clustering captures a meaningful underlying structure in the data.
pltree(div, cex = 0.6, hang = -1, main = "dendrogram - divisive approach")
set.seed(3)
hc_div5 <- cutree(div, k = 5)
sil_div5 <- silhouette(hc_div5, dist(l_data_z))
fviz_silhouette(sil_div5)
## cluster size ave.sil.width
## 1 1 66 0.24
## 2 2 115 0.28
## 3 3 4 0.52
## 4 4 1 0.00
## 5 5 1 0.00
round(calinhara(l_data_z, hc_div5), digits = 2)
## [1] 40.38
The hierarchical clustering solution achieves an average silhouette width of 0.27 and a Calinski–Harabasz index of 40.38, indicating a moderate clustering structure.
set.seed(3)
hc_div4 <- cutree(div, k = 4)
sil_div4 <- silhouette(hc_div4, dist(l_data_z))
fviz_silhouette(sil_div4)
## cluster size ave.sil.width
## 1 1 66 0.27
## 2 2 119 0.23
## 3 3 1 0.00
## 4 4 1 0.00
round(calinhara(l_data_z, hc_div4), digits = 2)
## [1] 36.9
The hierarchical clustering solution with four clusters achieves lower values for both the average silhouette width and the Calinski–Harabasz index. Therefore, the five-cluster solution is selected for the final visualization.
hc_div_clusters <- fviz_cluster(list(data = l_data_z, cluster = hc_div5), frame = FALSE, geom = "point", main = "Hierarchical Divisive (k=5)")
hc_div_clusters
Cluster labels are assigned to the dataset, and information on the silhouette width and Calinski–Harabasz index is stored for subsequent analysis.
leagues$div_clusters <- hc_div5
hc_div_desc <- c("Divisive hierarichical", round(mean(sil_div5[,3]),2), round(calinhara(l_data_z, hc_div5), digits = 2))
clustering_desc <- rbind(kmeans_desc, pam_desc, hc_agg_desc, hc_div_desc)
rownames(clustering_desc) <- NULL
clustering_desc
## [,1] [,2] [,3]
## [1,] "K-means" "0.29" "68"
## [2,] "PAM" "0.26" "68.74"
## [3,] "Agglomerative hierarichical" "0.27" "61.45"
## [4,] "Divisive hierarichical" "0.27" "40.38"
Among the four methods, K-means achieves the highest average silhouette width and one of the largest Calinski–Harabasz index values, indicating the best overall clustering quality. The remaining methods yield slightly lower validation scores, supporting the selection of K-means as the primary clustering approach.
grid.arrange(kmeans_clusters, pam_clusters, hc_agg_clusters, hc_div_clusters, nrow = 2)
The cluster visualizations reveal broadly consistent structural patterns across methods, with a clear separation between high-value leagues and smaller, more domestic leagues along the first principal dimension. K-means and PAM produce more compact and well-separated clusters, while hierarchical methods capture coarser groupings with greater overlap between clusters. Overall, the visual evidence aligns with the internal validation results, supporting the selection of K-means as the primary clustering approach.
Cluster 4 in the K-means method, Cluster 5 in the PAM method, and Cluster 5 in the agglomerative hierarchical method correspond to the same group of leagues, comprising the five strongest European leagues: the Premier League (England), La Liga (Spain), Serie A (Italy), the Bundesliga (Germany), and Ligue 1 (France). In contrast, under the divisive hierarchical clustering approach, Cluster 4 contains only the Premier League, highlighting its substantial structural and economic differences relative to the other leagues.
Cluster 1 in the K-means method and Cluster 3 in the PAM method correspond to a similar group of the weakest leagues in the dataset. In the agglomerative and divisive hierarchical clustering methods, Cluster 3 consists of only four leagues, which represent the weakest group overall: Segona Divisió (second tier of Andorra), the Bhutan Premier League (top tier of Bhutan), PFL (top tier of Philippines), and Liga Puerto Rico (top tier of Puerto Rico).
In the K-means solution, Cluster 4 consists predominantly of leagues from Africa, South America, Asia, and Oceania, while Cluster 2 is mainly composed of leagues from Europe, the Near East, and North America.
In the PAM solution, Cluster 4 includes the relatively stronger leagues from Africa, South America, Asia, and Oceania. Consequently, Cluster 1 is dominated by top-tier leagues from Europe (including the English Championship as a second-tier league), the Near East, and North America. Cluster 2 primarily comprises second-tier leagues from various regions worldwide.
In the agglomerative hierarchical clustering solution, Clusters 4 and 1 contain fewer leagues than in the PAM solution, as weaker leagues are reassigned to Cluster 2. Additionally, several leagues previously belonging to Cluster 3 are also moved to Cluster 2. As a result, Cluster 3, which represents the weakest leagues, consists of substantially fewer observations.
In the divisive hierarchical clustering solution, Clusters 1 and 2 together contain 176 leagues out of 187, accounting for more than 94% of all clustered leagues. Weaker leagues are distributed across these two clusters while largely preserving continental divisions. Cluster 3 is identical to Cluster 3 in the agglomerative solution. However, one league is separated into a new Cluster 5, which contains only a single observation: the Nepal Super League. In the agglomerative method, this league was not grouped with the weakest leagues. This separation may indicate by the league’s small size and limited number of registered players.
Overall, the clustering analyses reveal consistent structural patterns across methods, with clear distinctions between top-tier, mid-level, and weaker football leagues. K-means and PAM provide more granular and compact cluster structures, while hierarchical methods capture broader divisions within the data. Despite methodological differences, the results consistently highlight substantial economic and structural disparities among leagues, supporting the robustness of the identified groupings.