Introduction

The UEFA Nations League is a biennial international football competition, founded in 2018 and contested by the senior men’s national teams of the member associations of UEFA.

Source: enskibarinn.is

Before the first edition of the competition all 55 European national teams were divided into four divisions A-D:

DIVISION A: Belgium, Croatia, England, France, Germany, Iceland, Italy, Netherlands, Poland, Portugal, Spain, Switzerland

DIVISION B: Austria, Bosnia and Herzegovina, Czech Republic, Denmark, Northern Ireland, Republic of Ireland, Russia, Slovakia, Sweden, Turkey, Ukraine, Wales

DIVISION C: Albania, Bulgaria, Cyprus, Estonia, Finland, Greece, Hungary, Israel, Lithuania, Montenegro, Norway, Romania, Scotland, Serbia, Slovenia

DIVISION D: Andorra, Armenia, Azerbaijan, Belarus, Faroe Islands, Georgia, Gibraltar, Kazakhstan, Kosovo, Latvia, Liechtenstein, Luxembourg, Malta, Moldova, North Macedonia, San Marino

The aim of this paper is to examine whether dividing all European national teams into four groups was optimal or whether another number of groups would be more appropriate. An additional aim is to check which teams should be allocated together based on the profiles of their individual players. The research was conducted using various clustering methods.

Dataset

The dataset from the game FIFA 19, containing profiles of over 18000 players from all over the world, has been used in this research (https://www.kaggle.com/karangadiya/fifa19). The dataset consists of 89 variables describing every player. At the beginning of the study the dataset was reduced to 10923 players (only those from member nations of the UEFA Nations League) and 44 variables, as the remaining 45 would not be useful for the aims of this study. The research is conducted for 52 national football teams (out of the 55 members of the UEFA Nations League), as the dataset does not contain players from San Marino, Gibraltar and North Macedonia.

Firstly, several variables (market value, wage, height and weight) had to be converted into numeric, comparable forms before players could be aggregated into national teams. Regular expressions have been used for this purpose. Additionally, rows with missing values have been removed.

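library(stringr)

# Market value in millions of euros: "600K" -> 0.6, "5M" -> 5, "110.5M" -> 110.5
value_num <- as.numeric(str_extract(players$Value, "[0-9]+(\\.[0-9]+)?"))
# Values without a K or M suffix become NA and are dropped by na.omit() below
players$Value <- ifelse(str_extract(players$Value, "[KM]") == "K",
                        value_num / 1000, value_num)
# Wage: numeric part of the string
players$Wage <- as.numeric(str_extract(players$Wage, "[0-9]+"))
# Height from feet'inches to centimetres
players$Height <- as.numeric(str_extract(str_extract(players$Height, "[0-9]{1}'"), "[0-9]{1}")) * 30.48 +
                  as.numeric(str_extract(str_extract(players$Height, "'[0-9]{1,2}"), "[0-9]{1,2}")) * 2.54
# Weight from pounds to kilograms
players$Weight <- as.numeric(str_extract(players$Weight, "[0-9]+")) * 0.45359237
players <- na.omit(players)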

The following table shows the first ten countries (in alphabetical order) with the number of players in the dataset.

##           Nationality no_players
## 1             Albania         34
## 2             Andorra          1
## 3             Armenia          8
## 4             Austria        277
## 5          Azerbaijan          5
## 6             Belarus          4
## 7             Belgium        242
## 8  Bosnia Herzegovina         54
## 9            Bulgaria         11
## 10            Croatia        122

Not every player can play in a national team, as only the best are called up by the coach. In order to show the real potential of every national team, only the 22 most valuable players (or fewer, if not enough are available) from each country have been chosen for further analysis.

players_top <-sqldf("SELECT * FROM (
                    SELECT *, row_number() over (partition by Nationality order by Value desc) as country_rank
                    FROM players )
                    WHERE country_rank <= 22")

Then the chosen players have been grouped by country, and the mean value of each variable has been calculated for each country.

nations <- players_top %>% group_by(Nationality) %>% summarise_all(mean)

Descriptive statistics of chosen variables:

##      Value             Wage             Height       BallControl   
##  Min.   : 0.290   Min.   :  1.000   Min.   :167.6   Min.   :40.20  
##  1st Qu.: 1.620   1st Qu.:  6.688   1st Qu.:181.4   1st Qu.:61.14  
##  Median : 4.989   Median : 16.227   Median :183.1   Median :65.93  
##  Mean   : 9.375   Mean   : 32.230   Mean   :182.7   Mean   :65.55  
##  3rd Qu.:10.458   3rd Qu.: 37.932   3rd Qu.:184.3   3rd Qu.:70.05  
##  Max.   :43.773   Max.   :159.545   Max.   :190.5   Max.   :82.86
## Number of teams: 52
## Number of variables: 43

Before clustering, the data has been standardized so that every variable has mean 0 and standard deviation 1.

nations_scale <- scale(nations_analysis)
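As a minimal sketch of what this step does, the snippet below applies `scale()` to a made-up three-row table and compares one column with a manually computed z-score. The `toy` data frame and its values are assumptions for illustration only and are not taken from the study's dataset.

```r
# Toy data: two hypothetical attributes on very different scales
toy <- data.frame(Value = c(0.3, 5.0, 43.8),      # e.g. millions of euros
                  BallControl = c(40, 66, 83))    # e.g. a 0-100 rating

toy_scaled <- scale(toy)   # subtract the column mean, divide by the column sd

# Manual z-score for the first column, for comparison
manual <- (toy$Value - mean(toy$Value)) / sd(toy$Value)

round(colMeans(toy_scaled), 10)   # both columns now have mean 0
apply(toy_scaled, 2, sd)          # and standard deviation 1
all.equal(as.numeric(toy_scaled[, "Value"]), manual)
```

Without this step, variables measured on large scales would dominate the Euclidean distances that the clustering methods rely on.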

Clustering

Clustering is an unsupervised learning method used to group a set of objects into clusters. Objects in each cluster should be more similar to each other than to objects in other clusters. In the case of football national teams examined in this paper, divisions in Nations League can be treated as clusters.

Clustering algorithms

The most popular clustering methods are K-means, PAM, CLARA and hierarchical clustering.

The K-means method partitions n observations into k clusters, where k is chosen beforehand. Its goal is to minimize the differences within clusters and maximize the differences between clusters. The algorithm assigns each observation to the cluster with the closest center (mean), recalculates the cluster centers, and repeats these two steps until the assignments no longer change. K-means is simple and flexible, but its result depends on the initial centers, so it is not guaranteed to find the optimal clusters.
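The two alternating steps of K-means can be sketched in a few lines of base R. This is an illustrative implementation on randomly generated toy data, not the code used in this study (which relies on the built-in `kmeans` function); the data, the seed and the variable names are assumptions, and the sketch does not handle the case of a cluster becoming empty.

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
k <- 2
centers <- x[sample(nrow(x), k), , drop = FALSE]  # random starting centers
assignment <- rep(0L, nrow(x))

repeat {
  # Assignment step: squared Euclidean distance of every point to every center
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  new_assignment <- max.col(-d, ties.method = "first")  # index of nearest center
  if (all(new_assignment == assignment)) break           # nothing changed: stop
  assignment <- new_assignment
  # Update step: each center becomes the mean of the points assigned to it
  for (j in seq_len(k)) {
    centers[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
}
table(assignment)
```

Running the loop from different starting centers can give different partitions, which is why `kmeans` offers the `nstart` argument to try several random starts.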

The PAM (Partitioning Around Medoids) method is very similar to K-means. The main difference is that PAM chooses actual data points as cluster centers (medoids), while in K-means the cluster centers do not have to be data points. PAM is suitable for small datasets and handles outliers better than K-means.
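The difference between a mean and a medoid is easiest to see on a single cluster with one outlier. The sketch below uses made-up numbers and base R only; it illustrates the idea of a medoid rather than the full PAM algorithm.

```r
# Hypothetical one-dimensional cluster with a single outlier
x <- c(1, 2, 3, 4, 100)

centroid <- mean(x)    # K-means style center; need not be a data point

# Medoid: the observation with the smallest total distance to all the others
d <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d))]

centroid   # 22: pulled towards the outlier
medoid     # 3: stays in the bulk of the data
```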

The CLARA (Clustering Large Applications) method is an extension of the PAM algorithm. It uses a sampling approach in order to deal with datasets containing more than several thousand observations.

Hierarchical clustering can be divided into two types: agglomerative and divisive. In the agglomerative method each object is assigned to its own cluster at the beginning. Then, step by step, the algorithm joins the two most similar clusters until there is just a single cluster left. In the divisive method we start with one cluster that is split at each step. With hierarchical clustering it is not necessary to set the number of clusters beforehand.
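The following sketch shows the agglomerative approach on five made-up points using base R's `hclust`. The toy data and labels are assumptions for illustration; the point is that one fitted tree can be cut into any number of clusters afterwards.

```r
# Hypothetical toy data: five points on a line, forming two obvious groups
x <- matrix(c(1, 2, 3, 10, 11), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))

# Agglomerative clustering: start from singletons, merge the closest pair each step
tree <- hclust(dist(x), method = "average")

tree$merge            # the n - 1 merges, in the order they were performed
cutree(tree, k = 2)   # cut the same tree into 2 clusters ...
cutree(tree, k = 3)   # ... or into 3, without refitting
```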

In the further analysis K-means, PAM and agglomerative hierarchical clustering will be applied. The CLARA algorithm will not be used, as the Nations League teams dataset contains only a small number of objects.

K-means and PAM - appropriate number of clusters

In order to choose an appropriate number of clusters, the average silhouette width has been computed for a range of values of k.
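For each observation the silhouette width is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The sketch below computes the average width by hand in base R and evaluates it for several values of k. The helper `avg_silhouette` and the toy data are hypothetical and only illustrate the statistic; they are not the code behind the plots in this paper.

```r
# Average silhouette width of a clustering, computed by hand
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                   # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)           # mean distance within own cluster
    b <- min(tapply(d[i, !own], cl[!own], mean))   # mean distance to nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(1)
# Hypothetical toy data with two clear groups
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

widths <- sapply(2:5, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 10)$cluster)
})
names(widths) <- 2:5
widths
as.integer(names(which.max(widths)))   # k with the largest average width
```

Values close to 1 indicate compact, well-separated clusters, and the k with the largest average width is the one suggested by this criterion.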

nations_s_kmeans <- fviz_nbclust(nations_scale,kmeans,method = "silhouette") +ggtitle("K-MEANS")
nations_s_pam <- fviz_nbclust(nations_scale,pam,method = "silhouette")+ggtitle("PAM")
grid.arrange(nations_s_kmeans,nations_s_pam,ncol=2,top="Optimal number of clusters - silhouette width")

Both plots show that for the K-means and PAM methods the number of clusters should be set to 2.

K-means and PAM - 2 clusters

grid.arrange(kmeans_k2p,kmeans_k2s,ncol=2)

Clustering results:

Group 1: Austria, Belgium, Bosnia and Herzegovina, Croatia, Czech Republic, Denmark, England, France, Germany, Iceland, Italy, Netherlands, Norway, Portugal, Republic of Ireland, Russia, Scotland, Serbia, Spain, Sweden, Switzerland, Turkey, Ukraine, Wales

Group 2: Albania, Andorra, Armenia, Azerbaijan, Belarus, Bulgaria, Cyprus, Estonia, Faroe Islands, Finland, Georgia, Greece, Hungary, Israel, Kazakhstan, Kosovo, Latvia, Liechtenstein, Lithuania, Luxembourg, Malta, Moldova, Montenegro, Northern Ireland, Poland, Romania, Slovakia, Slovenia

The above lists show (based on the K-means clustering method) how European national football teams should be divided if there were only 2 divisions in the competition.

grid.arrange(pam_k2p,pam_k2s,ncol=2)

Clustering results:

Group 2: Austria, Belgium, Croatia, Denmark, England, France, Germany, Italy, Netherlands, Portugal, Russia, Scotland, Serbia, Spain, Turkey

Group 1: Albania, Andorra, Armenia, Azerbaijan, Belarus, Bosnia and Herzegovina, Bulgaria, Cyprus, Czech Republic, Estonia, Faroe Islands, Finland, Georgia, Greece, Hungary, Iceland, Israel, Kazakhstan, Kosovo, Latvia, Liechtenstein, Lithuania, Luxembourg, Malta, Moldova, Montenegro, Northern Ireland, Norway, Poland, Republic of Ireland, Romania, Slovakia, Slovenia, Sweden, Switzerland, Ukraine, Wales

The above lists show (based on the PAM clustering method) how European national football teams should be divided if there were only 2 divisions in the competition. The K-means and PAM algorithms give similar results. The most visible difference is the number of teams in each cluster.

K-means and PAM - 4 clusters

grid.arrange(kmeans_k4p,kmeans_k4s,ncol=2)

Clustering results:

Group 3: Belgium, Croatia, England, France, Germany, Italy, Netherlands, Portugal, Spain, Turkey

Group 4: Austria, Bosnia and Herzegovina, Czech Republic, Denmark, Finland, Greece, Iceland, Israel, Northern Ireland, Norway, Poland, Republic of Ireland, Romania, Russia, Scotland, Serbia, Slovakia, Slovenia, Sweden, Switzerland, Ukraine, Wales

Group 2: Belarus, Bulgaria, Estonia, Hungary, Latvia, Lithuania, Moldova, Montenegro

Group 1: Albania, Andorra, Armenia, Azerbaijan, Cyprus, Faroe Islands, Georgia, Kazakhstan, Kosovo, Liechtenstein, Luxembourg, Malta

There are visible differences between this division and actual groups of Nations League mentioned at the beginning of this paper. There are also many similarities, especially in the strongest and the weakest groups.
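As a quick check of this comparison for the strongest group, the sketch below uses base R set operations on two of the lists given in this paper: Division A from the Introduction and Group 3 from the K-means results above. The vector names are chosen here for illustration.

```r
# Division A as listed in the Introduction
division_a <- c("Belgium", "Croatia", "England", "France", "Germany", "Iceland",
                "Italy", "Netherlands", "Poland", "Portugal", "Spain", "Switzerland")
# K-means Group 3 from the 4-cluster results above
kmeans_group3 <- c("Belgium", "Croatia", "England", "France", "Germany",
                   "Italy", "Netherlands", "Portugal", "Spain", "Turkey")

intersect(division_a, kmeans_group3)   # teams in both (9 teams)
setdiff(kmeans_group3, division_a)     # in the cluster but not in Division A: Turkey
setdiff(division_a, kmeans_group3)     # in Division A but not in the cluster:
                                       # Iceland, Poland, Switzerland
```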

grid.arrange(pam_k4p,pam_k4s,ncol=2)

Clustering results:

Group 4: Belgium, Croatia, England, France, Germany, Italy, Netherlands, Portugal, Spain, Turkey, Scotland

Group 2: Austria, Bosnia and Herzegovina, Czech Republic, Denmark, Norway, Russia, Serbia, Sweden, Switzerland, Ukraine

Group 3: Belarus, Bulgaria, Estonia, Hungary, Latvia, Lithuania, Malta, Moldova, Montenegro, Greece, Poland, Romania, Slovenia

Group 1: Albania, Andorra, Armenia, Azerbaijan, Cyprus, Faroe Islands, Georgia, Kazakhstan, Kosovo, Liechtenstein, Luxembourg, Finland, Iceland, Israel, Northern Ireland, Republic of Ireland, Slovakia, Wales

The PAM clustering results differ much more from the actual Nations League divisions than the K-means results do. The average silhouette widths of the two solutions also indicate which algorithm divided the teams “better”.

Hierarchical clustering - agglomerative approach

Before applying a hierarchical clustering algorithm it is necessary to choose a linkage method. In this paper hierarchical clustering has been conducted using four different linkage methods.

Single linkage

Single linkage method defines the distance between two clusters as the distance between two closest objects from each cluster.

hier_sing <- eclust(nations_analysis,k=4,FUNcluster="hclust",hc_metric="euclidean",hc_method="single")
hier_sing$labels <- nations_list$Nationality
plot(hier_sing,cex=0.5,hang=-1,main="Single linkage")
rect.hclust(hier_sing,k=4,border=2:5)

The dendrogram indicates that the single linkage method is not appropriate for this research, as 3 of the clusters contain just one team each and the 4th cluster contains the remaining 49 teams.

Complete linkage

Complete linkage method defines the distance between two clusters as the distance between two least similar objects from each cluster.

hier_comp <- eclust(nations_analysis,k=4,FUNcluster="hclust",hc_metric="euclidean",hc_method="complete")
hier_comp$labels <- nations_list$Nationality
plot(hier_comp,cex=0.5,hang=-1,main="Complete linkage")
rect.hclust(hier_comp,k=4,border=2:5)

The complete linkage dendrogram looks much better, but there are still very visible differences in cluster size. The biggest cluster contains 33 teams while the smallest contains just 4.

Average linkage

Average linkage method defines the distance between two clusters as the average distance between every pair of objects from each cluster.

# average linkage
hier_avg <- eclust(nations_analysis,k=4,FUNcluster="hclust",hc_metric="euclidean",hc_method="average")
hier_avg$labels <- nations_list$Nationality
plot(hier_avg,cex=0.5,hang=-1,main="Average linkage")
rect.hclust(hier_avg,k=4,border=2:5)

The dendrogram created using the average linkage method doesn’t look good either. There are again big differences in size between the clusters.
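The single, complete and average linkage definitions given above can be compared on a pair of tiny clusters. The numbers below are made up for illustration; all three distances are derived from the same table of pairwise distances.

```r
# Two hypothetical clusters of points on a line
cluster_a <- c(1, 2)
cluster_b <- c(5, 9)

# All pairwise distances between a point of A (rows) and a point of B (columns)
pairwise <- abs(outer(cluster_a, cluster_b, "-"))

c(single   = min(pairwise),    # closest pair:  |2 - 5| = 3
  complete = max(pairwise),    # farthest pair: |1 - 9| = 8
  average  = mean(pairwise))   # mean of 4, 8, 3, 7 = 5.5
```

Because single linkage only needs one close pair to join two clusters, it tends to chain observations together, which is consistent with the unbalanced single linkage dendrogram described earlier.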

Ward’s method

Ward’s method aims to minimize the total within-cluster sum of squared distances of each observation from its cluster mean. At each step it merges the two clusters whose union increases this sum the least.
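The cost that Ward’s method assigns to a candidate merge can be computed directly. The sketch below, on two made-up clusters, computes the increase in the within-cluster sum of squares and checks it against the standard shortcut based on the cluster sizes and the distance between the cluster means. The helper `wss` and the data are assumptions for illustration.

```r
# Within-cluster sum of squared distances to the cluster mean
wss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Two hypothetical clusters in two dimensions
a <- rbind(c(0, 0), c(2, 0))
b <- rbind(c(6, 0), c(6, 2), c(6, 4))

# Increase in the total within-cluster sum of squares if a and b were merged
increase <- wss(rbind(a, b)) - (wss(a) + wss(b))

# The same quantity from the cluster sizes and the distance between the means
shortcut <- nrow(a) * nrow(b) / (nrow(a) + nrow(b)) *
  sum((colMeans(a) - colMeans(b))^2)

c(increase = increase, shortcut = shortcut)   # both 34.8
```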

# Wards method
hier_ward <- eclust(nations_analysis,k=4,FUNcluster="hclust",hc_metric="euclidean",hc_method="ward.D2")
hier_ward$labels <- nations_list$Nationality
plot(hier_ward,cex=0.5,hang=-1,main ="Ward's method")
rect.hclust(hier_ward,k=4,border=2:5)

The best-looking dendrogram was created using Ward’s method. There are 4 clearly separated clusters with more acceptable differences in size. The assignment of teams to clusters does not exactly reflect the Nations League divisions, but there are distinct similarities; for example, all 8 teams from the red cluster are in division A.

Conclusions

The first aim of this paper was to find out what number of divisions in the UEFA Nations League would be optimal based on player profiles. UEFA decided to divide the 55 national teams into 4 divisions. However, the clustering analysis performed here suggests that a division into 2 groups would be more appropriate. The next aim was to compare the actual assignment of teams to Nations League divisions A-D with the results obtained by applying various clustering algorithms with a fixed number of clusters k=4. On the one hand, the resulting assignments differ between the algorithms used, both in the size of the clusters and in the allocation of particular teams. On the other hand, there are many similarities between the resulting clusters and the actual Nations League divisions, which can be considered a success given that the whole study was based on a dataset from a game. The final conclusion is that Polish football fans should be very proud of their representatives: the Polish national team plays in the strongest division A, yet no clustering algorithm used in this study has assigned Poland together with the best teams.