Data Preparation and Understanding

The data comes from Kaggle. It covers the top scorers in the major football leagues and records, for each player, the country, league, club, matches played, substitutions, minutes, goals, expected goals (xG), and related shot statistics.

You can learn more about the data on its Kaggle page: https://www.kaggle.com/mohamedhanyyy/top-football-leagues-scorers?search=football/

We will use clustering to group the players into clusters. Then we will compare our result with a list of top strikers to see whether the clustering worked.

players <- read.csv("Data.csv", encoding = "UTF-8")
head(players)
##   Country  League  Club      Player.Names Matches_Played Substitution Mins
## 1   Spain La Liga (BET)   Juanmi Callejon             19           16 1849
## 2   Spain La Liga (BAR) Antoine Griezmann             36            0 3129
## 3   Spain La Liga (ATL)       Luis Suarez             34            1 2940
## 4   Spain La Liga (CAR)      Ruben Castro             32            3 2842
## 5   Spain La Liga (VAL)     Kevin Gameiro             21           10 1745
## 6   Spain La Liga (JUV) Cristiano Ronaldo             29            0 2634
##   Goals    xG xG.Per.Avg.Match Shots OnTarget Shots.Per.Avg.Match
## 1    11  6.62             0.34    48       20                2.47
## 2    16 11.86             0.36    88       41                2.67
## 3    28 23.21             0.75   120       57                3.88
## 4    13 14.06             0.47   117       42                3.91
## 5    13 10.65             0.58    50       23                2.72
## 6    25 24.68             0.89   162       60                5.84
##   On.Target.Per.Avg.Match Year
## 1                    1.03 2016
## 2                    1.24 2016
## 3                    1.84 2016
## 4                    1.40 2016
## 5                    1.25 2016
## 6                    2.16 2016

There are many columns in our data. We need to keep only the more important ones, those that reflect the players’ form more accurately.

players <- players[,c(4,5,6,7,8,11,12,13,14,15)]
head(players)
##        Player.Names Matches_Played Substitution Mins Goals Shots OnTarget
## 1   Juanmi Callejon             19           16 1849    11    48       20
## 2 Antoine Griezmann             36            0 3129    16    88       41
## 3       Luis Suarez             34            1 2940    28   120       57
## 4      Ruben Castro             32            3 2842    13   117       42
## 5     Kevin Gameiro             21           10 1745    13    50       23
## 6 Cristiano Ronaldo             29            0 2634    25   162       60
##   Shots.Per.Avg.Match On.Target.Per.Avg.Match Year
## 1                2.47                    1.03 2016
## 2                2.67                    1.24 2016
## 3                3.88                    1.84 2016
## 4                3.91                    1.40 2016
## 5                2.72                    1.25 2016
## 6                5.84                    2.16 2016
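Selecting columns by position works, but it breaks silently if the file's column order ever changes. A minimal sketch of the same selection by name follows; `players_demo` and `keep` are illustrative names, and the two-row data frame is a toy stand-in that only mirrors the column names shown above.

```r
# Toy stand-in for the full data set; only the column names mirror the source.
players_demo <- data.frame(
  Country                 = c("Spain", "Spain"),
  League                  = c("La Liga", "La Liga"),
  Club                    = c("(BAR)", "(ATL)"),
  Player.Names            = c("Antoine Griezmann", "Luis Suarez"),
  Matches_Played          = c(36, 34),
  Substitution            = c(0, 1),
  Mins                    = c(3129, 2940),
  Goals                   = c(16, 28),
  xG                      = c(11.86, 23.21),
  xG.Per.Avg.Match        = c(0.36, 0.75),
  Shots                   = c(88, 120),
  OnTarget                = c(41, 57),
  Shots.Per.Avg.Match     = c(2.67, 3.88),
  On.Target.Per.Avg.Match = c(1.24, 1.84),
  Year                    = c(2016, 2016)
)

keep <- c("Player.Names", "Matches_Played", "Substitution", "Mins", "Goals",
          "Shots", "OnTarget", "Shots.Per.Avg.Match",
          "On.Target.Per.Avg.Match", "Year")

# Fail early if a column was renamed or is missing in the file.
stopifnot(all(keep %in% names(players_demo)))

by_name  <- players_demo[, keep]
by_index <- players_demo[, c(4, 5, 6, 7, 8, 11, 12, 13, 14, 15)]

# Both selections should pick exactly the same ten columns.
identical(by_name, by_index)
```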

Even though these statistics tell us more about each player, it is hard to judge from them alone who the better strikers are. In this project, I will use clustering to identify the best strikers.

Let’s first look at the general structure of our data.

dim(players)
## [1] 660  10
str(players)
## 'data.frame':    660 obs. of  10 variables:
##  $ Player.Names           : chr  "Juanmi Callejon" "Antoine Griezmann" "Luis Suarez" "Ruben Castro" ...
##  $ Matches_Played         : int  19 36 34 32 21 29 23 30 25 31 ...
##  $ Substitution           : int  16 0 1 3 10 0 6 0 7 7 ...
##  $ Mins                   : int  1849 3129 2940 2842 1745 2634 1967 2694 2354 2904 ...
##  $ Goals                  : int  11 16 28 13 13 25 11 13 19 11 ...
##  $ Shots                  : int  48 88 120 117 50 162 69 105 78 64 ...
##  $ OnTarget               : int  20 41 57 42 23 60 34 42 37 26 ...
##  $ Shots.Per.Avg.Match    : num  2.47 2.67 3.88 3.91 2.72 5.84 3.33 3.7 3.15 2.09 ...
##  $ On.Target.Per.Avg.Match: num  1.03 1.24 1.84 1.4 1.25 2.16 1.64 1.48 1.49 0.85 ...
##  $ Year                   : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
summary(players)
##  Player.Names       Matches_Played   Substitution         Mins     
##  Length:660         Min.   : 2.00   Min.   : 0.000   Min.   : 264  
##  Class :character   1st Qu.:14.00   1st Qu.: 0.000   1st Qu.:1364  
##  Mode  :character   Median :24.00   Median : 2.000   Median :2246  
##                     Mean   :22.37   Mean   : 3.224   Mean   :2071  
##                     3rd Qu.:31.00   3rd Qu.: 5.000   3rd Qu.:2822  
##                     Max.   :38.00   Max.   :26.000   Max.   :4177  
##      Goals           Shots           OnTarget      Shots.Per.Avg.Match
##  Min.   : 2.00   Min.   :  5.00   Min.   :  2.00   Min.   :0.800      
##  1st Qu.: 8.00   1st Qu.: 37.75   1st Qu.: 17.00   1st Qu.:2.335      
##  Median :11.00   Median : 62.00   Median : 26.00   Median :2.845      
##  Mean   :11.78   Mean   : 64.18   Mean   : 28.37   Mean   :2.948      
##  3rd Qu.:14.00   3rd Qu.: 86.00   3rd Qu.: 37.00   3rd Qu.:3.382      
##  Max.   :37.00   Max.   :208.00   Max.   :102.00   Max.   :7.200      
##  On.Target.Per.Avg.Match      Year     
##  Min.   :0.240           Min.   :2016  
##  1st Qu.:0.980           1st Qu.:2017  
##  Median :1.250           Median :2019  
##  Mean   :1.316           Mean   :2018  
##  3rd Qu.:1.540           3rd Qu.:2019  
##  Max.   :3.630           Max.   :2020

The data consists of player statistics for 2016 through 2020, which is why there are duplicates in the “Player.Names” column. For more accurate results, let’s extract the player statistics for each year separately.

players_2016 <- players[players$Year == 2016, -10]
players_2017 <- players[players$Year == 2017, -10]
players_2018 <- players[players$Year == 2018, -10]
players_2019 <- players[players$Year == 2019, -10]
players_2020 <- players[players$Year == 2020, -10]
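The five assignments above can also be expressed as one `split()` call that returns a list with one data frame per season. The sketch below assumes a toy data frame (`players_demo`, with invented names and values); it is an alternative layout, not the one the rest of this report uses.

```r
# Toy data: invented players and goals, two seasons.
players_demo <- data.frame(
  Player.Names = c("A", "B", "A", "C"),
  Goals        = c(10, 12, 15, 8),
  Year         = c(2016, 2016, 2017, 2017)
)

# One data frame per season, keyed by year. The Year column is dropped
# because it is constant within each piece.
by_year <- split(players_demo[, names(players_demo) != "Year"],
                 players_demo$Year)

names(by_year)
by_year[["2017"]]
```

A list like `by_year` makes it possible to apply the same step to every season with `lapply()` instead of repeating it five times.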

Finding optimal number of clusters

I will determine the number of clusters with the elbow method.

library(ClusterR)
## Loading required package: gtools
opt2016 <- Optimal_Clusters_KMeans(players_2016[,2:9], max_clusters=10, plot_clusters = TRUE)

opt2017 <- Optimal_Clusters_KMeans(players_2017[,2:9], max_clusters=10, plot_clusters = TRUE)

opt2018 <- Optimal_Clusters_KMeans(players_2018[,2:9], max_clusters=10, plot_clusters = TRUE)

opt2019 <- Optimal_Clusters_KMeans(players_2019[,2:9], max_clusters=10, plot_clusters = TRUE)

opt2020 <- Optimal_Clusters_KMeans(players_2020[,2:9], max_clusters=10, plot_clusters = TRUE)

As the graphs show, the optimal number of clusters for each season is 3. Let’s also plot the total within-cluster sum of squares (tot.withinss) to double-check the number of clusters.

wss_2016 <- sapply(1:10, function(k){kmeans(players_2016[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2016, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

wss_2017 <- sapply(1:10, function(k){kmeans(players_2017[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2017, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

wss_2018 <- sapply(1:10, function(k){kmeans(players_2018[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2018, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

wss_2019 <- sapply(1:10, function(k){kmeans(players_2019[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2019, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

wss_2020 <- sapply(1:10, function(k){kmeans(players_2020[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2020, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

These graphs also confirm that the optimal number of clusters is 3.
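Note that the elbow plots above are computed on the raw columns, where a wide-ranging variable such as Mins can dominate the sum of squares, while the clustering in the next section runs on standardized data. A minimal sketch of the same elbow calculation on standardized input follows. The data is synthetic, and the seed, `nstart`, and `iter.max` values are illustrative assumptions.

```r
set.seed(42)  # k-means starts from random centres

# Synthetic data: three well-separated groups in two features.
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)
x_scaled <- scale(x)

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(x_scaled, centers = k, nstart = 25, iter.max = 30)$tot.withinss
})

# Reduction in WSS gained by each additional cluster; the elbow is where
# these reductions become small.
drops <- -diff(wss)

plot(1:8, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```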

Building clusters using K-means

Let’s first standardize our data so that every variable has mean 0 and standard deviation 1.

pl_scale_2016 <- as.data.frame(lapply(players_2016[,2:9], scale))
pl_scale_2017 <- as.data.frame(lapply(players_2017[,2:9], scale))
pl_scale_2018 <- as.data.frame(lapply(players_2018[,2:9], scale))
pl_scale_2019 <- as.data.frame(lapply(players_2019[,2:9], scale))
pl_scale_2020 <- as.data.frame(lapply(players_2020[,2:9], scale))
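To make the transformation concrete, the sketch below applies `scale()` to the Mins and Goals values of the first six rows shown earlier and compares the result with the z-score formula written out by hand. The names `stats_demo`, `scaled`, and `manual_goals` are illustrative.

```r
# Mins and Goals for the first six players printed by head(players).
stats_demo <- data.frame(
  Mins  = c(1849, 3129, 2940, 2842, 1745, 2634),
  Goals = c(11, 16, 28, 13, 13, 25)
)

# scale() subtracts each column's mean and divides by its standard deviation.
scaled <- as.data.frame(scale(stats_demo))

# The same transformation written out by hand for one column.
manual_goals <- (stats_demo$Goals - mean(stats_demo$Goals)) / sd(stats_demo$Goals)

all.equal(scaled$Goals, manual_goals)
round(colMeans(scaled), 10)  # column means after scaling
sapply(scaled, sd)           # column standard deviations after scaling
```

Without this step, Mins (in the thousands) would outweigh Goals (in the tens) in every distance that k-means computes.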

Now let’s build the clusters with the k-means algorithm.

2016 Players

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
clust2016 <- kmeans(pl_scale_2016, centers = 3)
players_2016$cluster <- clust2016$cluster

plot(players_2016[,c(2,6)], col = players_2016$cluster, pch=".", cex=3)

fviz_cluster(list(data=players_2016[,c(2,6)], cluster=clust2016$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())

head(players_2016)
##        Player.Names Matches_Played Substitution Mins Goals Shots OnTarget
## 1   Juanmi Callejon             19           16 1849    11    48       20
## 2 Antoine Griezmann             36            0 3129    16    88       41
## 3       Luis Suarez             34            1 2940    28   120       57
## 4      Ruben Castro             32            3 2842    13   117       42
## 5     Kevin Gameiro             21           10 1745    13    50       23
## 6 Cristiano Ronaldo             29            0 2634    25   162       60
##   Shots.Per.Avg.Match On.Target.Per.Avg.Match cluster
## 1                2.47                    1.03       3
## 2                2.67                    1.24       2
## 3                3.88                    1.84       1
## 4                3.91                    1.40       2
## 5                2.72                    1.25       3
## 6                5.84                    2.16       1
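The cluster ids printed above are arbitrary labels: k-means starts from random centres, so the same group of players can receive a different number on each run or in each season. The sketch below shows one possible way to make the labels reproducible and comparable, by fixing a seed, using several random starts, and renumbering clusters by mean Goals. It uses synthetic data, and the seed, `nstart` value, and the choice of Goals as the ordering variable are all assumptions made for illustration.

```r
set.seed(123)  # fixes the random starting centres

# Synthetic season: 60 players, two features, three loose groups.
demo <- data.frame(
  Goals = c(rnorm(20, 6, 1),  rnorm(20, 14, 1), rnorm(20, 25, 1)),
  Shots = c(rnorm(20, 30, 4), rnorm(20, 70, 4), rnorm(20, 130, 4))
)

fit <- kmeans(scale(demo), centers = 3, nstart = 25)

# Rank the raw cluster ids by mean Goals, highest first, so that label 1
# always means "highest-scoring group" regardless of the random start.
mean_goals <- tapply(demo$Goals, fit$cluster, mean)
relabel    <- setNames(rank(-as.vector(mean_goals)), names(mean_goals))
demo$cluster <- as.integer(relabel[as.character(fit$cluster)])

# Mean Goals per relabelled cluster.
tapply(demo$Goals, demo$cluster, mean)
```

With labels ordered this way, a cluster number carries the same meaning in every season, which makes year-to-year comparisons easier to read.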

2017 Players

clust2017 <- kmeans(pl_scale_2017, centers = 3)
players_2017$cluster <- clust2017$cluster

plot(players_2017[,c(2,6)], col = players_2017$cluster, pch=".", cex=3)

fviz_cluster(list(data=players_2017[,c(2,6)], cluster=clust2017$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

2018 Players

clust2018 <- kmeans(pl_scale_2018, centers = 3)
players_2018$cluster <- clust2018$cluster

plot(players_2018[,c(2,6)], col = players_2018$cluster, pch=".", cex=3)

fviz_cluster(list(data=players_2018[,c(2,6)], cluster=clust2018$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

2019 Players

clust2019 <- kmeans(pl_scale_2019, centers = 3)
players_2019$cluster <- clust2019$cluster

plot(players_2019[,c(2,6)], col = players_2019$cluster, pch=".", cex=3)

fviz_cluster(list(data=players_2019[,c(2,6)], cluster=clust2019$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

2020 Players

clust2020 <- kmeans(pl_scale_2020, centers = 3)
players_2020$cluster <- clust2020$cluster

plot(players_2020[,c(2,6)], col = players_2020$cluster, pch=".", cex=3)

fviz_cluster(list(data=players_2020[,c(2,6)], cluster=clust2020$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

In the graphs I have used two columns, shots and matches played, to visualize the results. You can see that there is no clear border between the clusters. However, please remember that the clusters were built from all eight variables, not just these two. We can check the quality of our clusters with a different method, which I will discuss in the next section.
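One common numeric complement to a two-column scatter plot is the average silhouette width, which uses all variables at once. It is offered here as an illustrative option and is not the check this report goes on to use. The sketch assumes the `cluster` package is installed (it ships with standard R distributions) and runs on synthetic data.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Synthetic, standardized data with three separated groups.
x <- scale(rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 5),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
))

fit <- kmeans(x, centers = 3, nstart = 25)

# Silhouette width per observation: near 1 = well inside its cluster,
# near 0 = on a border, negative = probably closer to another cluster.
sil <- silhouette(fit$cluster, dist(x))
avg_width <- mean(sil[, "sil_width"])
avg_width
```

An average width close to 0 would support the visual impression of overlapping clusters, while a value closer to 1 would indicate clear separation.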

Results

To analyze the quality of our clusters and determine which cluster holds the best players, let’s first create a data frame of top strikers. I have taken the names from the “Ballon d’Or 2021” results list on the following website.

https://www.sportingnews.com/ca/soccer/news/ballon-dor-results-live-updates-rankings-soccer-top-awards/15qzw8nygk8x31f8pl1ealifyl

top_players <- c("Lionel Messi", "Robert Lewandowski","Karim Benzema", "Cristiano Ronaldo", "Mohamed Salah", "Kylian Mbappe-Lottin", "Erling Haaland", "Romelu Lukaku", "Raheem Sterling", "Neymar", "Luis Suarez", "Riyad Mahrez", "Lautaro Martinez", "Harry Kane", "Gerard Moreno")

top_players <- data.frame(top_players)
colnames(top_players)[1] <- "Players"

Before proceeding, we need to combine our results to see which cluster each player belongs to in each year.

result <- data.frame(unique(players$Player.Names))
colnames(result)[1] <- "Players"

result <- merge(x = result, y = players_2016[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE) 
colnames(result)[2] <- "Cluster_2016"

result <- merge(x = result, y = players_2017[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE) 
colnames(result)[3] <- "Cluster_2017"

result <- merge(x = result, y = players_2018[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE) 
colnames(result)[4] <- "Cluster_2018"

result <- merge(x = result, y = players_2019[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE) 
colnames(result)[5] <- "Cluster_2019"

result <- merge(x = result, y = players_2020[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE) 
colnames(result)[6] <- "Cluster_2020"
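The repeated merge-and-rename steps above can be folded into a single `Reduce()` call over a list of per-season results. The sketch below uses invented players and cluster ids; `seasons`, `renamed`, and `result_demo` are illustrative names. It performs a full outer join, which should give the same rows as starting from the list of unique names when every player appears in at least one season.

```r
# Toy per-season results; the names and cluster ids are invented.
seasons <- list(
  "2016" = data.frame(Player.Names = c("A", "B"), cluster = c(1, 2)),
  "2017" = data.frame(Player.Names = c("A", "C"), cluster = c(2, 3)),
  "2018" = data.frame(Player.Names = c("B", "C"), cluster = c(3, 3))
)

# Give each season's cluster column a unique name before merging.
renamed <- Map(function(df, year) {
  names(df) <- c("Players", paste0("Cluster_", year))
  df
}, seasons, names(seasons))

# Full outer join across all seasons; players missing from a season get NA.
result_demo <- Reduce(function(x, y) merge(x, y, by = "Players", all = TRUE),
                      renamed)
result_demo
```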

Now, let’s merge our results with the top players to determine their clusters in each year.

final_result <- merge(x = top_players, y = result, by = "Players")
final_result
##                 Players Cluster_2016 Cluster_2017 Cluster_2018 Cluster_2019
## 1     Cristiano Ronaldo            1            2            3            3
## 2        Erling Haaland           NA           NA           NA            2
## 3         Gerard Moreno            2            3           NA            1
## 4            Harry Kane            1           NA            3            1
## 5         Karim Benzema            3           NA            2            3
## 6  Kylian Mbappe-Lottin           NA           NA            3            3
## 7          Lionel Messi            1            2            3            3
## 8           Luis Suarez            1            2            3            3
## 9         Mohamed Salah            2           NA            3            3
## 10      Raheem Sterling           NA           NA            2            3
## 11         Riyad Mahrez           NA           NA           NA            1
## 12   Robert Lewandowski            1            2            3            3
## 13        Romelu Lukaku            2           NA           NA            3
##    Cluster_2020
## 1             1
## 2             1
## 3             1
## 4             1
## 5             1
## 6             1
## 7             1
## 8             1
## 9             1
## 10           NA
## 11            1
## 12            1
## 13            1

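As you can see, every top striker with a 2020 record was in cluster 1 in 2020 (Raheem Sterling has no 2020 entry). Keep in mind that k-means assigns cluster numbers arbitrarily in each run, so cluster 1 in 2020 need not correspond to cluster 1 in other years. Let’s find out which other players were also in cluster 1 in 2020.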

clust1_2020 <- players_2020$Player.Names[players_2020$cluster == 1]
clust1_2020
##   [1] "Youssef   En-Nesyri"   "Portu "                "Karim Benzema"        
##   [4] "Carlos Soler"          "Cristian Tello"        "Mikel Oyarzabal"      
##   [7] "Esteban Burgos"        "Lucas Perez"           "Lionel Messi"         
##  [10] "Joao Felix"            "Luis Suarez"           "Paco Alcacer"         
##  [13] "Morales "              "Federico Valverde"     "Antoine Griezmann"    
##  [16] "Iago Aspas"            "Kike GarcIa"           "Angel Rodriguez"      
##  [19] "Ansu Fati"             "Gerard Moreno"         "Roberto Soriano"      
##  [22] "Francesco Caputo"      "Joao Pedro"            "Alejandro Gomez"      
##  [25] "Andrea Belotti"        "Gaetano Castrovilli"   "Giovanni Simeone"     
##  [28] "Domenico Berardi"      "Henrikh Mkhitaryan"    "Jordan Veretout"      
##  [31] "Gervinho "             "Lautaro MartInez"      "Luis Muriel"          
##  [34] "Cristiano Ronaldo"     "Romelu Lukaku"         "Dries Mertens"        
##  [37] "Ciro Immobile"         "Fabio Quagliarella"    "Zlatan Ibrahimovic"   
##  [40] "Hirving Lozano"        "Lucas Alario"          "Bas Dost"             
##  [43] "Serge Gnabry"          "Wout Weghorst"         "Dani Olmo"            
##  [46] "Ellyes Skhiri"         "Thomas Muller"         "Max Kruse"            
##  [49] "Andre Silva"           "Andre Hahn"            "Jean-Philippe Mateta" 
##  [52] "Andrej Kramaric"       "Niclas Fullkrug"       "Lars Stindl"          
##  [55] "Daniel Caligiuri"      "Erling Haaland"        "Jhon Cordoba"         
##  [58] "Nils Petersen"         "Matheus Cunha"         "Robert Lewandowski"   
##  [61] "Ludovic Blas"          "Stephane Bahoken"      "Karl Toko"            
##  [64] "Kevin Volland"         "Andy Delort"           "Burak Yilmaz"         
##  [67] "Ibrahima Niane"        "Boulaye Dia"           "Moise Kean"           
##  [70] "Ignatius Ganago"       "Irvin Cardona"         "Wissam Ben"           
##  [73] "Florian Thauvin"       "Amine Gouiri"          "Jonathan Bamba"       
##  [76] "Ludovic Ajorque"       "Kylian Mbappe-Lottin"  "Mama Balde"           
##  [79] "Memphis Depay"         "Gael Kakuta"           "James Ward-Prowse"    
##  [82] "Bruno Fernandes"       "Dominic Calvert-Lewin" "Timo Werner"          
##  [85] "Callum Wilson"         "Diogo Jota"            "Wilfried Zaha"        
##  [88] "Jack Grealish"         "Raul Jimenez"          "Jarrod Bowen"         
##  [91] "Patrick Bamford"       "Mohamed Salah"         "Jamie Vardy"          
##  [94] "Harry Kane"            "Danny Ings "           "Neal Maupay"          
##  [97] "Ollie Watkins"         "Sadio Mane"            "Riyad Mahrez"         
## [100] "Son Heung-Min"         "Raphael Veiga"         "Alerrandro "
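The merged table of top strikers has 13 rows although 15 names were supplied, because `merge()` keeps only exact matches. The list above shows spellings such as "Lautaro MartInez" and names with trailing spaces such as "Danny Ings ", either of which would prevent a match. A minimal sketch of how such mismatches could be detected is given below; the vectors are toy examples and `normalise` is a hypothetical helper, not part of the original analysis.

```r
# Toy vectors: names as typed by hand vs. names as they appear in a data file.
top_demo  <- c("Lautaro Martinez", "Harry Kane", "Neymar")
data_demo <- c("Lautaro MartInez", "Harry Kane ", "Danny Ings ")

# Names that an exact-match merge() would silently drop.
dropped <- setdiff(top_demo, data_demo)
dropped

# Hypothetical normaliser: trim surrounding whitespace and ignore letter case.
normalise <- function(x) tolower(trimws(x))

# Names that still have no counterpart after normalisation.
unmatched <- top_demo[!normalise(top_demo) %in% normalise(data_demo)]
unmatched
```

Names that remain unmatched after normalisation are probably absent from the data altogether, rather than just spelled differently.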

Conclusions

The other players in this cluster may also be strong strikers. The list could be made more accurate by including more player statistics. An approach like this can help talent managers and coaches find new, talented strikers.