The data come from Kaggle. The dataset lists the top scorers of the major football leagues, with each player's country, league, club, matches played, substitutions, minutes, goals, xG, and related shooting statistics.
You can learn more about the data at: https://www.kaggle.com/mohamedhanyyy/top-football-leagues-scorers?search=football/
We will use clustering to group the players. Then we will compare the result against a list of top strikers to see whether the clustering worked.
players <- read.csv2("Data.csv", sep = ",", dec = ".", encoding = "UTF-8")
head(players)
## Country League Club Player.Names Matches_Played Substitution Mins
## 1 Spain La Liga (BET) Juanmi Callejon 19 16 1849
## 2 Spain La Liga (BAR) Antoine Griezmann 36 0 3129
## 3 Spain La Liga (ATL) Luis Suarez 34 1 2940
## 4 Spain La Liga (CAR) Ruben Castro 32 3 2842
## 5 Spain La Liga (VAL) Kevin Gameiro 21 10 1745
## 6 Spain La Liga (JUV) Cristiano Ronaldo 29 0 2634
## Goals xG xG.Per.Avg.Match Shots OnTarget Shots.Per.Avg.Match
## 1 11 6.62 0.34 48 20 2.47
## 2 16 11.86 0.36 88 41 2.67
## 3 28 23.21 0.75 120 57 3.88
## 4 13 14.06 0.47 117 42 3.91
## 5 13 10.65 0.58 50 23 2.72
## 6 25 24.68 0.89 162 60 5.84
## On.Target.Per.Avg.Match Year
## 1 1.03 2016
## 2 1.24 2016
## 3 1.84 2016
## 4 1.40 2016
## 5 1.25 2016
## 6 2.16 2016
The data has many columns. We need to keep the columns that reflect the players’ form most accurately.
players <- players[,c(4,5,6,7,8,11,12,13,14,15)]
head(players)
## Player.Names Matches_Played Substitution Mins Goals Shots OnTarget
## 1 Juanmi Callejon 19 16 1849 11 48 20
## 2 Antoine Griezmann 36 0 3129 16 88 41
## 3 Luis Suarez 34 1 2940 28 120 57
## 4 Ruben Castro 32 3 2842 13 117 42
## 5 Kevin Gameiro 21 10 1745 13 50 23
## 6 Cristiano Ronaldo 29 0 2634 25 162 60
## Shots.Per.Avg.Match On.Target.Per.Avg.Match Year
## 1 2.47 1.03 2016
## 2 2.67 1.24 2016
## 3 3.88 1.84 2016
## 4 3.91 1.40 2016
## 5 2.72 1.25 2016
## 6 5.84 2.16 2016
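Selecting columns by position breaks silently if the file layout changes. As a minimal sketch (the toy table below is only a stand-in built from the first two rows shown above, and the list of kept columns is shortened for illustration), the same selection can be written with column names:

```r
# Toy stand-in for the scorers table (values from the first two rows above)
toy <- data.frame(
  Country        = c("Spain", "Spain"),
  Player.Names   = c("Juanmi Callejon", "Antoine Griezmann"),
  Matches_Played = c(19L, 36L),
  Goals          = c(11L, 16L),
  xG             = c(6.62, 11.86),
  Year           = c(2016L, 2016L)
)

# Columns to keep, named explicitly instead of by position
keep <- c("Player.Names", "Matches_Played", "Goals", "Year")

# Fail early if a column is missing from the file
stopifnot(all(keep %in% names(toy)))

toy_small <- toy[, keep]
names(toy_small)
```

With the real data, `keep` would list all ten column names that appear in the output above.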
Even though these statistics tell us more about the players, it is difficult to see from them who the better strikers are. In this project, I will use clustering to identify the best strikers.
Let’s first look at the general structure of our data.
dim(players)
## [1] 660 10
str(players)
## 'data.frame': 660 obs. of 10 variables:
## $ Player.Names : chr "Juanmi Callejon" "Antoine Griezmann" "Luis Suarez" "Ruben Castro" ...
## $ Matches_Played : int 19 36 34 32 21 29 23 30 25 31 ...
## $ Substitution : int 16 0 1 3 10 0 6 0 7 7 ...
## $ Mins : int 1849 3129 2940 2842 1745 2634 1967 2694 2354 2904 ...
## $ Goals : int 11 16 28 13 13 25 11 13 19 11 ...
## $ Shots : int 48 88 120 117 50 162 69 105 78 64 ...
## $ OnTarget : int 20 41 57 42 23 60 34 42 37 26 ...
## $ Shots.Per.Avg.Match : num 2.47 2.67 3.88 3.91 2.72 5.84 3.33 3.7 3.15 2.09 ...
## $ On.Target.Per.Avg.Match: num 1.03 1.24 1.84 1.4 1.25 2.16 1.64 1.48 1.49 0.85 ...
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
summary(players)
## Player.Names Matches_Played Substitution Mins
## Length:660 Min. : 2.00 Min. : 0.000 Min. : 264
## Class :character 1st Qu.:14.00 1st Qu.: 0.000 1st Qu.:1364
## Mode :character Median :24.00 Median : 2.000 Median :2246
## Mean :22.37 Mean : 3.224 Mean :2071
## 3rd Qu.:31.00 3rd Qu.: 5.000 3rd Qu.:2822
## Max. :38.00 Max. :26.000 Max. :4177
## Goals Shots OnTarget Shots.Per.Avg.Match
## Min. : 2.00 Min. : 5.00 Min. : 2.00 Min. :0.800
## 1st Qu.: 8.00 1st Qu.: 37.75 1st Qu.: 17.00 1st Qu.:2.335
## Median :11.00 Median : 62.00 Median : 26.00 Median :2.845
## Mean :11.78 Mean : 64.18 Mean : 28.37 Mean :2.948
## 3rd Qu.:14.00 3rd Qu.: 86.00 3rd Qu.: 37.00 3rd Qu.:3.382
## Max. :37.00 Max. :208.00 Max. :102.00 Max. :7.200
## On.Target.Per.Avg.Match Year
## Min. :0.240 Min. :2016
## 1st Qu.:0.980 1st Qu.:2017
## Median :1.250 Median :2019
## Mean :1.316 Mean :2018
## 3rd Qu.:1.540 3rd Qu.:2019
## Max. :3.630 Max. :2020
The data consist of player statistics for the years 2016 to 2020, which is why the same name can appear several times in the “Player.Names” column. To keep the seasons separate, let’s extract the player statistics for each year (the index -10 drops the Year column).
players_2016 <- players[players$Year == 2016, -10]
players_2017 <- players[players$Year == 2017, -10]
players_2018 <- players[players$Year == 2018, -10]
players_2019 <- players[players$Year == 2019, -10]
players_2020 <- players[players$Year == 2020, -10]
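The five assignments above repeat the same step once per year. A possible alternative, sketched here on a small toy table with hypothetical players and values, is to split the data frame into a named list with one element per season:

```r
# Toy data: one row per player and season (hypothetical values)
toy <- data.frame(
  Player.Names = c("A", "B", "A", "C"),
  Goals        = c(10, 12, 15, 8),
  Year         = c(2016, 2016, 2017, 2017)
)

# Split into a named list of data frames, one per season,
# and drop the Year column from each piece
by_year <- lapply(split(toy, toy$Year), function(d) d[, names(d) != "Year"])

names(by_year)     # season labels, stored as character strings
by_year[["2017"]]  # the rows of the 2017 season
```

The rest of this analysis keeps the separate `players_2016` ... `players_2020` objects; the list form is only a more compact way to hold the same data.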
I will determine the number of clusters using the elbow method.
library(ClusterR)
## Loading required package: gtools
opt2016 <- Optimal_Clusters_KMeans(players_2016[,2:9], max_clusters=10, plot_clusters = TRUE)
opt2017 <- Optimal_Clusters_KMeans(players_2017[,2:9], max_clusters=10, plot_clusters = TRUE)
opt2018 <- Optimal_Clusters_KMeans(players_2018[,2:9], max_clusters=10, plot_clusters = TRUE)
opt2019 <- Optimal_Clusters_KMeans(players_2019[,2:9], max_clusters=10, plot_clusters = TRUE)
opt2020 <- Optimal_Clusters_KMeans(players_2020[,2:9], max_clusters=10, plot_clusters = TRUE)
As you can see from the graphs, the optimal number of clusters for each year is 3. Let’s also use the total within-cluster sum of squares (tot.withinss) to determine the number of clusters.
wss_2016 <- sapply(1:10, function(k){kmeans(players_2016[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2016, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
wss_2017 <- sapply(1:10, function(k){kmeans(players_2017[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2017, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
wss_2018 <- sapply(1:10, function(k){kmeans(players_2018[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2018, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
wss_2019 <- sapply(1:10, function(k){kmeans(players_2019[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2019, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
wss_2020 <- sapply(1:10, function(k){kmeans(players_2020[,2:9], k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:10, wss_2020, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
These graphs also confirm that the optimal number of clusters is 3.
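One caveat: the elbow curves above are computed on the raw columns, where Mins (hundreds to thousands) weighs far more in the Euclidean distance than the per-match rates, while the clustering itself is fitted on standardized columns. A minimal sketch of the same elbow computation on standardized data follows; the data are synthetic and the seed is arbitrary, so this only illustrates the procedure.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Synthetic stand-in: three groups in two features on very different scales
toy <- data.frame(
  mins  = c(rnorm(30, 500, 50), rnorm(30, 2000, 50), rnorm(30, 3500, 50)),
  goals = c(rnorm(30, 5, 1),    rnorm(30, 12, 1),    rnorm(30, 25, 1))
)

toy_scaled <- scale(toy)  # mean 0 and standard deviation 1 per column

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25, iter.max = 30)$tot.withinss
})

plot(1:6, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares (scaled data)")
```

If the elbow stays at the same K on the standardized data, the choice of 3 clusters is on firmer ground.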
Let’s first standardize our data, so that every column has mean 0 and standard deviation 1.
pl_scale_2016 <- as.data.frame(lapply(players_2016[,2:9], scale))
pl_scale_2017 <- as.data.frame(lapply(players_2017[,2:9], scale))
pl_scale_2018 <- as.data.frame(lapply(players_2018[,2:9], scale))
pl_scale_2019 <- as.data.frame(lapply(players_2019[,2:9], scale))
pl_scale_2020 <- as.data.frame(lapply(players_2020[,2:9], scale))
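As a quick sanity check, here is a small sketch on the goals and shots of the first four rows shown earlier. It passes the whole data frame to `scale()` at once, which should give the same numbers as the column-by-column `lapply` form above, and then confirms the result:

```r
# Goals and shots of the first four players in the table above
toy <- data.frame(Goals = c(11, 16, 28, 13), Shots = c(48, 88, 120, 117))

# scale() accepts a whole data frame and standardizes each column
toy_scaled <- as.data.frame(scale(toy))

# Each column should now have mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```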
Now let’s build the clusters using the k-means algorithm.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
clust2016 <- kmeans(pl_scale_2016, centers = 3)
players_2016$cluster <- clust2016$cluster
plot(players_2016[,c(2,6)], col = players_2016$cluster, pch=".", cex=3)
fviz_cluster(list(data=players_2016[,c(2,6)], cluster=clust2016$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
head(players_2016)
## Player.Names Matches_Played Substitution Mins Goals Shots OnTarget
## 1 Juanmi Callejon 19 16 1849 11 48 20
## 2 Antoine Griezmann 36 0 3129 16 88 41
## 3 Luis Suarez 34 1 2940 28 120 57
## 4 Ruben Castro 32 3 2842 13 117 42
## 5 Kevin Gameiro 21 10 1745 13 50 23
## 6 Cristiano Ronaldo 29 0 2634 25 162 60
## Shots.Per.Avg.Match On.Target.Per.Avg.Match cluster
## 1 2.47 1.03 3
## 2 2.67 1.24 2
## 3 3.88 1.84 1
## 4 3.91 1.40 2
## 5 2.72 1.25 3
## 6 5.84 2.16 1
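Keep in mind that `kmeans` starts from random centers, so the cluster numbers are arbitrary: they can change between runs and do not mean the same thing in different years. A minimal sketch of one way to handle this, on synthetic data with an arbitrary seed, is to fix the seed, use several random starts, and then renumber the clusters by their mean goals so that cluster 1 is always the highest-scoring group. The renumbering rule is my own assumption, not part of the analysis above.

```r
set.seed(123)  # arbitrary seed so that repeated runs give the same partition

# Synthetic stand-in for one season: three well-separated groups of scorers
toy <- data.frame(
  Goals = c(rnorm(20, 5, 1),  rnorm(20, 15, 1), rnorm(20, 30, 1)),
  Shots = c(rnorm(20, 30, 3), rnorm(20, 70, 3), rnorm(20, 140, 3))
)

# nstart = 25 keeps the best of 25 random initializations
fit <- kmeans(scale(toy), centers = 3, nstart = 25)

# Mean goals per original cluster label (positions 1, 2, 3)
mean_goals <- tapply(toy$Goals, fit$cluster, mean)

# Renumber: the cluster with the highest mean goals becomes 1, and so on
toy$cluster <- match(fit$cluster, order(mean_goals, decreasing = TRUE))

tapply(toy$Goals, toy$cluster, mean)  # means now decrease from cluster 1 to 3
```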
clust2017 <- kmeans(pl_scale_2017, centers = 3)
players_2017$cluster <- clust2017$cluster
plot(players_2017[,c(2,6)], col = players_2017$cluster, pch=".", cex=3)
fviz_cluster(list(data=players_2017[,c(2,6)], cluster=clust2017$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
clust2018 <- kmeans(pl_scale_2018, centers = 3)
players_2018$cluster <- clust2018$cluster
plot(players_2018[,c(2,6)], col = players_2018$cluster, pch=".", cex=3)
fviz_cluster(list(data=players_2018[,c(2,6)], cluster=clust2018$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
clust2019 <- kmeans(pl_scale_2019, centers = 3)
players_2019$cluster <- clust2019$cluster
plot(players_2019[,c(2,6)], col = players_2019$cluster, pch=".", cex=3)
fviz_cluster(list(data=players_2019[,c(2,6)], cluster=clust2019$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
clust2020 <- kmeans(pl_scale_2020, centers = 3)
players_2020$cluster <- clust2020$cluster
plot(players_2020[,c(2,6)], col = players_2020$cluster, pch=".", cex=3)
fviz_cluster(list(data=players_2020[,c(2,6)], cluster=clust2020$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
In these graphs I have used two columns to visualize the results: matches played and shots. You can observe that there is no clear border between the clusters. However, please remember that the clusters were built from all eight statistics, not only these two. We can check the quality of our clusters with a different method, which I describe next.
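Since the clusters live in eight dimensions, a two-column scatter plot can hide their separation. One common option, sketched here on synthetic data with hypothetical values, is to project all standardized features onto the first two principal components with `prcomp` and color the points by cluster:

```r
set.seed(1)  # arbitrary seed for a repeatable sketch

# Synthetic stand-in with four correlated features (hypothetical values)
n <- 60
base <- c(rnorm(n / 3, 5), rnorm(n / 3, 12), rnorm(n / 3, 25))
toy <- data.frame(
  Goals    = base,
  Shots    = 4 * base + rnorm(n, sd = 3),
  OnTarget = 2 * base + rnorm(n, sd = 2),
  Mins     = 90 * base + rnorm(n, sd = 100)
)

toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Project all standardized features onto the principal components
pca <- prcomp(toy_scaled)
plot(pca$x[, 1:2], col = fit$cluster, pch = 19, xlab = "PC1", ylab = "PC2")

# Share of the total variance captured by the two plotted components
var_share <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
var_share
```

The closer `var_share` is to 1, the more faithfully the plot reflects the distances that k-means actually used.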
To analyze the quality of our clusters and find the cluster that contains the best players, let’s first create a data frame of top strikers. I have taken the list of top strikers from the “Ballon d’Or 2021” results.
top_players <- c("Lionel Messi", "Robert Lewandowski","Karim Benzema", "Cristiano Ronaldo", "Mohamed Salah", "Kylian Mbappe-Lottin", "Erling Haaland", "Romelu Lukaku", "Raheem Sterling", "Neymar", "Luis Suarez", "Riyad Mahrez", "Lautaro Martinez", "Harry Kane", "Gerard Moreno")
top_players <- data.frame(top_players)
colnames(top_players)[1] <- "Players"
Before proceeding, we need to combine our results to see which cluster each player belongs to in each year.
result <- data.frame(unique(players$Player.Names))
colnames(result)[1] <- "Players"
result <- merge(x = result, y = players_2016[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE)
colnames(result)[2] <- "Cluster_2016"
result <- merge(x = result, y = players_2017[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE)
colnames(result)[3] <- "Cluster_2017"
result <- merge(x = result, y = players_2018[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE)
colnames(result)[4] <- "Cluster_2018"
result <- merge(x = result, y = players_2019[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE)
colnames(result)[5] <- "Cluster_2019"
result <- merge(x = result, y = players_2020[,c(1,10)], by.x = "Players", by.y = "Player.Names", all.x = TRUE)
colnames(result)[6] <- "Cluster_2020"
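The five merges above follow one pattern, so they can also be written as a single fold over a list of seasons. The following is only a sketch on toy data with hypothetical players and labels; it uses a full outer join, which keeps every player that appears in any season:

```r
# Toy per-season results (hypothetical players and cluster labels)
seasons <- list(
  "2019" = data.frame(Player.Names = c("A", "B"), cluster = c(1L, 2L)),
  "2020" = data.frame(Player.Names = c("B", "C"), cluster = c(3L, 1L))
)

# Rename each cluster column to Cluster_<year> so the merges do not collide
renamed <- lapply(names(seasons), function(y) {
  d <- seasons[[y]]
  names(d) <- c("Players", paste0("Cluster_", y))
  d
})

# Full outer join of all seasons on the player name
result_toy <- Reduce(function(x, y) merge(x, y, by = "Players", all = TRUE),
                     renamed)
result_toy
```

Players without data in a season get NA in that season's column, just as in the step-by-step version.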
Now, let’s merge our results with the top players to determine their clusters in each year.
final_result <- merge(x = top_players, y = result, by = "Players")
final_result
## Players Cluster_2016 Cluster_2017 Cluster_2018 Cluster_2019
## 1 Cristiano Ronaldo 1 2 3 3
## 2 Erling Haaland NA NA NA 2
## 3 Gerard Moreno 2 3 NA 1
## 4 Harry Kane 1 NA 3 1
## 5 Karim Benzema 3 NA 2 3
## 6 Kylian Mbappe-Lottin NA NA 3 3
## 7 Lionel Messi 1 2 3 3
## 8 Luis Suarez 1 2 3 3
## 9 Mohamed Salah 2 NA 3 3
## 10 Raheem Sterling NA NA 2 3
## 11 Riyad Mahrez NA NA NA 1
## 12 Robert Lewandowski 1 2 3 3
## 13 Romelu Lukaku 2 NA NA 3
## Cluster_2020
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 NA
## 11 1
## 12 1
## 13 1
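Note that only 13 of the 15 names survive the merge: Neymar and Lautaro Martinez are missing from the table. An exact merge is sensitive to spelling, and the dataset contains variants such as "Lautaro MartInez" and names with a trailing space such as "Portu ". A minimal sketch of a more tolerant match follows; `normalize_name` is a hypothetical helper introduced only for this illustration.

```r
# Names as they are spelled in the dataset output, and names we look for
dataset_names <- c("Lautaro MartInez", "Portu ", "Harry Kane")
wanted        <- c("Lautaro Martinez", "Harry Kane", "Neymar")

# Hypothetical helper: trim surrounding whitespace and ignore letter case
normalize_name <- function(x) tolower(trimws(x))

matched   <- wanted[normalize_name(wanted) %in% normalize_name(dataset_names)]
unmatched <- setdiff(wanted, matched)

matched    # names found after normalization
unmatched  # names that still need a manual check
```

Names that remain unmatched after normalization have to be looked up by hand in the data.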
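As you can see, every top striker with 2020 data was in the first cluster in 2020 (Raheem Sterling has no 2020 value). Note that k-means numbers its clusters arbitrarily, so cluster 1 in 2020 is not necessarily the same kind of group as cluster 1 in another year. Let’s find out which other players were also in the first cluster in 2020.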
clust1_2020 <- players_2020$Player.Names[players_2020$cluster == 1]
clust1_2020
## [1] "Youssef En-Nesyri" "Portu " "Karim Benzema"
## [4] "Carlos Soler" "Cristian Tello" "Mikel Oyarzabal"
## [7] "Esteban Burgos" "Lucas Perez" "Lionel Messi"
## [10] "Joao Felix" "Luis Suarez" "Paco Alcacer"
## [13] "Morales " "Federico Valverde" "Antoine Griezmann"
## [16] "Iago Aspas" "Kike GarcIa" "Angel Rodriguez"
## [19] "Ansu Fati" "Gerard Moreno" "Roberto Soriano"
## [22] "Francesco Caputo" "Joao Pedro" "Alejandro Gomez"
## [25] "Andrea Belotti" "Gaetano Castrovilli" "Giovanni Simeone"
## [28] "Domenico Berardi" "Henrikh Mkhitaryan" "Jordan Veretout"
## [31] "Gervinho " "Lautaro MartInez" "Luis Muriel"
## [34] "Cristiano Ronaldo" "Romelu Lukaku" "Dries Mertens"
## [37] "Ciro Immobile" "Fabio Quagliarella" "Zlatan Ibrahimovic"
## [40] "Hirving Lozano" "Lucas Alario" "Bas Dost"
## [43] "Serge Gnabry" "Wout Weghorst" "Dani Olmo"
## [46] "Ellyes Skhiri" "Thomas Muller" "Max Kruse"
## [49] "Andre Silva" "Andre Hahn" "Jean-Philippe Mateta"
## [52] "Andrej Kramaric" "Niclas Fullkrug" "Lars Stindl"
## [55] "Daniel Caligiuri" "Erling Haaland" "Jhon Cordoba"
## [58] "Nils Petersen" "Matheus Cunha" "Robert Lewandowski"
## [61] "Ludovic Blas" "Stephane Bahoken" "Karl Toko"
## [64] "Kevin Volland" "Andy Delort" "Burak Yilmaz"
## [67] "Ibrahima Niane" "Boulaye Dia" "Moise Kean"
## [70] "Ignatius Ganago" "Irvin Cardona" "Wissam Ben"
## [73] "Florian Thauvin" "Amine Gouiri" "Jonathan Bamba"
## [76] "Ludovic Ajorque" "Kylian Mbappe-Lottin" "Mama Balde"
## [79] "Memphis Depay" "Gael Kakuta" "James Ward-Prowse"
## [82] "Bruno Fernandes" "Dominic Calvert-Lewin" "Timo Werner"
## [85] "Callum Wilson" "Diogo Jota" "Wilfried Zaha"
## [88] "Jack Grealish" "Raul Jimenez" "Jarrod Bowen"
## [91] "Patrick Bamford" "Mohamed Salah" "Jamie Vardy"
## [94] "Harry Kane" "Danny Ings " "Neal Maupay"
## [97] "Ollie Watkins" "Sadio Mane" "Riyad Mahrez"
## [100] "Son Heung-Min" "Raphael Veiga" "Alerrandro "
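To understand what sets this cluster apart, it can help to compare the average statistics of each cluster. Here is a small sketch on a toy table with hypothetical values; with the real data one would list all eight statistics inside `cbind()`.

```r
# Toy season table with an assigned cluster (hypothetical values)
toy <- data.frame(
  Goals   = c(25, 28, 11, 13, 5, 6),
  Shots   = c(162, 120, 48, 50, 20, 25),
  cluster = c(1, 1, 2, 2, 3, 3)
)

# Mean of each statistic within each cluster
profile <- aggregate(cbind(Goals, Shots) ~ cluster, data = toy, FUN = mean)
profile
```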
The other players in this list fall into the same cluster as the top strikers, which suggests that they are also good strikers. The list could be made more accurate with more player statistics. It can help talent managers and coaches find new talented strikers.