Unsupervised Learning : English Premier League 2021 - 2022
The English Premier League is one of the most famous football leagues in the world, if not the best. The matches are entertaining, fast, and physical, so it is no wonder that so many people watch them. The league is full of good players, and in this article I want to explore their statistics to see which of them actually stand out. With the data in hand, we can dig for more insight by combining Principal Component Analysis (PCA) and clustering.
Import Library
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(dplyr)
library(tidyr)
Import Data
I got the dataset from Kaggle. It contains statistics for football players in the Premier League (2021-2022).
# Read data
epl <- read.csv('Football Players Stats (Premier League 2021-2022).csv')
glimpse(epl)
#> Rows: 691
#> Columns: 30
#> $ Player <chr> "Bukayo Saka", "Gabriel Dos Santos", "Aaron Ramsdale", "Ben …
#> $ Team <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arse…
#> $ Nation <chr> "eng\xa0ENG", "br\xa0BRA", "eng\xa0ENG", "eng\xa0ENG", "no\x…
#> $ Pos <chr> "FW,MF", "DF", "GK", "DF", "MF", "MF,DF", "MF", "DF", "MF,FW…
#> $ Age <int> 19, 23, 23, 23, 22, 28, 28, 24, 21, 20, 30, 22, 29, 21, 21, …
#> $ MP <int> 38, 35, 34, 32, 36, 27, 24, 22, 33, 29, 30, 21, 21, 22, 19, …
#> $ Starts <int> 36, 35, 34, 32, 32, 27, 23, 22, 21, 21, 20, 20, 16, 13, 12, …
#> $ Min <chr> "2,978", "3,063", "3,060", "2,880", "2,785", "2,327", "2,028…
#> $ X90s <dbl> 33.1, 34.0, 34.0, 32.0, 30.9, 25.9, 22.5, 21.3, 21.3, 20.7, …
#> $ Gls <int> 11, 5, 0, 0, 7, 1, 2, 1, 10, 6, 4, 0, 1, 1, 0, 4, 1, 5, 0, 1…
#> $ Ast <int> 7, 0, 0, 0, 4, 2, 1, 3, 2, 6, 7, 1, 1, 1, 0, 1, 0, 1, 2, 2, …
#> $ G.PK <int> 9, 5, 0, 0, 7, 1, 2, 1, 10, 5, 2, 0, 1, 1, 0, 4, 1, 5, 0, 1,…
#> $ PK <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ PKatt <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 2, 0, 0, 0, 0, …
#> $ CrdY <int> 6, 8, 1, 3, 4, 10, 6, 0, 1, 3, 0, 2, 3, 2, 5, 3, 4, 3, 1, 0,…
#> $ CrdR <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
#> $ Gls.1 <dbl> 0.33, 0.15, 0.00, 0.00, 0.23, 0.04, 0.09, 0.05, 0.47, 0.29, …
#> $ Ast.1 <dbl> 0.21, 0.00, 0.00, 0.00, 0.13, 0.08, 0.04, 0.14, 0.09, 0.29, …
#> $ G.A <dbl> 0.54, 0.15, 0.00, 0.00, 0.36, 0.12, 0.13, 0.19, 0.56, 0.58, …
#> $ G.PK.1 <dbl> 0.27, 0.15, 0.00, 0.00, 0.23, 0.04, 0.09, 0.05, 0.47, 0.24, …
#> $ G.A.PK <dbl> 0.48, 0.15, 0.00, 0.00, 0.36, 0.12, 0.13, 0.19, 0.56, 0.53, …
#> $ xG <dbl> 9.7, 2.7, 0.0, 1.0, 4.8, 1.2, 2.5, 0.7, 5.8, 7.2, 7.9, 0.8, …
#> $ npxG <dbl> 8.2, 2.7, 0.0, 1.0, 4.8, 1.2, 2.5, 0.7, 5.8, 6.5, 5.6, 0.8, …
#> $ xA <dbl> 6.9, 0.8, 0.0, 0.6, 6.8, 2.3, 1.3, 1.9, 2.2, 3.3, 1.9, 0.6, …
#> $ npxG.xA <dbl> 15.2, 3.5, 0.0, 1.6, 11.6, 3.5, 3.8, 2.6, 8.0, 9.8, 7.6, 1.4…
#> $ xG.1 <dbl> 0.29, 0.08, 0.00, 0.03, 0.16, 0.05, 0.11, 0.03, 0.27, 0.35, …
#> $ xA.1 <dbl> 0.21, 0.02, 0.00, 0.02, 0.22, 0.09, 0.06, 0.09, 0.10, 0.16, …
#> $ xG.xA <dbl> 0.50, 0.10, 0.00, 0.05, 0.38, 0.14, 0.17, 0.12, 0.37, 0.51, …
#> $ npxG.1 <dbl> 0.25, 0.08, 0.00, 0.03, 0.16, 0.05, 0.11, 0.03, 0.27, 0.31, …
#> $ npxG.xA.1 <dbl> 0.46, 0.10, 0.00, 0.05, 0.38, 0.14, 0.17, 0.12, 0.37, 0.47, …
About the data :
Player : Player's name
Team : Club played for in 2021-2022
Nation : Player's nation
Pos : Position
Age : Player's age
MP : Matches played
Starts : Matches started
Min : Minutes played
90s : Minutes played divided by 90
Gls : Goals scored or allowed
Ast : Assists
G-PK : Non-Penalty Goals
PK : Penalty Kicks made
PKatt : Penalty Kicks attempted
CrdY : Yellow Cards
CrdR : Red Cards
Gls : Goals scored per 90 mins
Ast : Assists per 90 mins
G+A : Goals and Assists per 90 mins
G-PK : Goals minus Penalty Kicks made per 90 mins
G+A-PK : Goals plus Assists minus Penalty Kicks made per 90 mins
xG : Expected Goals
npxG : Non-Penalty Expected Goals
xA : Expected Assists
npxG+xA : Non-Penalty Expected Goals plus Expected Assists
xG : Expected Goals per 90 mins
xA : Expected Assists per 90 mins
xG+xA : Expected Goals plus Expected Assists per 90 mins
npxG : Non-Penalty Expected Goals per 90 mins
npxG+xA : Non-Penalty Expected Goals plus Expected Assists per 90 mins
When the file is read, read.csv() renames some of these columns: 90s becomes X90s, the + and - signs become dots (G+A becomes G.A), and a repeated name gets a .1 suffix (the per-90 Gls becomes Gls.1).
This is a rich dataset, and there is a lot we could do with it. In this article, though, I am only going to explore the players' statistics with unsupervised learning. Unsupervised learning refers to the use of algorithms to identify patterns in data sets whose data points are neither classified nor labeled. Here we apply it with R in RStudio.
Data Wrangling
Check for missing values
is.na(epl) %>% colSums()
#> Player Team Nation Pos Age MP Starts Min
#> 0 0 0 0 4 0 0 0
#> X90s Gls Ast G.PK PK PKatt CrdY CrdR
#> 144 144 144 144 144 144 144 144
#> Gls.1 Ast.1 G.A G.PK.1 G.A.PK xG npxG xA
#> 145 145 145 145 145 145 145 145
#> npxG.xA xG.1 xA.1 xG.xA npxG.1 npxG.xA.1
#> 145 145 145 145 145 145
The data has missing values. We can drop the rows that contain them.
epl_drop <- epl %>% drop_na()
anyNA(epl_drop)
#> [1] FALSE
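To make the row-wise deletion concrete, here is a minimal sketch on a made-up data frame (the players and values are invented for illustration). It uses base R's complete.cases(), which should keep the same rows as drop_na() when no columns are specified.

```r
# Toy data frame (made-up values) with gaps in two columns
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  Age    = c(19, NA, 23, 28),
  Gls    = c(11, 5, NA, 1)
)

# Missing values per column, same idea as the check above
na_per_column <- colSums(is.na(toy))

# Keep only the rows with no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

print(na_per_column)
print(toy_complete)
```

A row is removed as soon as one of its columns is missing, so players B and C both disappear even though each has only one gap.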
We are only going to use the numeric columns of the dataset, with the player names as row names.
epl_unique <- epl_drop[!duplicated(epl_drop$Player), ]
# Assign the player column as rownames
rownames(epl_unique) <- epl_unique$Player
# Select the columns
epl_clean <- epl_unique %>%
  select(-c(Player, Age, Team, Nation, Pos, Min))
head(epl_clean)
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Bukayo Saka 38 36 33.1 11 7 9 2 2 6 0 0.33 0.21
#> Gabriel Dos Santos 35 35 34.0 5 0 5 0 0 8 1 0.15 0.00
#> Aaron Ramsdale 34 34 34.0 0 0 0 0 0 1 0 0.00 0.00
#> Ben White 32 32 32.0 0 0 0 0 0 3 0 0.00 0.00
#> Martin \xd8degaard 36 32 30.9 7 4 7 0 0 4 0 0.23 0.13
#> Granit Xhaka 27 27 25.9 1 2 1 0 0 10 1 0.04 0.08
#> G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1 xA.1 xG.xA
#> Bukayo Saka 0.54 0.27 0.48 9.7 8.2 6.9 15.2 0.29 0.21 0.50
#> Gabriel Dos Santos 0.15 0.15 0.15 2.7 2.7 0.8 3.5 0.08 0.02 0.10
#> Aaron Ramsdale 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00
#> Ben White 0.00 0.00 0.00 1.0 1.0 0.6 1.6 0.03 0.02 0.05
#> Martin \xd8degaard 0.36 0.23 0.36 4.8 4.8 6.8 11.6 0.16 0.22 0.38
#> Granit Xhaka 0.12 0.04 0.12 1.2 1.2 2.3 3.5 0.05 0.09 0.14
#> npxG.1 npxG.xA.1
#> Bukayo Saka 0.25 0.46
#> Gabriel Dos Santos 0.08 0.10
#> Aaron Ramsdale 0.00 0.00
#> Ben White 0.03 0.05
#> Martin \xd8degaard 0.16 0.38
#> Granit Xhaka 0.05 0.14
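A side note on Min: it is read as text because of the thousands separator ("2,978"), which is why it goes out together with the other non-numeric columns. If we wanted to keep it, a minimal sketch of the conversion could look like this. The four values are copied from the glimpse() output; the variable names are only for illustration.

```r
# Values copied from the glimpse() output above
min_raw <- c("2,978", "3,063", "3,060", "2,880")

# Remove the thousands separator, then convert to numeric
min_num <- as.numeric(gsub(",", "", min_raw, fixed = TRUE))
print(min_num)

# Minutes divided by 90 should match the X90s column for the same players
print(round(min_num / 90, 1))
```

Since X90s already carries the same information, dropping Min does not lose anything for the clustering.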
Exploratory Data Analysis and Scaling
Make sure the data contains only numeric values, because k-means computes Euclidean distances between observations.
str(epl_clean)
#> 'data.frame': 537 obs. of 24 variables:
#> $ MP : int 38 35 34 32 36 27 24 22 33 29 ...
#> $ Starts : int 36 35 34 32 32 27 23 22 21 21 ...
#> $ X90s : num 33.1 34 34 32 30.9 25.9 22.5 21.3 21.3 20.7 ...
#> $ Gls : int 11 5 0 0 7 1 2 1 10 6 ...
#> $ Ast : int 7 0 0 0 4 2 1 3 2 6 ...
#> $ G.PK : int 9 5 0 0 7 1 2 1 10 5 ...
#> $ PK : int 2 0 0 0 0 0 0 0 0 1 ...
#> $ PKatt : int 2 0 0 0 0 0 0 0 0 1 ...
#> $ CrdY : int 6 8 1 3 4 10 6 0 1 3 ...
#> $ CrdR : int 0 1 0 0 0 1 0 0 0 1 ...
#> $ Gls.1 : num 0.33 0.15 0 0 0.23 0.04 0.09 0.05 0.47 0.29 ...
#> $ Ast.1 : num 0.21 0 0 0 0.13 0.08 0.04 0.14 0.09 0.29 ...
#> $ G.A : num 0.54 0.15 0 0 0.36 0.12 0.13 0.19 0.56 0.58 ...
#> $ G.PK.1 : num 0.27 0.15 0 0 0.23 0.04 0.09 0.05 0.47 0.24 ...
#> $ G.A.PK : num 0.48 0.15 0 0 0.36 0.12 0.13 0.19 0.56 0.53 ...
#> $ xG : num 9.7 2.7 0 1 4.8 1.2 2.5 0.7 5.8 7.2 ...
#> $ npxG : num 8.2 2.7 0 1 4.8 1.2 2.5 0.7 5.8 6.5 ...
#> $ xA : num 6.9 0.8 0 0.6 6.8 2.3 1.3 1.9 2.2 3.3 ...
#> $ npxG.xA : num 15.2 3.5 0 1.6 11.6 3.5 3.8 2.6 8 9.8 ...
#> $ xG.1 : num 0.29 0.08 0 0.03 0.16 0.05 0.11 0.03 0.27 0.35 ...
#> $ xA.1 : num 0.21 0.02 0 0.02 0.22 0.09 0.06 0.09 0.1 0.16 ...
#> $ xG.xA : num 0.5 0.1 0 0.05 0.38 0.14 0.17 0.12 0.37 0.51 ...
#> $ npxG.1 : num 0.25 0.08 0 0.03 0.16 0.05 0.11 0.03 0.27 0.31 ...
#> $ npxG.xA.1: num 0.46 0.1 0 0.05 0.38 0.14 0.17 0.12 0.37 0.47 ...
Check the range of each variable with summary()
summary(epl_clean)
#> MP Starts X90s Gls
#> Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 9.00 1st Qu.: 4.00 1st Qu.: 4.50 1st Qu.: 0.000
#> Median :20.00 Median :15.00 Median :14.90 Median : 1.000
#> Mean :19.35 Mean :15.42 Mean :15.39 Mean : 1.922
#> 3rd Qu.:30.00 3rd Qu.:25.00 3rd Qu.:24.10 3rd Qu.: 2.000
#> Max. :38.00 Max. :38.00 Max. :38.00 Max. :23.000
#> Ast G.PK PK PKatt
#> Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000
#> Median : 1.000 Median : 1.000 Median :0.0000 Median :0.0000
#> Mean : 1.385 Mean : 1.769 Mean :0.1527 Mean :0.1881
#> 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:0.0000 3rd Qu.:0.0000
#> Max. :13.000 Max. :23.000 Max. :6.0000 Max. :7.0000
#> CrdY CrdR Gls.1 Ast.1
#> Min. : 0.000 Min. :0.00000 Min. :0.0000 Min. : 0.0000
#> 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 0.0000
#> Median : 2.000 Median :0.00000 Median :0.0300 Median : 0.0300
#> Mean : 2.475 Mean :0.08007 Mean :0.1093 Mean : 0.1018
#> 3rd Qu.: 4.000 3rd Qu.:0.00000 3rd Qu.:0.1500 3rd Qu.: 0.1200
#> Max. :11.000 Max. :2.00000 Max. :2.0300 Max. :11.2500
#> G.A G.PK.1 G.A.PK xG
#> Min. : 0.0000 Min. :0.0000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 0.100
#> Median : 0.1000 Median :0.0300 Median : 0.1000 Median : 0.800
#> Mean : 0.2111 Mean :0.1024 Mean : 0.2042 Mean : 1.949
#> 3rd Qu.: 0.2900 3rd Qu.:0.1400 3rd Qu.: 0.2800 3rd Qu.: 2.500
#> Max. :11.2500 Max. :2.0300 Max. :11.2500 Max. :21.800
#> npxG xA npxG.xA xG.1
#> Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. :0.0000
#> 1st Qu.: 0.100 1st Qu.: 0.100 1st Qu.: 0.30 1st Qu.:0.0100
#> Median : 0.800 Median : 0.700 Median : 1.60 Median :0.0600
#> Mean : 1.805 Mean : 1.312 Mean : 3.12 Mean :0.1366
#> 3rd Qu.: 2.400 3rd Qu.: 2.000 3rd Qu.: 4.40 3rd Qu.:0.1700
#> Max. :17.100 Max. :11.200 Max. :27.40 Max. :4.4800
#> xA.1 xG.xA npxG.1 npxG.xA.1
#> Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.01000 1st Qu.:0.0500 1st Qu.:0.0100 1st Qu.:0.0500
#> Median :0.06000 Median :0.1300 Median :0.0600 Median :0.1300
#> Mean :0.09227 Mean :0.2291 Mean :0.1297 Mean :0.2222
#> 3rd Qu.:0.12000 3rd Qu.:0.3300 3rd Qu.:0.1600 3rd Qu.:0.3200
#> Max. :6.50000 Max. :6.5000 Max. :4.4800 Max. :6.5000
The variables are still on different scales, so we standardize them with scale().
epl_scale <- scale(epl_clean)
head(epl_scale)
#> MP Starts X90s Gls Ast
#> Bukayo Saka 1.6025461 1.7669405 1.5691333 2.7691376 2.7408214
#> Gabriel Dos Santos 1.3447884 1.6810797 1.6488920 0.9389506 -0.6763420
#> Aaron Ramsdale 1.2588691 1.5952190 1.6488920 -0.5862051 -0.6763420
#> Ben White 1.0870306 1.4234975 1.4716504 -0.5862051 -0.6763420
#> Martin \xd8degaard 1.4307076 1.4234975 1.3741674 1.5490130 1.2763228
#> Granit Xhaka 0.6574343 0.9941938 0.9310633 -0.2811740 0.2999904
#> G.PK PK PKatt CrdY CrdR
#> Bukayo Saka 2.4477935 2.7186354 2.2884894 1.3697775 -0.280898
#> Gabriel Dos Santos 1.0937218 -0.2247259 -0.2375513 2.1469254 3.227060
#> Aaron Ramsdale -0.5988678 -0.2247259 -0.2375513 -0.5730923 -0.280898
#> Ben White -0.5988678 -0.2247259 -0.2375513 0.2040556 -0.280898
#> Martin \xd8degaard 1.7707577 -0.2247259 -0.2375513 0.5926295 -0.280898
#> Granit Xhaka -0.2603499 -0.2247259 -0.2375513 2.9240733 3.227060
#> Gls.1 Ast.1 G.A G.PK.1 G.A.PK
#> Bukayo Saka 1.1867295 0.21587135 0.6085622 0.9363269 0.5132679
#> Gabriel Dos Santos 0.2187188 -0.20327142 -0.1130911 0.2659169 -0.1008032
#> Aaron Ramsdale -0.5879569 -0.20327142 -0.3906500 -0.5720958 -0.3799264
#> Ben White -0.5879569 -0.20327142 -0.3906500 -0.5720958 -0.3799264
#> Martin \xd8degaard 0.6489458 0.05619791 0.2754915 0.7128569 0.2899693
#> Granit Xhaka -0.3728434 -0.04359798 -0.1686029 -0.3486257 -0.1566278
#> xG npxG xA npxG.xA xG.1
#> Bukayo Saka 2.5824284 2.3767810 3.2461211 2.96090107 0.5919711
#> Gabriel Dos Santos 0.2503327 0.3325099 -0.2973736 0.09315861 -0.2186735
#> Aaron Ramsdale -0.6491899 -0.6710414 -0.7620942 -0.76471306 -0.5274904
#> Ben White -0.3160334 -0.2993558 -0.4135537 -0.37254316 -0.4116841
#> Martin \xd8degaard 0.9499614 1.1130497 3.1880310 2.07851877 0.0901435
#> Granit Xhaka -0.2494021 -0.2250186 0.5739776 0.09315861 -0.3344798
#> xA.1 xG.xA npxG.1 npxG.xA.1
#> Bukayo Saka 0.396273633 0.6683443 0.4725364 0.5920731
#> Gabriel Dos Santos -0.243267632 -0.3184604 -0.1954390 -0.3041728
#> Aaron Ramsdale -0.310587765 -0.5651616 -0.5097803 -0.5531300
#> Ben White -0.243267632 -0.4418110 -0.3919023 -0.4286514
#> Martin \xd8degaard 0.429933700 0.3723029 0.1189024 0.3929073
#> Granit Xhaka -0.007647166 -0.2197800 -0.3133170 -0.2045900
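To show what scale() is doing, here is a small sketch on made-up values (not taken from the dataset). Each column is centered on its mean and divided by its standard deviation, so a column counted in matches and a column measured per 90 minutes end up comparable when distances are computed.

```r
# Made-up matches-played and goals-per-90 values on very different scales
toy <- data.frame(
  MP    = c(38, 35, 20, 9, 1),
  Gls.1 = c(0.33, 0.15, 0.03, 0.00, 0.47)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# The same computation written out by hand for one column
mp_manual <- (toy$MP - mean(toy$MP)) / sd(toy$MP)

print(round(toy_scaled, 3))
print(all.equal(as.numeric(toy_scaled[, "MP"]), mp_manual))

# Euclidean distance between the first two rows, before and after scaling
print(dist(toy[1:2, ]))
print(dist(toy_scaled[1:2, ]))
```

Before scaling, the distance between the two rows is driven almost entirely by MP. After scaling, both columns contribute on the same footing.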
Data Pre-processing
Unsupervised Learning : Clustering
Find the optimum k
library(factoextra)
fviz_nbclust(
  x = epl_scale,
  FUNcluster = kmeans,
  method = "wss"
)
From the plot, we can see that the curve bends at k = 4 (the elbow method), so 4 is a reasonable choice for k. We can therefore divide the EPL players into 4 clusters.
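For readers who want to see what sits behind the elbow plot, here is a rough base R sketch of the same idea. It runs k-means for several values of k and records the total within-cluster sum of squares. The data is simulated, and nstart = 25 and the range 1 to 8 are my own choices for the illustration, not settings taken from fviz_nbclust().

```r
set.seed(70)

# Made-up data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is a visual judgment call, so it is worth checking whether the chosen k also gives clusters that make sense for the problem.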
# k-means clustering
RNGkind(sample.kind = "Rounding")
set.seed(70)
epl_cluster <- kmeans(epl_scale, centers = 4)
# calculate the size for every cluster
epl_cluster$size
#> [1] 34 224 72 207
The second cluster is the largest and the first is the smallest. My guess is that the clusters will separate the players into best, good, average, and bad.
# calculate the center for every cluster
epl_cluster$centers
#> MP Starts X90s Gls Ast G.PK
#> 1 0.5588799 0.6381243 0.6198478 -0.06585786 0.2425591 -0.03135249
#> 2 -0.9734973 -0.9361396 -0.9387571 -0.50994733 -0.5564798 -0.51726085
#> 3 0.7708000 0.6555209 0.6538770 1.95995770 1.2831029 1.88829862
#> 4 0.6935453 0.6802007 0.6866068 -0.11908012 0.1160437 -0.09190863
#> PK PKatt CrdY CrdR Gls.1 Ast.1
#> 1 -0.1814412 -0.2004037 1.0612040 3.43341086 -0.1419127 0.003951826
#> 2 -0.2115859 -0.2206359 -0.6702358 -0.24957691 -0.3207456 -0.110514697
#> 3 1.2469548 1.3061402 0.3497708 -0.03728975 1.5669127 0.439857695
#> 4 -0.1749589 -0.1826374 0.4293159 -0.28089797 -0.1746168 -0.034052290
#> G.A G.PK.1 G.A.PK xG npxG xA
#> 1 -0.04506191 -0.1152967 -0.03403251 -0.06714588 -0.02715068 -0.0820986
#> 2 -0.21279751 -0.3069745 -0.20530822 -0.53407245 -0.54642717 -0.6013092
#> 3 0.94857198 1.4507736 0.89447785 1.96146166 1.89617077 1.4106360
#> 4 -0.09226297 -0.1534943 -0.08336357 -0.09328527 -0.06377529 0.1735208
#> npxG.xA xG.1 xA.1 xG.xA npxG.1 npxG.xA.1
#> 1 -0.05534691 -0.1982371 -0.1096174 -0.2081705 -0.1769483 -0.1943388
#> 2 -0.61403841 -0.1471561 -0.1065236 -0.1727526 -0.1329917 -0.1635787
#> 3 1.84498704 1.1077383 0.5056688 1.0795129 1.0062618 1.0128799
#> 4 0.03182286 -0.1934982 -0.0426081 -0.1543505 -0.1770264 -0.1433730
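A quick way to read this table: each center is simply the average of the scaled rows in that cluster, so a positive value means "above the league average" and a negative one "below". The sketch below checks that on simulated data. The column names Gls and CrdY are borrowed only as labels and the numbers are made up.

```r
set.seed(70)

# Made-up scaled data with two obvious groups
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
))
colnames(toy) <- c("Gls", "CrdY")

km <- kmeans(toy, centers = 2, nstart = 10)

# Average the scaled rows within each cluster label
manual_centers <- apply(toy, 2, function(col) tapply(col, km$cluster, mean))

print(km$centers)
print(manual_centers)
```

The two printed tables should agree, which is why the centers can be read as z-score profiles of the clusters.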
Label the cluster for every observation
as.data.frame(epl_cluster$cluster) %>% head()
#> epl_cluster$cluster
#> Bukayo Saka 3
#> Gabriel Dos Santos 1
#> Aaron Ramsdale 4
#> Ben White 4
#> Martin \xd8degaard 3
#> Granit Xhaka 1
Cluster Profiling
Make a new column with the cluster label
epl_clean$cluster <- epl_cluster$cluster
Check the first 6 rows of the data
head(epl_clean)
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Bukayo Saka 38 36 33.1 11 7 9 2 2 6 0 0.33 0.21
#> Gabriel Dos Santos 35 35 34.0 5 0 5 0 0 8 1 0.15 0.00
#> Aaron Ramsdale 34 34 34.0 0 0 0 0 0 1 0 0.00 0.00
#> Ben White 32 32 32.0 0 0 0 0 0 3 0 0.00 0.00
#> Martin \xd8degaard 36 32 30.9 7 4 7 0 0 4 0 0.23 0.13
#> Granit Xhaka 27 27 25.9 1 2 1 0 0 10 1 0.04 0.08
#> G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1 xA.1 xG.xA
#> Bukayo Saka 0.54 0.27 0.48 9.7 8.2 6.9 15.2 0.29 0.21 0.50
#> Gabriel Dos Santos 0.15 0.15 0.15 2.7 2.7 0.8 3.5 0.08 0.02 0.10
#> Aaron Ramsdale 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00
#> Ben White 0.00 0.00 0.00 1.0 1.0 0.6 1.6 0.03 0.02 0.05
#> Martin \xd8degaard 0.36 0.23 0.36 4.8 4.8 6.8 11.6 0.16 0.22 0.38
#> Granit Xhaka 0.12 0.04 0.12 1.2 1.2 2.3 3.5 0.05 0.09 0.14
#> npxG.1 npxG.xA.1 cluster
#> Bukayo Saka 0.25 0.46 3
#> Gabriel Dos Santos 0.08 0.10 1
#> Aaron Ramsdale 0.00 0.00 4
#> Ben White 0.03 0.05 4
#> Martin \xd8degaard 0.16 0.38 3
#> Granit Xhaka 0.05 0.14 1
Grouping data based on cluster label
We group by cluster label so we can learn the characteristics of each cluster.
epl_clean %>%
  group_by(cluster) %>%
  summarise_all(mean)
#> # A tibble: 4 × 25
#> cluster MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 25.9 22.9 22.4 1.71 1.88 1.68 0.0294 0.0294 5.21 1.06
#> 2 2 8.02 4.52 4.80 0.25 0.246 0.241 0.00893 0.0134 0.75 0.00893
#> 3 3 28.3 23.1 22.8 8.35 4.01 7.35 1 1.22 3.38 0.0694
#> 4 4 27.4 23.3 23.1 1.53 1.62 1.50 0.0338 0.0435 3.58 0
#> # … with 14 more variables: Gls.1 <dbl>, Ast.1 <dbl>, G.A <dbl>, G.PK.1 <dbl>,
#> # G.A.PK <dbl>, xG <dbl>, npxG <dbl>, xA <dbl>, npxG.xA <dbl>, xG.1 <dbl>,
#> # xA.1 <dbl>, xG.xA <dbl>, npxG.1 <dbl>, npxG.xA.1 <dbl>
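These group means are the k-means centers expressed in the original units. As a sketch of that link, the example below undoes the scaling on the centers and compares the result with means computed directly on the raw data. The data is simulated, and I use base R's aggregate() in place of group_by() and summarise_all().

```r
set.seed(70)

# Made-up raw stats for 40 players: two groups with different goal counts
raw <- data.frame(
  MP  = c(rnorm(20, mean = 8, sd = 2), rnorm(20, mean = 28, sd = 3)),
  Gls = c(rnorm(20, mean = 0.5, sd = 0.2), rnorm(20, mean = 8, sd = 2))
)
scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Undo the scaling: multiply by each column's sd, then add its mean
col_sd   <- attr(scaled, "scaled:scale")
col_mean <- attr(scaled, "scaled:center")
centers_raw <- sweep(sweep(km$centers, 2, col_sd, "*"), 2, col_mean, "+")

# Per-cluster means computed directly on the raw data
group_means <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)

print(centers_raw)
print(group_means)
```

Working in original units is easier to explain to a coach: "8 goals a season" says more than "1.96 standard deviations above the mean".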
Filtering data based on cluster label
# Show some of the players from cluster 1
epl_clean[epl_clean$cluster==1,] %>% head()
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Gabriel Dos Santos 35 35 34.0 5 0 5 0 0 8 1 0.15 0.00
#> Granit Xhaka 27 27 25.9 1 2 1 0 0 10 1 0.04 0.08
#> Rob Holding 15 9 9.4 1 0 1 0 0 4 1 0.11 0.00
#> Ezri Konsa 29 29 27.5 2 0 2 0 0 6 2 0.07 0.00
#> Sergi Can\xf3s 31 25 23.1 3 2 3 0 0 8 1 0.13 0.09
#> Shandon Baptiste 22 9 10.1 1 1 1 0 0 2 1 0.10 0.10
#> G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1 xA.1 xG.xA
#> Gabriel Dos Santos 0.15 0.15 0.15 2.7 2.7 0.8 3.5 0.08 0.02 0.10
#> Granit Xhaka 0.12 0.04 0.12 1.2 1.2 2.3 3.5 0.05 0.09 0.14
#> Rob Holding 0.11 0.11 0.11 0.3 0.3 0.0 0.3 0.03 0.00 0.03
#> Ezri Konsa 0.07 0.07 0.07 1.5 1.5 0.3 1.7 0.05 0.01 0.06
#> Sergi Can\xf3s 0.22 0.13 0.22 3.0 3.0 1.9 4.9 0.13 0.08 0.21
#> Shandon Baptiste 0.20 0.10 0.20 0.7 0.7 0.5 1.2 0.07 0.05 0.12
#> npxG.1 npxG.xA.1 cluster
#> Gabriel Dos Santos 0.08 0.10 1
#> Granit Xhaka 0.05 0.14 1
#> Rob Holding 0.03 0.03 1
#> Ezri Konsa 0.05 0.06 1
#> Sergi Can\xf3s 0.13 0.21 1
#> Shandon Baptiste 0.07 0.12 1
# Show some of the players from cluster 2
epl_clean[epl_clean$cluster==2,] %>% head()
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1
#> Albert Sambi Lokonga 19 12 12.6 0 0 0 0 0 5 0 0.00
#> Mohamed Elneny 14 8 9.0 0 2 0 0 0 1 0 0.00
#> Nicolas P\xe9p\xe9 20 5 7.7 1 2 1 0 0 0 0 0.13
#> Bernd Leno 4 4 4.0 0 0 0 0 0 0 0 0.00
#> Ainsley Maitland-Niles 8 2 3.0 0 0 0 0 0 0 0 0.00
#> Pablo Mar\xed 2 2 2.0 0 0 0 0 0 1 0 0.00
#> Ast.1 G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1 xA.1
#> Albert Sambi Lokonga 0.00 0.00 0.00 0.00 0.6 0.6 0.8 1.4 0.05 0.06
#> Mohamed Elneny 0.22 0.22 0.00 0.22 0.2 0.2 0.6 0.7 0.02 0.06
#> Nicolas P\xe9p\xe9 0.26 0.39 0.13 0.39 2.9 2.9 1.7 4.6 0.38 0.22
#> Bernd Leno 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00
#> Ainsley Maitland-Niles 0.00 0.00 0.00 0.00 0.0 0.0 0.3 0.4 0.01 0.12
#> Pablo Mar\xed 0.00 0.00 0.00 0.00 0.3 0.3 0.1 0.4 0.15 0.05
#> xG.xA npxG.1 npxG.xA.1 cluster
#> Albert Sambi Lokonga 0.11 0.05 0.11 2
#> Mohamed Elneny 0.08 0.02 0.08 2
#> Nicolas P\xe9p\xe9 0.60 0.38 0.60 2
#> Bernd Leno 0.00 0.00 0.00 2
#> Ainsley Maitland-Niles 0.13 0.01 0.13 2
#> Pablo Mar\xed 0.21 0.15 0.21 2
# Show some of the players from cluster 3
epl_clean[epl_clean$cluster==3,] %>% head()
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1
#> Bukayo Saka 38 36 33.1 11 7 9 2 2 6 0 0.33
#> Martin \xd8degaard 36 32 30.9 7 4 7 0 0 4 0 0.23
#> Emile Smith Rowe 33 21 21.3 10 2 10 0 0 1 0 0.47
#> Martinelli 29 21 20.7 6 6 5 1 1 3 1 0.29
#> Alexandre Lacazette 30 20 19.8 4 7 2 2 3 0 0 0.20
#> Pierre-Emerick Aubameyang 14 12 11.5 4 1 4 0 2 3 0 0.35
#> Ast.1 G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1
#> Bukayo Saka 0.21 0.54 0.27 0.48 9.7 8.2 6.9 15.2 0.29
#> Martin \xd8degaard 0.13 0.36 0.23 0.36 4.8 4.8 6.8 11.6 0.16
#> Emile Smith Rowe 0.09 0.56 0.47 0.56 5.8 5.8 2.2 8.0 0.27
#> Martinelli 0.29 0.58 0.24 0.53 7.2 6.5 3.3 9.8 0.35
#> Alexandre Lacazette 0.35 0.56 0.10 0.45 7.9 5.6 1.9 7.6 0.40
#> Pierre-Emerick Aubameyang 0.09 0.43 0.35 0.43 5.8 4.1 0.8 5.0 0.50
#> xA.1 xG.xA npxG.1 npxG.xA.1 cluster
#> Bukayo Saka 0.21 0.50 0.25 0.46 3
#> Martin \xd8degaard 0.22 0.38 0.16 0.38 3
#> Emile Smith Rowe 0.10 0.37 0.27 0.37 3
#> Martinelli 0.16 0.51 0.31 0.47 3
#> Alexandre Lacazette 0.10 0.50 0.28 0.38 3
#> Pierre-Emerick Aubameyang 0.07 0.57 0.36 0.43 3
# Show some of the players from cluster 4
epl_clean[epl_clean$cluster==4,] %>% head()
#> MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Aaron Ramsdale 34 34 34.0 0 0 0 0 0 1 0 0.00 0.00
#> Ben White 32 32 32.0 0 0 0 0 0 3 0 0.00 0.00
#> Thomas Partey 24 23 22.5 2 1 2 0 0 6 0 0.09 0.04
#> Kieran Tierney 22 22 21.3 1 3 1 0 0 0 0 0.05 0.14
#> Takehiro Tomiyasu 21 20 18.7 0 1 0 0 0 2 0 0.00 0.05
#> C\xe9dric Soares 21 16 16.5 1 1 1 0 0 3 0 0.06 0.06
#> G.A G.PK.1 G.A.PK xG npxG xA npxG.xA xG.1 xA.1 xG.xA
#> Aaron Ramsdale 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00
#> Ben White 0.00 0.00 0.00 1.0 1.0 0.6 1.6 0.03 0.02 0.05
#> Thomas Partey 0.13 0.09 0.13 2.5 2.5 1.3 3.8 0.11 0.06 0.17
#> Kieran Tierney 0.19 0.05 0.19 0.7 0.7 1.9 2.6 0.03 0.09 0.12
#> Takehiro Tomiyasu 0.05 0.00 0.05 0.8 0.8 0.6 1.4 0.04 0.03 0.08
#> C\xe9dric Soares 0.12 0.06 0.12 0.5 0.5 1.4 2.0 0.03 0.09 0.12
#> npxG.1 npxG.xA.1 cluster
#> Aaron Ramsdale 0.00 0.00 4
#> Ben White 0.03 0.05 4
#> Thomas Partey 0.11 0.17 4
#> Kieran Tierney 0.03 0.12 4
#> Takehiro Tomiyasu 0.04 0.08 4
#> C\xe9dric Soares 0.03 0.12 4
Visualization
# Visualization in 2 dimensions
library(ggiraphExtra)
ggRadar(
  data = epl_clean,
  mapping = aes(colours = cluster),
  interactive = T
)
# Visualization for cluster profiling
# Make the model
library(ggiraphExtra)
epl_pca1 <- PCA(X = epl_clean, # data
                scale.unit = T, # scale the variables to unit variance
                quali.sup = 25, # column index of the cluster variable
                graph = F) # disable the graph
# Visualize the model
fviz_pca_biplot(epl_pca1,
                habillage = "cluster", # column used for coloring
                geom.ind = "point", # show the observation points only
                addEllipses = T, # draw an ellipse around each cluster
                col.var = "navy") # color of the variable arrows and text
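Besides the biplot, a quick way to judge how much a PCA compresses the data is the share of variance carried by each component. Here is a minimal sketch with base R's prcomp() on simulated data. I swap in prcomp() for FactoMineR's PCA() only to keep the example dependency-free, and the three columns are made up.

```r
set.seed(70)

# Made-up data: two correlated "attacking" columns and one unrelated column
goals   <- rnorm(100)
xg      <- goals + rnorm(100, sd = 0.3)
yellows <- rnorm(100)
toy <- data.frame(Gls = goals, xG = xg, CrdY = yellows)

# PCA on standardized variables (the base R counterpart of scale.unit = TRUE)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Loadings show which variables drive each component
print(round(pca$rotation, 3))
```

When many columns are strongly correlated, as the goal and expected-goal columns are in the player data, the first component tends to absorb most of them. Back to the biplot of the player data above.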
The biplot does not show us much, because we have so many variables.
Let's find the centroid of every cluster instead, then find which cluster
has the minimum and the maximum value for each variable. With that we can
do the cluster profiling.
# Find Centroid in every cluster
epl_centroid <- epl_clean %>%
  group_by(cluster) %>%
  summarise_all(mean)
epl_centroid
#> # A tibble: 4 × 25
#> cluster MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 25.9 22.9 22.4 1.71 1.88 1.68 0.0294 0.0294 5.21 1.06
#> 2 2 8.02 4.52 4.80 0.25 0.246 0.241 0.00893 0.0134 0.75 0.00893
#> 3 3 28.3 23.1 22.8 8.35 4.01 7.35 1 1.22 3.38 0.0694
#> 4 4 27.4 23.3 23.1 1.53 1.62 1.50 0.0338 0.0435 3.58 0
#> # … with 14 more variables: Gls.1 <dbl>, Ast.1 <dbl>, G.A <dbl>, G.PK.1 <dbl>,
#> # G.A.PK <dbl>, xG <dbl>, npxG <dbl>, xA <dbl>, npxG.xA <dbl>, xG.1 <dbl>,
#> # xA.1 <dbl>, xG.xA <dbl>, npxG.1 <dbl>, npxG.xA.1 <dbl>
# Grouping the centroid to min group and max group
epl_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(
    min_group = which.min(value),
    max_group = which.max(value))
#> # A tibble: 24 × 3
#> name min_group max_group
#> <chr> <int> <int>
#> 1 Ast 2 3
#> 2 Ast.1 2 3
#> 3 CrdR 4 1
#> 4 CrdY 2 1
#> 5 G.A 2 3
#> 6 G.A.PK 2 3
#> 7 G.PK 2 3
#> 8 G.PK.1 2 3
#> 9 Gls 2 3
#> 10 Gls.1 2 3
#> # … with 14 more rows
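The same min/max lookup can be sketched in base R on a few columns. The centroid values below are copied from the table above and limited to four statistics to keep it short. The object names stats and profile are just for this illustration.

```r
# Centroid values copied from the table above (four columns only)
centroid <- data.frame(
  cluster = 1:4,
  Gls  = c(1.71, 0.25, 8.35, 1.53),
  Ast  = c(1.88, 0.246, 4.01, 1.62),
  CrdY = c(5.21, 0.75, 3.38, 3.58),
  CrdR = c(1.06, 0.00893, 0.0694, 0)
)

stats <- centroid[, -1]

# For each statistic, the cluster with the lowest and highest centroid value
profile <- data.frame(
  name      = names(stats),
  min_group = centroid$cluster[sapply(stats, which.min)],
  max_group = centroid$cluster[sapply(stats, which.max)]
)
print(profile)
```

Note that which.min() and which.max() return row positions. They match the cluster labels here only because the centroid table is sorted by cluster, which is why the sketch maps the positions back through centroid$cluster.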
Profiling every cluster
Cluster 1
- Max in aspect : CrdR (Red Cards), CrdY (Yellow Cards)
- Min in aspect : npxG.xA.1 (Non-Penalty Expected Goals plus Expected Assists per 90 mins), xA.1 (Expected Assists per 90 mins), xG.1 (Expected Goals per 90 mins), xG.xA (Expected Goals plus Expected Assists per 90 mins)
- Label : Rough player
- Description : These are players who foul opponents easily. They may have a lot of defensive duties.
Cluster 2
- Max in aspect : None
- Min in aspect : Ast (Assists), Ast.1 (Assists per 90 mins), CrdY (Yellow Cards), G.A (Goals and Assists per 90 mins), G.A.PK (Goals plus Assists minus Penalty Kicks made per 90 mins), G.PK (Non-Penalty Goals), G.PK.1 (Goals minus Penalty Kicks made per 90 mins), Gls (Goals scored or allowed), Gls.1 (Goals scored per 90 mins), MP (Matches played), npxG (Non-Penalty Expected Goals), npxG.xA (Non-Penalty Expected Goals plus Expected Assists), PK (Penalty Kicks made), PKatt (Penalty Kicks attempted), Starts (Matches started), X90s (Minutes played divided by 90), xA (Expected Assists), xG (Expected Goals)
- Label : Bench warmer
- Description : These players make almost no contribution to the team. They tend to spend most of their time on the bench.
Cluster 3
- Max in aspect : Ast (Assists), Ast.1 (Assists per 90 mins), G.A (Goals and Assists per 90 mins), G.A.PK (Goals plus Assists minus Penalty Kicks made per 90 mins), G.PK (Non-Penalty Goals), G.PK.1 (Goals minus Penalty Kicks made per 90 mins), Gls (Goals scored or allowed), Gls.1 (Goals scored per 90 mins), MP (Matches played), npxG (Non-Penalty Expected Goals), npxG.1 (Non-Penalty Expected Goals per 90 mins), npxG.xA (Non-Penalty Expected Goals plus Expected Assists), npxG.xA.1 (Non-Penalty Expected Goals plus Expected Assists per 90 mins), PK (Penalty Kicks made), PKatt (Penalty Kicks attempted), xA (Expected Assists), xA.1 (Expected Assists per 90 mins), xG (Expected Goals), xG.1 (Expected Goals per 90 mins), xG.xA (Expected Goals plus Expected Assists per 90 mins)
- Min in aspect : None
- Label : Key Player
- Description : These players contribute a lot in attack. They do most of their work in the final third, and they are usually attacking midfielders or strikers.
Cluster 4
- Max in aspect : Starts (Matches started), X90s (Minutes played divided by 90)
- Min in aspect : CrdR (Red Cards), npxG.1 (Non-Penalty Expected Goals per 90 mins)
- Label : First Team Player
- Description : These players play most of the matches. Their contribution in attack may be small, but they probably contribute in other areas.
Conclusion
This analysis can help a manager or team staff analyze an opposing team. We can see which players are actually dangerous in front of goal, and which players play dirty; we may want to avoid duels with the latter because they could injure our players. The main purpose of unsupervised learning is to explore the data in the hope of finding interesting or new information. I would say that purpose was fulfilled here: we now know a lot more about the players in the English Premier League (2021-2022 season).