Unsupervised Learning : English Premier League 2021 - 2022

The English Premier League is one of the most famous football leagues in the world, if not the best. The matches are entertaining, fast, and physical, so it is no wonder that so many people watch them.
The league has many good players. What I am trying to do here is explore their statistics, in the hope of seeing which of them are actually good players. With this data we can get more insight by combining Principal Component Analysis (PCA) and clustering.

Import Library

library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(dplyr)
library(tidyr)

Import Data

I got the dataset from Kaggle. It contains the statistics of football players from the Premier League (2021-2022).

# Read data
epl <- read.csv('Football Players Stats (Premier League 2021-2022).csv')
glimpse(epl)

#> Rows: 691
#> Columns: 30
#> $ Player    <chr> "Bukayo Saka", "Gabriel Dos Santos", "Aaron Ramsdale", "Ben …
#> $ Team      <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arse…
#> $ Nation    <chr> "eng\xa0ENG", "br\xa0BRA", "eng\xa0ENG", "eng\xa0ENG", "no\x…
#> $ Pos       <chr> "FW,MF", "DF", "GK", "DF", "MF", "MF,DF", "MF", "DF", "MF,FW…
#> $ Age       <int> 19, 23, 23, 23, 22, 28, 28, 24, 21, 20, 30, 22, 29, 21, 21, …
#> $ MP        <int> 38, 35, 34, 32, 36, 27, 24, 22, 33, 29, 30, 21, 21, 22, 19, …
#> $ Starts    <int> 36, 35, 34, 32, 32, 27, 23, 22, 21, 21, 20, 20, 16, 13, 12, …
#> $ Min       <chr> "2,978", "3,063", "3,060", "2,880", "2,785", "2,327", "2,028…
#> $ X90s      <dbl> 33.1, 34.0, 34.0, 32.0, 30.9, 25.9, 22.5, 21.3, 21.3, 20.7, …
#> $ Gls       <int> 11, 5, 0, 0, 7, 1, 2, 1, 10, 6, 4, 0, 1, 1, 0, 4, 1, 5, 0, 1…
#> $ Ast       <int> 7, 0, 0, 0, 4, 2, 1, 3, 2, 6, 7, 1, 1, 1, 0, 1, 0, 1, 2, 2, …
#> $ G.PK      <int> 9, 5, 0, 0, 7, 1, 2, 1, 10, 5, 2, 0, 1, 1, 0, 4, 1, 5, 0, 1,…
#> $ PK        <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ PKatt     <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 2, 0, 0, 0, 0, …
#> $ CrdY      <int> 6, 8, 1, 3, 4, 10, 6, 0, 1, 3, 0, 2, 3, 2, 5, 3, 4, 3, 1, 0,…
#> $ CrdR      <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
#> $ Gls.1     <dbl> 0.33, 0.15, 0.00, 0.00, 0.23, 0.04, 0.09, 0.05, 0.47, 0.29, …
#> $ Ast.1     <dbl> 0.21, 0.00, 0.00, 0.00, 0.13, 0.08, 0.04, 0.14, 0.09, 0.29, …
#> $ G.A       <dbl> 0.54, 0.15, 0.00, 0.00, 0.36, 0.12, 0.13, 0.19, 0.56, 0.58, …
#> $ G.PK.1    <dbl> 0.27, 0.15, 0.00, 0.00, 0.23, 0.04, 0.09, 0.05, 0.47, 0.24, …
#> $ G.A.PK    <dbl> 0.48, 0.15, 0.00, 0.00, 0.36, 0.12, 0.13, 0.19, 0.56, 0.53, …
#> $ xG        <dbl> 9.7, 2.7, 0.0, 1.0, 4.8, 1.2, 2.5, 0.7, 5.8, 7.2, 7.9, 0.8, …
#> $ npxG      <dbl> 8.2, 2.7, 0.0, 1.0, 4.8, 1.2, 2.5, 0.7, 5.8, 6.5, 5.6, 0.8, …
#> $ xA        <dbl> 6.9, 0.8, 0.0, 0.6, 6.8, 2.3, 1.3, 1.9, 2.2, 3.3, 1.9, 0.6, …
#> $ npxG.xA   <dbl> 15.2, 3.5, 0.0, 1.6, 11.6, 3.5, 3.8, 2.6, 8.0, 9.8, 7.6, 1.4…
#> $ xG.1      <dbl> 0.29, 0.08, 0.00, 0.03, 0.16, 0.05, 0.11, 0.03, 0.27, 0.35, …
#> $ xA.1      <dbl> 0.21, 0.02, 0.00, 0.02, 0.22, 0.09, 0.06, 0.09, 0.10, 0.16, …
#> $ xG.xA     <dbl> 0.50, 0.10, 0.00, 0.05, 0.38, 0.14, 0.17, 0.12, 0.37, 0.51, …
#> $ npxG.1    <dbl> 0.25, 0.08, 0.00, 0.03, 0.16, 0.05, 0.11, 0.03, 0.27, 0.31, …
#> $ npxG.xA.1 <dbl> 0.46, 0.10, 0.00, 0.05, 0.38, 0.14, 0.17, 0.12, 0.37, 0.47, …
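The glimpse() output shows two things worth handling if these columns are needed later: Min is stored as text with a thousands separator ("2,978"), and Nation contains raw bytes such as \xa0. Below is a minimal sketch on values copied from the output above. The Latin-1 encoding is my assumption (the source file does not state its encoding), and min_chr, nation_raw, and the other names are illustrative only.

```r
# Toy values copied from the glimpse() output above
min_chr <- c("2,978", "3,063", "3,060", "2,880")

# Remove the thousands separator, then convert the text to numbers
min_num <- as.numeric(gsub(",", "", min_chr, fixed = TRUE))
min_num

# Assumption: the raw byte \xa0 is a Latin-1 non-breaking space.
# iconv() re-encodes the string to UTF-8 so it prints cleanly.
nation_raw <- "eng\xa0ENG"
nation_utf8 <- iconv(nation_raw, from = "latin1", to = "UTF-8")

# Keep only the three-letter country code at the end of the string
nation_code <- substring(nation_utf8, nchar(nation_utf8) - 2)
nation_code
```

If the encoding assumption holds, passing fileEncoding = "latin1" to read.csv() would probably fix the player names at import time as well, but I have not verified that against this file.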

About the data (R renames duplicated or non-syntactic column names on import, so the name used in the code is shown in parentheses where it differs) :

Player : Player’s name
Team : Club played for in 2021-2022
Nation : Player’s nation
Pos : Position
Age : Player’s age
MP : Matches played
Starts : Matches started
Min : Minutes played
90s : Minutes played divided by 90 (X90s)
Gls : Goals scored or allowed
Ast : Assists
G-PK : Non-Penalty Goals (G.PK)
PK : Penalty Kicks made
PKatt : Penalty Kicks attempted
CrdY : Yellow Cards
CrdR : Red Cards
Gls : Goals scored per 90 mins (Gls.1)
Ast : Assists per 90 mins (Ast.1)
G+A : Goals and Assists per 90 mins (G.A)
G-PK : Goals minus Penalty Kicks made per 90 mins (G.PK.1)
G+A-PK : Goals plus Assists minus Penalty Kicks made per 90 mins (G.A.PK)
xG : Expected Goals
npxG : Non-Penalty Expected Goals
xA : Expected Assists
npxG+xA : Non-Penalty Expected Goals plus Expected Assists (npxG.xA)
xG : Expected Goals per 90 mins (xG.1)
npxG : Non-Penalty Expected Goals per 90 mins (npxG.1)
xA : Expected Assists per 90 mins (xA.1)
xG+xA : Expected Goals plus Expected Assists per 90 mins (xG.xA)
npxG+xA : Non-Penalty Expected Goals plus Expected Assists per 90 mins (npxG.xA.1)

This data is actually good, and there are many things we can do with it. In this article, however, I am only going to explore the players’ statistics with unsupervised learning. Unsupervised learning refers to the use of artificial intelligence (AI) algorithms to identify patterns in data sets containing data points that are neither classified nor labeled. In this case we use machine learning, with the help of R in RStudio.

Data Wrangling

Check for missing values

is.na(epl) %>% colSums()

#>    Player      Team    Nation       Pos       Age        MP    Starts       Min 
#>         0         0         0         0         4         0         0         0 
#>      X90s       Gls       Ast      G.PK        PK     PKatt      CrdY      CrdR 
#>       144       144       144       144       144       144       144       144 
#>     Gls.1     Ast.1       G.A    G.PK.1    G.A.PK        xG      npxG        xA 
#>       145       145       145       145       145       145       145       145 
#>   npxG.xA      xG.1      xA.1     xG.xA    npxG.1 npxG.xA.1 
#>       145       145       145       145       145       145

The data has missing values. We can drop the rows that contain them.

epl_drop <- epl %>% drop_na()
anyNA(epl_drop)

#> [1] FALSE
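Dropping rows is simple, but it is worth checking how many rows it costs before moving on. The sketch below repeats the idea on a small made-up data frame using only base R. complete.cases() is the base counterpart of tidyr's drop_na(), and toy and the player names are hypothetical.

```r
# Toy data frame with missing values in two columns (made-up players)
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  Age    = c(19, NA, 23, 28),
  Gls    = c(11, 5, NA, 1)
)

# Count missing values per column, same idea as is.na(epl) %>% colSums()
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only the rows without any missing value
toy_complete <- toy[complete.cases(toy), ]

# How many rows were lost by dropping?
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
```

If a large share of the rows disappears, as it does here, it may be worth looking at which players are affected before discarding them.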

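We are only going to use the numeric performance columns, with the player names as row names. Age is dropped as well, and Min is dropped because it is stored as text and X90s already carries the same information.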

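# Keep the first row of each player so the names can be used as row names
epl_unique <- epl_drop[!duplicated(epl_drop$Player), ]

# Assign the player column as rownames
rownames(epl_unique) <- epl_unique$Player

# Drop the identifier and text columns (and Age), keep the performance statistics
epl_clean <- epl_unique %>% 
  select(-c(Player, Age, Team, Nation, Pos, Min)) 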
head(epl_clean)

#>                    MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Bukayo Saka        38     36 33.1  11   7    9  2     2    6    0  0.33  0.21
#> Gabriel Dos Santos 35     35 34.0   5   0    5  0     0    8    1  0.15  0.00
#> Aaron Ramsdale     34     34 34.0   0   0    0  0     0    1    0  0.00  0.00
#> Ben White          32     32 32.0   0   0    0  0     0    3    0  0.00  0.00
#> Martin \xd8degaard 36     32 30.9   7   4    7  0     0    4    0  0.23  0.13
#> Granit Xhaka       27     27 25.9   1   2    1  0     0   10    1  0.04  0.08
#>                     G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1 xA.1 xG.xA
#> Bukayo Saka        0.54   0.27   0.48 9.7  8.2 6.9    15.2 0.29 0.21  0.50
#> Gabriel Dos Santos 0.15   0.15   0.15 2.7  2.7 0.8     3.5 0.08 0.02  0.10
#> Aaron Ramsdale     0.00   0.00   0.00 0.0  0.0 0.0     0.0 0.00 0.00  0.00
#> Ben White          0.00   0.00   0.00 1.0  1.0 0.6     1.6 0.03 0.02  0.05
#> Martin \xd8degaard 0.36   0.23   0.36 4.8  4.8 6.8    11.6 0.16 0.22  0.38
#> Granit Xhaka       0.12   0.04   0.12 1.2  1.2 2.3     3.5 0.05 0.09  0.14
#>                    npxG.1 npxG.xA.1
#> Bukayo Saka          0.25      0.46
#> Gabriel Dos Santos   0.08      0.10
#> Aaron Ramsdale       0.00      0.00
#> Ben White            0.03      0.05
#> Martin \xd8degaard   0.16      0.38
#> Granit Xhaka         0.05      0.14
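Row names must be unique, which is why the duplicated players are removed first. It helps to know exactly what !duplicated() keeps, so here is a small sketch on made-up rows. The player and club names are hypothetical, and I am only assuming that repeated names in the real data come from players listed under more than one club.

```r
# Made-up example: "Player A" appears twice with different clubs
toy <- data.frame(
  Player = c("Player A", "Player B", "Player A"),
  Team   = c("Club X", "Club Y", "Club Z"),
  Gls    = c(3, 1, 2)
)

# duplicated() flags the second and later occurrences of a value
dup_flag <- duplicated(toy$Player)
dup_flag

# Negating the flag keeps only the first row of each player
toy_unique <- toy[!dup_flag, ]

# Now the names are unique and can be used as row names
rownames(toy_unique) <- toy_unique$Player
toy_unique
```

Note that the later row is discarded, not merged, so the statistics in that row are lost. Summing the rows per player would be an alternative if that matters for the analysis.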

Exploratory Data Analysis and Scaling

Make sure the data only has numeric values, because we will use k-means, which computes the Euclidean distance between observations.

str(epl_clean)

#> 'data.frame':    537 obs. of  24 variables:
#>  $ MP       : int  38 35 34 32 36 27 24 22 33 29 ...
#>  $ Starts   : int  36 35 34 32 32 27 23 22 21 21 ...
#>  $ X90s     : num  33.1 34 34 32 30.9 25.9 22.5 21.3 21.3 20.7 ...
#>  $ Gls      : int  11 5 0 0 7 1 2 1 10 6 ...
#>  $ Ast      : int  7 0 0 0 4 2 1 3 2 6 ...
#>  $ G.PK     : int  9 5 0 0 7 1 2 1 10 5 ...
#>  $ PK       : int  2 0 0 0 0 0 0 0 0 1 ...
#>  $ PKatt    : int  2 0 0 0 0 0 0 0 0 1 ...
#>  $ CrdY     : int  6 8 1 3 4 10 6 0 1 3 ...
#>  $ CrdR     : int  0 1 0 0 0 1 0 0 0 1 ...
#>  $ Gls.1    : num  0.33 0.15 0 0 0.23 0.04 0.09 0.05 0.47 0.29 ...
#>  $ Ast.1    : num  0.21 0 0 0 0.13 0.08 0.04 0.14 0.09 0.29 ...
#>  $ G.A      : num  0.54 0.15 0 0 0.36 0.12 0.13 0.19 0.56 0.58 ...
#>  $ G.PK.1   : num  0.27 0.15 0 0 0.23 0.04 0.09 0.05 0.47 0.24 ...
#>  $ G.A.PK   : num  0.48 0.15 0 0 0.36 0.12 0.13 0.19 0.56 0.53 ...
#>  $ xG       : num  9.7 2.7 0 1 4.8 1.2 2.5 0.7 5.8 7.2 ...
#>  $ npxG     : num  8.2 2.7 0 1 4.8 1.2 2.5 0.7 5.8 6.5 ...
#>  $ xA       : num  6.9 0.8 0 0.6 6.8 2.3 1.3 1.9 2.2 3.3 ...
#>  $ npxG.xA  : num  15.2 3.5 0 1.6 11.6 3.5 3.8 2.6 8 9.8 ...
#>  $ xG.1     : num  0.29 0.08 0 0.03 0.16 0.05 0.11 0.03 0.27 0.35 ...
#>  $ xA.1     : num  0.21 0.02 0 0.02 0.22 0.09 0.06 0.09 0.1 0.16 ...
#>  $ xG.xA    : num  0.5 0.1 0 0.05 0.38 0.14 0.17 0.12 0.37 0.51 ...
#>  $ npxG.1   : num  0.25 0.08 0 0.03 0.16 0.05 0.11 0.03 0.27 0.31 ...
#>  $ npxG.xA.1: num  0.46 0.1 0 0.05 0.38 0.14 0.17 0.12 0.37 0.47 ...

Check the range of each variable with summary()

summary(epl_clean)

#>        MP            Starts           X90s            Gls        
#>  Min.   : 1.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
#>  1st Qu.: 9.00   1st Qu.: 4.00   1st Qu.: 4.50   1st Qu.: 0.000  
#>  Median :20.00   Median :15.00   Median :14.90   Median : 1.000  
#>  Mean   :19.35   Mean   :15.42   Mean   :15.39   Mean   : 1.922  
#>  3rd Qu.:30.00   3rd Qu.:25.00   3rd Qu.:24.10   3rd Qu.: 2.000  
#>  Max.   :38.00   Max.   :38.00   Max.   :38.00   Max.   :23.000  
#>       Ast              G.PK              PK             PKatt       
#>  Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median : 1.000   Median : 1.000   Median :0.0000   Median :0.0000  
#>  Mean   : 1.385   Mean   : 1.769   Mean   :0.1527   Mean   :0.1881  
#>  3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.:0.0000  
#>  Max.   :13.000   Max.   :23.000   Max.   :6.0000   Max.   :7.0000  
#>       CrdY             CrdR             Gls.1            Ast.1        
#>  Min.   : 0.000   Min.   :0.00000   Min.   :0.0000   Min.   : 0.0000  
#>  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 0.0000  
#>  Median : 2.000   Median :0.00000   Median :0.0300   Median : 0.0300  
#>  Mean   : 2.475   Mean   :0.08007   Mean   :0.1093   Mean   : 0.1018  
#>  3rd Qu.: 4.000   3rd Qu.:0.00000   3rd Qu.:0.1500   3rd Qu.: 0.1200  
#>  Max.   :11.000   Max.   :2.00000   Max.   :2.0300   Max.   :11.2500  
#>       G.A              G.PK.1           G.A.PK              xG        
#>  Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.000  
#>  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.100  
#>  Median : 0.1000   Median :0.0300   Median : 0.1000   Median : 0.800  
#>  Mean   : 0.2111   Mean   :0.1024   Mean   : 0.2042   Mean   : 1.949  
#>  3rd Qu.: 0.2900   3rd Qu.:0.1400   3rd Qu.: 0.2800   3rd Qu.: 2.500  
#>  Max.   :11.2500   Max.   :2.0300   Max.   :11.2500   Max.   :21.800  
#>       npxG              xA            npxG.xA           xG.1       
#>  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   :0.0000  
#>  1st Qu.: 0.100   1st Qu.: 0.100   1st Qu.: 0.30   1st Qu.:0.0100  
#>  Median : 0.800   Median : 0.700   Median : 1.60   Median :0.0600  
#>  Mean   : 1.805   Mean   : 1.312   Mean   : 3.12   Mean   :0.1366  
#>  3rd Qu.: 2.400   3rd Qu.: 2.000   3rd Qu.: 4.40   3rd Qu.:0.1700  
#>  Max.   :17.100   Max.   :11.200   Max.   :27.40   Max.   :4.4800  
#>       xA.1             xG.xA            npxG.1         npxG.xA.1     
#>  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.01000   1st Qu.:0.0500   1st Qu.:0.0100   1st Qu.:0.0500  
#>  Median :0.06000   Median :0.1300   Median :0.0600   Median :0.1300  
#>  Mean   :0.09227   Mean   :0.2291   Mean   :0.1297   Mean   :0.2222  
#>  3rd Qu.:0.12000   3rd Qu.:0.3300   3rd Qu.:0.1600   3rd Qu.:0.3200  
#>  Max.   :6.50000   Max.   :6.5000   Max.   :4.4800   Max.   :6.5000

These variables are still on different scales, so we standardize them with scale().

epl_scale <- scale(epl_clean)
head(epl_scale)

#>                           MP    Starts      X90s        Gls        Ast
#> Bukayo Saka        1.6025461 1.7669405 1.5691333  2.7691376  2.7408214
#> Gabriel Dos Santos 1.3447884 1.6810797 1.6488920  0.9389506 -0.6763420
#> Aaron Ramsdale     1.2588691 1.5952190 1.6488920 -0.5862051 -0.6763420
#> Ben White          1.0870306 1.4234975 1.4716504 -0.5862051 -0.6763420
#> Martin \xd8degaard 1.4307076 1.4234975 1.3741674  1.5490130  1.2763228
#> Granit Xhaka       0.6574343 0.9941938 0.9310633 -0.2811740  0.2999904
#>                          G.PK         PK      PKatt       CrdY      CrdR
#> Bukayo Saka         2.4477935  2.7186354  2.2884894  1.3697775 -0.280898
#> Gabriel Dos Santos  1.0937218 -0.2247259 -0.2375513  2.1469254  3.227060
#> Aaron Ramsdale     -0.5988678 -0.2247259 -0.2375513 -0.5730923 -0.280898
#> Ben White          -0.5988678 -0.2247259 -0.2375513  0.2040556 -0.280898
#> Martin \xd8degaard  1.7707577 -0.2247259 -0.2375513  0.5926295 -0.280898
#> Granit Xhaka       -0.2603499 -0.2247259 -0.2375513  2.9240733  3.227060
#>                         Gls.1       Ast.1        G.A     G.PK.1     G.A.PK
#> Bukayo Saka         1.1867295  0.21587135  0.6085622  0.9363269  0.5132679
#> Gabriel Dos Santos  0.2187188 -0.20327142 -0.1130911  0.2659169 -0.1008032
#> Aaron Ramsdale     -0.5879569 -0.20327142 -0.3906500 -0.5720958 -0.3799264
#> Ben White          -0.5879569 -0.20327142 -0.3906500 -0.5720958 -0.3799264
#> Martin \xd8degaard  0.6489458  0.05619791  0.2754915  0.7128569  0.2899693
#> Granit Xhaka       -0.3728434 -0.04359798 -0.1686029 -0.3486257 -0.1566278
#>                            xG       npxG         xA     npxG.xA       xG.1
#> Bukayo Saka         2.5824284  2.3767810  3.2461211  2.96090107  0.5919711
#> Gabriel Dos Santos  0.2503327  0.3325099 -0.2973736  0.09315861 -0.2186735
#> Aaron Ramsdale     -0.6491899 -0.6710414 -0.7620942 -0.76471306 -0.5274904
#> Ben White          -0.3160334 -0.2993558 -0.4135537 -0.37254316 -0.4116841
#> Martin \xd8degaard  0.9499614  1.1130497  3.1880310  2.07851877  0.0901435
#> Granit Xhaka       -0.2494021 -0.2250186  0.5739776  0.09315861 -0.3344798
#>                            xA.1      xG.xA     npxG.1  npxG.xA.1
#> Bukayo Saka         0.396273633  0.6683443  0.4725364  0.5920731
#> Gabriel Dos Santos -0.243267632 -0.3184604 -0.1954390 -0.3041728
#> Aaron Ramsdale     -0.310587765 -0.5651616 -0.5097803 -0.5531300
#> Ben White          -0.243267632 -0.4418110 -0.3919023 -0.4286514
#> Martin \xd8degaard  0.429933700  0.3723029  0.1189024  0.3929073
#> Granit Xhaka       -0.007647166 -0.2197800 -0.3133170 -0.2045900

Data Pre-processing

Unsupervised Learning : Clustering

Find the optimal k

library(factoextra)

fviz_nbclust(
  x = epl_scale,
  FUNcluster = kmeans,
  method = "wss"
)

From the plot, we can see that the optimal k is 4, because that is where the curve bends (the elbow method). So we can divide the EPL’s players into 4 clusters.
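The curve used for the elbow method is the total within-cluster sum of squares for each candidate k. The sketch below computes those numbers by hand on simulated data, as a rough illustration of the quantity behind the "wss" plot. The toy data has three groups by construction and is not the EPL data, and all names are illustrative.

```r
set.seed(70)

# Toy data: three well-separated groups in two dimensions (not the EPL data)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The elbow is where the drop from one k to the next becomes small
data.frame(k = k_values, wss = round(wss, 2))
```

Calling plot(k_values, wss, type = "b") draws the same kind of curve as the elbow plot. On real data the bend is usually less sharp than in this toy case, so choosing k remains partly a judgment call.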

# k-means clustering
RNGkind(sample.kind = "Rounding")
set.seed(70)

epl_cluster <- kmeans(epl_scale, centers = 4)

# calculate the size for every cluster
epl_cluster$size

#> [1]  34 224  72 207

The second cluster is the biggest and the first cluster is the smallest. My guess is that the clusters will separate the players into best, good, average, and bad.
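k-means starts from randomly chosen centers, so the cluster sizes and numbering can change with the seed. kmeans() has an nstart argument that tries several random starts and keeps the best one. The sketch below compares a single start with 25 starts on simulated data. This is a side note on stability, not what the analysis above does, and the toy data and names are illustrative.

```r
set.seed(70)

# Toy data: four loose groups in two dimensions (not the EPL data)
group_means <- c(0, 2, 4, 6)
toy <- do.call(rbind, lapply(group_means, function(m) {
  matrix(rnorm(50, mean = m, sd = 0.6), ncol = 2)
}))

# One random start, as in the analysis above
single_start <- kmeans(toy, centers = 4)

# 25 random starts; kmeans() keeps the solution with the lowest tot.withinss
multi_start <- kmeans(toy, centers = 4, nstart = 25)

# Compare the quality of the two solutions and look at the cluster sizes
c(single = single_start$tot.withinss, multi = multi_start$tot.withinss)
multi_start$size
```

A lower tot.withinss means tighter clusters. If the two numbers differ noticeably on the real data, the single-start result may be a local optimum.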

# calculate the center for every cluster
epl_cluster$centers

#>           MP     Starts       X90s         Gls        Ast        G.PK
#> 1  0.5588799  0.6381243  0.6198478 -0.06585786  0.2425591 -0.03135249
#> 2 -0.9734973 -0.9361396 -0.9387571 -0.50994733 -0.5564798 -0.51726085
#> 3  0.7708000  0.6555209  0.6538770  1.95995770  1.2831029  1.88829862
#> 4  0.6935453  0.6802007  0.6866068 -0.11908012  0.1160437 -0.09190863
#>           PK      PKatt       CrdY        CrdR      Gls.1        Ast.1
#> 1 -0.1814412 -0.2004037  1.0612040  3.43341086 -0.1419127  0.003951826
#> 2 -0.2115859 -0.2206359 -0.6702358 -0.24957691 -0.3207456 -0.110514697
#> 3  1.2469548  1.3061402  0.3497708 -0.03728975  1.5669127  0.439857695
#> 4 -0.1749589 -0.1826374  0.4293159 -0.28089797 -0.1746168 -0.034052290
#>           G.A     G.PK.1      G.A.PK          xG        npxG         xA
#> 1 -0.04506191 -0.1152967 -0.03403251 -0.06714588 -0.02715068 -0.0820986
#> 2 -0.21279751 -0.3069745 -0.20530822 -0.53407245 -0.54642717 -0.6013092
#> 3  0.94857198  1.4507736  0.89447785  1.96146166  1.89617077  1.4106360
#> 4 -0.09226297 -0.1534943 -0.08336357 -0.09328527 -0.06377529  0.1735208
#>       npxG.xA       xG.1       xA.1      xG.xA     npxG.1  npxG.xA.1
#> 1 -0.05534691 -0.1982371 -0.1096174 -0.2081705 -0.1769483 -0.1943388
#> 2 -0.61403841 -0.1471561 -0.1065236 -0.1727526 -0.1329917 -0.1635787
#> 3  1.84498704  1.1077383  0.5056688  1.0795129  1.0062618  1.0128799
#> 4  0.03182286 -0.1934982 -0.0426081 -0.1543505 -0.1770264 -0.1433730
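The centers above are in scaled units (standard deviations from the mean), which makes them hard to read as football numbers. scale() stores the column means and standard deviations as attributes, so the centers can be converted back to the original units. Here is a minimal sketch on made-up data, where toy, fit, and the other names are illustrative.

```r
# Made-up data: three regular starters and three players with few matches
toy <- data.frame(
  MP  = c(38, 35, 34, 4, 8, 2),
  Gls = c(11, 5, 7, 0, 1, 0)
)
toy_scaled <- scale(toy)

set.seed(70)
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# scale() keeps the column means and standard deviations as attributes
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")

# Undo the scaling: multiply by the sd, then add the mean, column by column
centers_raw <- sweep(sweep(fit$centers, 2, sigma, "*"), 2, mu, "+")
centers_raw

# The result equals the per-cluster means of the raw data
group_means <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
group_means
```

This is why grouping the unscaled data by cluster and taking the mean gives the same profile as the k-means centers, only in readable units.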

Label every observation with its cluster

as.data.frame(epl_cluster$cluster) %>% head()

#>                    epl_cluster$cluster
#> Bukayo Saka                          3
#> Gabriel Dos Santos                   1
#> Aaron Ramsdale                       4
#> Ben White                            4
#> Martin \xd8degaard                   3
#> Granit Xhaka                         1

Cluster Profiling

Make a new column with the cluster label

epl_clean$cluster <- epl_cluster$cluster

Check the first 6 rows of the data

head(epl_clean)

#>                    MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Bukayo Saka        38     36 33.1  11   7    9  2     2    6    0  0.33  0.21
#> Gabriel Dos Santos 35     35 34.0   5   0    5  0     0    8    1  0.15  0.00
#> Aaron Ramsdale     34     34 34.0   0   0    0  0     0    1    0  0.00  0.00
#> Ben White          32     32 32.0   0   0    0  0     0    3    0  0.00  0.00
#> Martin \xd8degaard 36     32 30.9   7   4    7  0     0    4    0  0.23  0.13
#> Granit Xhaka       27     27 25.9   1   2    1  0     0   10    1  0.04  0.08
#>                     G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1 xA.1 xG.xA
#> Bukayo Saka        0.54   0.27   0.48 9.7  8.2 6.9    15.2 0.29 0.21  0.50
#> Gabriel Dos Santos 0.15   0.15   0.15 2.7  2.7 0.8     3.5 0.08 0.02  0.10
#> Aaron Ramsdale     0.00   0.00   0.00 0.0  0.0 0.0     0.0 0.00 0.00  0.00
#> Ben White          0.00   0.00   0.00 1.0  1.0 0.6     1.6 0.03 0.02  0.05
#> Martin \xd8degaard 0.36   0.23   0.36 4.8  4.8 6.8    11.6 0.16 0.22  0.38
#> Granit Xhaka       0.12   0.04   0.12 1.2  1.2 2.3     3.5 0.05 0.09  0.14
#>                    npxG.1 npxG.xA.1 cluster
#> Bukayo Saka          0.25      0.46       3
#> Gabriel Dos Santos   0.08      0.10       1
#> Aaron Ramsdale       0.00      0.00       4
#> Ben White            0.03      0.05       4
#> Martin \xd8degaard   0.16      0.38       3
#> Granit Xhaka         0.05      0.14       1

Grouping data based on cluster label

Grouping by cluster label lets us learn the characteristics of each cluster.

epl_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

#> # A tibble: 4 × 25
#>   cluster    MP Starts  X90s   Gls   Ast  G.PK      PK  PKatt  CrdY    CrdR
#>     <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>   <dbl>
#> 1       1 25.9   22.9  22.4   1.71 1.88  1.68  0.0294  0.0294  5.21 1.06   
#> 2       2  8.02   4.52  4.80  0.25 0.246 0.241 0.00893 0.0134  0.75 0.00893
#> 3       3 28.3   23.1  22.8   8.35 4.01  7.35  1       1.22    3.38 0.0694 
#> 4       4 27.4   23.3  23.1   1.53 1.62  1.50  0.0338  0.0435  3.58 0      
#> # … with 14 more variables: Gls.1 <dbl>, Ast.1 <dbl>, G.A <dbl>, G.PK.1 <dbl>,
#> #   G.A.PK <dbl>, xG <dbl>, npxG <dbl>, xA <dbl>, npxG.xA <dbl>, xG.1 <dbl>,
#> #   xA.1 <dbl>, xG.xA <dbl>, npxG.1 <dbl>, npxG.xA.1 <dbl>

Filtering data based on cluster label

# Show some of players from cluster 1
epl_clean[epl_clean$cluster==1,] %>% head()

#>                    MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Gabriel Dos Santos 35     35 34.0   5   0    5  0     0    8    1  0.15  0.00
#> Granit Xhaka       27     27 25.9   1   2    1  0     0   10    1  0.04  0.08
#> Rob Holding        15      9  9.4   1   0    1  0     0    4    1  0.11  0.00
#> Ezri Konsa         29     29 27.5   2   0    2  0     0    6    2  0.07  0.00
#> Sergi Can\xf3s     31     25 23.1   3   2    3  0     0    8    1  0.13  0.09
#> Shandon Baptiste   22      9 10.1   1   1    1  0     0    2    1  0.10  0.10
#>                     G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1 xA.1 xG.xA
#> Gabriel Dos Santos 0.15   0.15   0.15 2.7  2.7 0.8     3.5 0.08 0.02  0.10
#> Granit Xhaka       0.12   0.04   0.12 1.2  1.2 2.3     3.5 0.05 0.09  0.14
#> Rob Holding        0.11   0.11   0.11 0.3  0.3 0.0     0.3 0.03 0.00  0.03
#> Ezri Konsa         0.07   0.07   0.07 1.5  1.5 0.3     1.7 0.05 0.01  0.06
#> Sergi Can\xf3s     0.22   0.13   0.22 3.0  3.0 1.9     4.9 0.13 0.08  0.21
#> Shandon Baptiste   0.20   0.10   0.20 0.7  0.7 0.5     1.2 0.07 0.05  0.12
#>                    npxG.1 npxG.xA.1 cluster
#> Gabriel Dos Santos   0.08      0.10       1
#> Granit Xhaka         0.05      0.14       1
#> Rob Holding          0.03      0.03       1
#> Ezri Konsa           0.05      0.06       1
#> Sergi Can\xf3s       0.13      0.21       1
#> Shandon Baptiste     0.07      0.12       1

# Show some of players from cluster 2
epl_clean[epl_clean$cluster==2,] %>% head()

#>                        MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1
#> Albert Sambi Lokonga   19     12 12.6   0   0    0  0     0    5    0  0.00
#> Mohamed Elneny         14      8  9.0   0   2    0  0     0    1    0  0.00
#> Nicolas P\xe9p\xe9     20      5  7.7   1   2    1  0     0    0    0  0.13
#> Bernd Leno              4      4  4.0   0   0    0  0     0    0    0  0.00
#> Ainsley Maitland-Niles  8      2  3.0   0   0    0  0     0    0    0  0.00
#> Pablo Mar\xed           2      2  2.0   0   0    0  0     0    1    0  0.00
#>                        Ast.1  G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1 xA.1
#> Albert Sambi Lokonga    0.00 0.00   0.00   0.00 0.6  0.6 0.8     1.4 0.05 0.06
#> Mohamed Elneny          0.22 0.22   0.00   0.22 0.2  0.2 0.6     0.7 0.02 0.06
#> Nicolas P\xe9p\xe9      0.26 0.39   0.13   0.39 2.9  2.9 1.7     4.6 0.38 0.22
#> Bernd Leno              0.00 0.00   0.00   0.00 0.0  0.0 0.0     0.0 0.00 0.00
#> Ainsley Maitland-Niles  0.00 0.00   0.00   0.00 0.0  0.0 0.3     0.4 0.01 0.12
#> Pablo Mar\xed           0.00 0.00   0.00   0.00 0.3  0.3 0.1     0.4 0.15 0.05
#>                        xG.xA npxG.1 npxG.xA.1 cluster
#> Albert Sambi Lokonga    0.11   0.05      0.11       2
#> Mohamed Elneny          0.08   0.02      0.08       2
#> Nicolas P\xe9p\xe9      0.60   0.38      0.60       2
#> Bernd Leno              0.00   0.00      0.00       2
#> Ainsley Maitland-Niles  0.13   0.01      0.13       2
#> Pablo Mar\xed           0.21   0.15      0.21       2

# Show some of players from cluster 3
epl_clean[epl_clean$cluster==3,] %>% head()

#>                           MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1
#> Bukayo Saka               38     36 33.1  11   7    9  2     2    6    0  0.33
#> Martin \xd8degaard        36     32 30.9   7   4    7  0     0    4    0  0.23
#> Emile Smith Rowe          33     21 21.3  10   2   10  0     0    1    0  0.47
#> Martinelli                29     21 20.7   6   6    5  1     1    3    1  0.29
#> Alexandre Lacazette       30     20 19.8   4   7    2  2     3    0    0  0.20
#> Pierre-Emerick Aubameyang 14     12 11.5   4   1    4  0     2    3    0  0.35
#>                           Ast.1  G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1
#> Bukayo Saka                0.21 0.54   0.27   0.48 9.7  8.2 6.9    15.2 0.29
#> Martin \xd8degaard         0.13 0.36   0.23   0.36 4.8  4.8 6.8    11.6 0.16
#> Emile Smith Rowe           0.09 0.56   0.47   0.56 5.8  5.8 2.2     8.0 0.27
#> Martinelli                 0.29 0.58   0.24   0.53 7.2  6.5 3.3     9.8 0.35
#> Alexandre Lacazette        0.35 0.56   0.10   0.45 7.9  5.6 1.9     7.6 0.40
#> Pierre-Emerick Aubameyang  0.09 0.43   0.35   0.43 5.8  4.1 0.8     5.0 0.50
#>                           xA.1 xG.xA npxG.1 npxG.xA.1 cluster
#> Bukayo Saka               0.21  0.50   0.25      0.46       3
#> Martin \xd8degaard        0.22  0.38   0.16      0.38       3
#> Emile Smith Rowe          0.10  0.37   0.27      0.37       3
#> Martinelli                0.16  0.51   0.31      0.47       3
#> Alexandre Lacazette       0.10  0.50   0.28      0.38       3
#> Pierre-Emerick Aubameyang 0.07  0.57   0.36      0.43       3

# Show some of players from cluster 4
epl_clean[epl_clean$cluster==4,] %>% head()

#>                   MP Starts X90s Gls Ast G.PK PK PKatt CrdY CrdR Gls.1 Ast.1
#> Aaron Ramsdale    34     34 34.0   0   0    0  0     0    1    0  0.00  0.00
#> Ben White         32     32 32.0   0   0    0  0     0    3    0  0.00  0.00
#> Thomas Partey     24     23 22.5   2   1    2  0     0    6    0  0.09  0.04
#> Kieran Tierney    22     22 21.3   1   3    1  0     0    0    0  0.05  0.14
#> Takehiro Tomiyasu 21     20 18.7   0   1    0  0     0    2    0  0.00  0.05
#> C\xe9dric Soares  21     16 16.5   1   1    1  0     0    3    0  0.06  0.06
#>                    G.A G.PK.1 G.A.PK  xG npxG  xA npxG.xA xG.1 xA.1 xG.xA
#> Aaron Ramsdale    0.00   0.00   0.00 0.0  0.0 0.0     0.0 0.00 0.00  0.00
#> Ben White         0.00   0.00   0.00 1.0  1.0 0.6     1.6 0.03 0.02  0.05
#> Thomas Partey     0.13   0.09   0.13 2.5  2.5 1.3     3.8 0.11 0.06  0.17
#> Kieran Tierney    0.19   0.05   0.19 0.7  0.7 1.9     2.6 0.03 0.09  0.12
#> Takehiro Tomiyasu 0.05   0.00   0.05 0.8  0.8 0.6     1.4 0.04 0.03  0.08
#> C\xe9dric Soares  0.12   0.06   0.12 0.5  0.5 1.4     2.0 0.03 0.09  0.12
#>                   npxG.1 npxG.xA.1 cluster
#> Aaron Ramsdale      0.00      0.00       4
#> Ben White           0.03      0.05       4
#> Thomas Partey       0.11      0.17       4
#> Kieran Tierney      0.03      0.12       4
#> Takehiro Tomiyasu   0.04      0.08       4
#> C\xe9dric Soares    0.03      0.12       4

Visualization

# Radar chart of the cluster profiles
library(ggiraphExtra)

ggRadar(
  data = epl_clean,
  mapping = aes(colour = cluster),
  interactive = TRUE
)

# Visualization in 2 dimensions with a PCA biplot

# Make the model
library(FactoMineR)
epl_pca1 <- PCA(X = epl_clean, # data
               scale.unit = TRUE, # scale the variables to unit variance
               quali.sup = 25, # column index of the cluster variable
               graph = FALSE) # disable graph

# Visualize the model
fviz_pca_biplot(epl_pca1,
                habillage = "cluster", # column used for coloring
                geom.ind = "point", # show the observations as points only
                addEllipses = TRUE, # draw ellipses around the clusters
                col.var = "navy") # color of the variable arrows and text
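A biplot only shows the first two principal components, so it is useful to know how much of the total variance they capture. The sketch below uses base R's prcomp() on simulated data to compute the share of variance per component. The toy columns borrow names from the dataset but the values are random, and this is a stand-in for inspecting the FactoMineR result (which, as far as I know, keeps the same information in its eig element).

```r
set.seed(70)

# Simulated data: Gls and xG are strongly related, CrdY is independent noise
n <- 200
attack <- rnorm(n)
toy <- data.frame(
  Gls  = attack + rnorm(n, sd = 0.3),
  xG   = attack + rnorm(n, sd = 0.3),
  Ast  = 0.5 * attack + rnorm(n, sd = 0.8),
  CrdY = rnorm(n)
)

# PCA on standardized variables, comparable to scale.unit = TRUE
pca <- prcomp(toy, scale. = TRUE)

# Share of variance per component and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
round(rbind(var_explained, cum_explained), 3)
```

If the first two components cover only a modest share of the variance, clusters that overlap in the biplot may still be well separated in the full set of variables.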

There is not much we can see here, because we have many variables. Let’s find the centroid of every cluster, then find which cluster has the minimum and the maximum value for each variable. Then we can do cluster profiling.

# Find Centroid in every cluster
epl_centroid <- epl_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
epl_centroid

#> # A tibble: 4 × 25
#>   cluster    MP Starts  X90s   Gls   Ast  G.PK      PK  PKatt  CrdY    CrdR
#>     <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>   <dbl>
#> 1       1 25.9   22.9  22.4   1.71 1.88  1.68  0.0294  0.0294  5.21 1.06   
#> 2       2  8.02   4.52  4.80  0.25 0.246 0.241 0.00893 0.0134  0.75 0.00893
#> 3       3 28.3   23.1  22.8   8.35 4.01  7.35  1       1.22    3.38 0.0694 
#> 4       4 27.4   23.3  23.1   1.53 1.62  1.50  0.0338  0.0435  3.58 0      
#> # … with 14 more variables: Gls.1 <dbl>, Ast.1 <dbl>, G.A <dbl>, G.PK.1 <dbl>,
#> #   G.A.PK <dbl>, xG <dbl>, npxG <dbl>, xA <dbl>, npxG.xA <dbl>, xG.1 <dbl>,
#> #   xA.1 <dbl>, xG.xA <dbl>, npxG.1 <dbl>, npxG.xA.1 <dbl>

# For each variable, find the cluster with the lowest and the highest centroid value
epl_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    min_group = which.min(value),
    max_group = which.max(value))

#> # A tibble: 24 × 3
#>    name   min_group max_group
#>    <chr>      <int>     <int>
#>  1 Ast            2         3
#>  2 Ast.1          2         3
#>  3 CrdR           4         1
#>  4 CrdY           2         1
#>  5 G.A            2         3
#>  6 G.A.PK         2         3
#>  7 G.PK           2         3
#>  8 G.PK.1         2         3
#>  9 Gls            2         3
#> 10 Gls.1          2         3
#> # … with 14 more rows
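One detail in the step above: which.min() and which.max() return the row position within each group, not the cluster label. The two coincide here only because the centroid rows are ordered by cluster 1 to 4. A slightly more defensive sketch in base R looks up the cluster column explicitly. The centroid values for Gls and CrdR are copied from the table above, and the other names are illustrative.

```r
# Centroid values for two variables, copied from the centroid table above
centroid <- data.frame(
  cluster = c(1, 2, 3, 4),
  Gls     = c(1.71, 0.25, 8.35, 1.53),
  CrdR    = c(1.06, 0.00893, 0.0694, 0)
)

stat_cols <- setdiff(names(centroid), "cluster")

# Look up the cluster label at the position of the minimum and maximum
profile <- data.frame(
  name = stat_cols,
  min_group = sapply(stat_cols, function(v) {
    centroid$cluster[which.min(centroid[[v]])]
  }),
  max_group = sapply(stat_cols, function(v) {
    centroid$cluster[which.max(centroid[[v]])]
  }),
  row.names = NULL
)
profile
```

With this version the result stays correct even if the centroid rows are reordered or a cluster is filtered out.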

Profiling every cluster

Cluster 1
- Max in aspect : CrdR (Red Cards), CrdY (Yellow Cards)
- Min in aspect : npxG.xA.1 (Non-Penalty Expected Goals plus Expected Assists per 90 mins), xA.1 (Expected Assists per 90 mins), xG.1 (Expected Goals per 90 mins), xG.xA (Expected Goals plus Expected Assists per 90 mins)
- Label : Rough player
- Description : This kind of player fouls opponents easily. They may carry a lot of defensive duties.

Cluster 2
- Max in aspect : None
- Min in aspect : Ast (Assists), Ast.1 (Assists per 90 mins), CrdY (Yellow Cards), G.A (Goals and Assists per 90 mins), G.A.PK (Goals plus Assists minus Penalty Kicks made per 90 mins), G.PK (Non-Penalty Goals), G.PK.1 (Goals minus Penalty Kicks made per 90 mins), Gls (Goals scored or allowed), Gls.1 (Goals scored per 90 mins), MP (Matches played), npxG (Non-Penalty Expected Goals), npxG.xA (Non-Penalty Expected Goals plus Expected Assists), PK (Penalty Kicks made), PKatt (Penalty Kicks attempted), Starts (Matches started), X90s (Minutes played divided by 90), xA (Expected Assists), xG (Expected Goals)
- Label : Bench warmer
- Description : This kind of player makes almost the minimum contribution to the team. They tend to spend most of their time on the bench.

Cluster 3
- Max in aspect : Ast (Assists), Ast.1 (Assists per 90 mins), G.A (Goals and Assists per 90 mins), G.A.PK (Goals plus Assists minus Penalty Kicks made per 90 mins), G.PK (Non-Penalty Goals), G.PK.1 (Goals minus Penalty Kicks made per 90 mins), Gls (Goals scored or allowed), Gls.1 (Goals scored per 90 mins), MP (Matches played), npxG (Non-Penalty Expected Goals), npxG.1 (Non-Penalty Expected Goals per 90 mins), npxG.xA (Non-Penalty Expected Goals plus Expected Assists), npxG.xA.1 (Non-Penalty Expected Goals plus Expected Assists per 90 mins), PK (Penalty Kicks made), PKatt (Penalty Kicks attempted), xA (Expected Assists), xA.1 (Expected Assists per 90 mins), xG (Expected Goals), xG.1 (Expected Goals per 90 mins), xG.xA (Expected Goals plus Expected Assists per 90 mins)
- Min in aspect : None
- Label : Key Player
- Description : This kind of player contributes a lot in attack. They do a lot in the final third, and their positions are usually attacking midfielder and striker.

Cluster 4
- Max in aspect : Starts (Matches started), X90s (Minutes played divided by 90)
- Min in aspect : CrdR (Red Cards), npxG.1 (Non-Penalty Expected Goals per 90 mins)
- Label : First Team Player
- Description : This kind of player plays in most of the matches. Their contribution in attack may not be large, but they may contribute in other areas.
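Once the clusters have names, the numeric labels can be turned into a readable factor. The sketch below does this for a handful of players, using the cluster numbers shown in the outputs above and the four labels from this profiling. The variable names are illustrative.

```r
# Cluster numbers for a few players, copied from the outputs above
cluster_id <- c(
  "Bukayo Saka"        = 3,
  "Gabriel Dos Santos" = 1,
  "Aaron Ramsdale"     = 4,
  "Ben White"          = 4,
  "Granit Xhaka"       = 1,
  "Mohamed Elneny"     = 2
)

# Labels from the profiling, in cluster order 1 to 4
cluster_labels <- c("Rough player", "Bench warmer", "Key player", "First team player")

# Map each cluster number to its label
role <- factor(cluster_id, levels = 1:4, labels = cluster_labels)

data.frame(
  player  = names(cluster_id),
  cluster = unname(cluster_id),
  role    = role,
  row.names = NULL
)

# Number of players per role in this small sample
table(role)
```

Keep in mind that the cluster numbers depend on the random seed, so the mapping from number to label has to be checked again whenever the clustering is rerun.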

Conclusion

This analysis can be used to help a manager or team staff analyze an opposing team. We can find out which players are actually dangerous in front of goal, or which players play dirty; we may want to avoid those players in duels because they could injure our players.

The main purpose of unsupervised learning is to explore our data, in the hope of getting interesting or new information. I can say this purpose is fulfilled here. We know a lot more about the players in the English Premier League (season 2021-2022).