Clustering of TOP 5 European Leagues Players

Cezary Kuźmowicz

“The ball doesn’t go in by chance” ~ Johan Cruyff

Football is the most popular sport on Earth. Millions of people around the globe play it on a daily basis, and most countries have their own national leagues. The strongest of them, however, are in Europe. When football graphs and statistics are shared, a commonly used term is the “TOP 5 European Leagues”, which simply means the five best national competitions in Europe.

This group consists of the English Premier League, Spanish LaLiga, Italian Serie A, German Bundesliga and French Ligue 1. In this report I will conduct a clustering analysis using player statistics from the 2021/22 season.

To achieve satisfactory results, several tests and visualizations will be presented. As the clustering method I chose k-means, mainly because of the medium size of the dataset and its simple interpretation.

dataset used for analysis

Installing necessary packages

To begin, we have to install and load the packages needed for the whole analysis:

library(ggplot2)
library(cluster)
library(NbClust)
library(reshape2)
library(clustertend)
## Package `clustertend` is deprecated.  Use package `hopkins` instead.
library(ClusterR)
library(fpc)
library(flexclust)
## Loading required package: grid

## Loading required package: lattice

## Loading required package: modeltools

## Loading required package: stats4
library(factoextra)
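The `library()` calls above only work if the packages are already installed. As a minimal sketch (not the author's own setup), the following checks which of them are missing before anything is loaded. The helper `missing_packages` is a hypothetical name introduced here for illustration, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Packages loaded in this report (names taken from the library() calls).
pkgs <- c("ggplot2", "cluster", "NbClust", "reshape2", "clustertend",
          "ClusterR", "fpc", "flexclust", "factoextra")

# Return the subset of `pkgs` that is not installed yet.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Since the startup message reports that `clustertend` is deprecated in favour of `hopkins`, a fresh setup may prefer to swap that name in the vector.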

CLUSTERING PREPARATIONS

Data Preparation

Loading dataset into R

My dataset comes from Kaggle. It was prepared from data published by fbref, a website that gathers a huge amount of sports statistics.

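# The file is semicolon-separated with a dot as the decimal mark;
# check.names = FALSE keeps column names such as "SoT%" and "90s" unchanged.
raw_stats <- read.csv("/Users/czarek/Downloads/2021-2022 Football Player Stats.csv", sep = ";", dec = ".",
                      check.names = FALSE)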
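Club and player names with accented characters, such as Greuther Fürth, only display correctly when R knows how the file is encoded. The sketch below assumes the CSV is stored as Latin-1, which is an assumption, because the source does not state the encoding. It builds a one-row demo file in a temporary location so that it runs without the original download.

```r
# Build a tiny semicolon-separated file that contains the byte 0xFC,
# which is "ü" in Latin-1. The row mirrors one player from the dataset.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("Player;Squad;Min\nDickson Abiama;Greuther F"),
    as.raw(0xfc),
    charToRaw("rth;726\n")),
  tmp
)

# encoding = "latin1" marks the strings as Latin-1 (assumed encoding),
# so R can display and convert them correctly.
demo <- read.csv(tmp, sep = ";", dec = ".", check.names = FALSE,
                 encoding = "latin1")

squad_utf8 <- enc2utf8(demo$Squad)
squad_utf8
```

An alternative is `fileEncoding = "latin1"`, which re-encodes the text while reading instead of only marking it.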

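The raw dataset includes nearly 3000 observations (individual players), each described by 143 variables! It is important to add that the counting statistics (goals, shots, passes and so on) are expressed “per 90 minutes”, while columns such as Min, MP or the percentages are not. This means the author has already put players with different playing time on a common scale, so that is one step less for us!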
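To make the “per 90 minutes” idea concrete, here is a small sketch of the conversion in both directions. The function names `per90` and `total_from_per90` are hypothetical helpers introduced for illustration, and the example values (Min = 3084, Goals = 0.50) are taken from Tammy Abraham's row in the dataset.

```r
# Convert a season total into a per-90 rate.
per90 <- function(total, minutes) total / (minutes / 90)

# Convert a per-90 rate back into an approximate season total.
total_from_per90 <- function(rate, minutes) rate * minutes / 90

# Tammy Abraham: 3084 minutes played, 0.50 goals per 90 minutes.
approx_goals <- total_from_per90(0.50, 3084)
round(approx_goals)
```

The result is only approximate, because the rates in the file are rounded to two decimal places.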

The first step of our data preparation is to exclude footballers who played fewer than 180 minutes over the whole season. This ensures that we do not analyze, for example, a player who scored 4 goals in the only match he played.

# Keep only players with at least 180 minutes (two full matches) played.
stats_180 <- raw_stats[raw_stats$Min >= 180, ]

This operation removed nearly 600 observations, which should improve both the overview of our dataset and the quality of the analysis.
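The effect of the minutes filter can be checked by comparing row counts before and after. The sketch below uses a three-row toy data frame rather than the real file: the first two rows reuse Min and Goals values from the dataset, while “Hypothetical Sub” is an invented player who illustrates the kind of inflated per-90 value the filter is meant to remove.

```r
# Toy data: Goals are per-90 values. The last row is made up for illustration.
toy <- data.frame(
  Player = c("Max Aarons", "Dickson Abiama", "Hypothetical Sub"),
  Min    = c(2881, 726, 90),
  Goals  = c(0.00, 0.00, 4.00)
)

# Same condition as in the report: keep players with at least 180 minutes.
toy_180 <- toy[toy$Min >= 180, ]

# Number of observations dropped by the filter.
n_removed <- nrow(toy) - nrow(toy_180)
n_removed
```

If `Min` contained missing values, plain logical indexing would return rows filled with `NA`. Wrapping the condition in `which()` is one way to avoid that.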

Before removing anything else, let’s take a look at an overview of the data:

head(stats_180)
##   Rk            Player Nation Pos             Squad           Comp Age Born MP
## 1  1        Max Aarons    ENG  DF      Norwich City Premier League  22 2000 34
## 2  2  Yunis Abdelhamid    MAR  DF             Reims        Ligue 1  34 1987 34
## 3  3 Salis Abdul Samed    GHA  MF     Clermont Foot        Ligue 1  22 2000 31
## 4  4   Laurent Abergel    FRA  MF           Lorient        Ligue 1  29 1993 34
## 6  6    Dickson Abiama    NGA  FW    Greuther Fürth     Bundesliga  23 1998 24
## 8  8     Tammy Abraham    ENG  FW              Roma        Serie A  24 1997 37
##   Starts  Min  90s Goals Shots  SoT SoT% G/Sh G/SoT ShoDist ShoFK ShoPK PKatt
## 1     32 2881 32.0  0.00  0.41 0.06 15.4 0.00  0.00    20.5  0.00  0.00  0.00
## 2     34 2983 33.1  0.06  0.54 0.18 33.3 0.11  0.33    18.7  0.00  0.00  0.00
## 3     29 2462 27.4  0.04  0.66 0.18 27.8 0.06  0.20    20.3  0.00  0.00  0.00
## 4     34 2956 32.8  0.00  0.91 0.21 23.3 0.00  0.00    22.6  0.00  0.00  0.00
## 6      5  726  8.1  0.00  2.22 0.49 22.2 0.00  0.00    15.6  0.00  0.00  0.00
## 8     36 3084 34.3  0.50  2.71 0.93 34.4 0.15  0.44    12.2  0.06  0.09  0.09
##   PasTotCmp PasTotAtt PasTotCmp% PasTotDist PasTotPrgDist PasShoCmp PasShoAtt
## 1      34.0      45.0       75.5      574.1         214.8     17.50     19.40
## 2      38.7      47.0       82.4      835.8         287.9     10.20     11.40
## 3      55.9      61.0       91.7     1033.3         184.4     22.50     24.10
## 4      40.7      49.8       81.6      780.8         206.0     16.30     18.40
## 6      11.1      17.2       64.7      160.0          40.7      6.67      9.63
## 8      14.6      20.2       72.0      224.9          51.2      8.69     11.40
##   PasShoCmp% PasMedCmp PasMedAtt PasMedCmp% PasLonCmp PasLonAtt PasLonCmp%
## 1       90.0     13.10     17.00       77.0      3.06      6.78       45.2
## 2       89.9     22.40     25.00       89.4      5.65      9.15       61.7
## 3       93.5     25.80     27.20       94.9      6.72      7.81       86.0
## 4       88.6     17.30     19.60       87.9      6.25      9.39       66.6
## 6       69.2      3.09      4.69       65.8      0.62      1.23       50.0
## 8       76.2      4.05      5.86       69.2      1.28      1.60       80.0
##   Assists PasAss Pas3rd  PPA CrsPA PasProg PasAtt PasLive PasDead PasFK   TB
## 1    0.06   0.59   1.56 1.13  0.25    2.94   45.0    34.4   10.60  0.84 0.06
## 2    0.00   0.24   2.45 0.18  0.00    2.72   47.0    44.0    3.02  2.45 0.00
## 3    0.00   0.55   2.81 0.47  0.04    2.96   61.0    60.3    0.73  0.58 0.04
## 4    0.06   0.91   3.87 0.58  0.18    4.18   49.8    49.0    0.85  0.64 0.18
## 6    0.12   0.99   0.86 0.74  0.00    1.60   17.2    16.0    1.11  0.12 0.12
## 8    0.12   1.02   1.08 0.82  0.12    1.78   20.2    18.3    1.98  0.17 0.15
##   PasPress   Sw PasCrs   CK CkIn CkOut CkStr PasGround PasLow PasHigh PaswLeft
## 1     5.41 0.59   1.41 0.00    0     0     0      26.5   9.59    8.94     4.91
## 2     5.68 1.66   0.06 0.00    0     0     0      35.3   3.78    7.95    31.70
## 3     8.03 0.80   0.36 0.00    0     0     0      52.6   4.71    3.72     4.82
## 4     9.48 1.49   0.79 0.03    0     0     0      37.6   5.64    6.65     4.48
## 6     5.19 0.25   0.25 0.00    0     0     0      10.7   3.95    2.47     3.33
## 8     5.92 0.32   0.70 0.00    0     0     0      12.8   4.37    3.03     1.84
##   PaswRight PaswHead   TI PaswOther PasCmp PasOff PasOut PasInt PasBlocks  SCA
## 1     29.00     0.91 9.72      0.06   34.0   0.22   0.88   1.63      1.75 1.19
## 2     12.10     1.48 0.42      0.12   38.7   0.15   0.97   1.24      0.88 0.63
## 3     53.10     1.90 0.15      0.29   55.9   0.07   0.58   1.24      0.84 1.46
## 4     43.90     0.73 0.15      0.15   40.7   0.21   0.55   1.83      1.68 2.01
## 6      9.38     2.35 0.00      0.37   11.1   0.25   0.62   0.62      1.11 2.47
## 8     14.10     1.92 0.03      0.70   14.6   0.00   0.29   1.02      0.70 2.33
##   ScaPassLive ScaPassDead ScaDrib ScaSh ScaFld ScaDef  GCA GcaPassLive
## 1        0.84        0.06    0.09  0.13   0.06   0.00 0.16        0.16
## 2        0.42        0.00    0.09  0.03   0.00   0.09 0.03        0.00
## 3        1.09        0.00    0.00  0.15   0.15   0.07 0.04        0.04
## 4        1.49        0.06    0.03  0.03   0.21   0.18 0.15        0.12
## 6        1.48        0.00    0.00  0.12   0.25   0.62 0.25        0.12
## 8        1.63        0.03    0.03  0.38   0.17   0.09 0.35        0.15
##   GcaPassDead GcaDrib GcaSh GcaFld GcaDef  Tkl TklWon TklDef3rd TklMid3rd
## 1           0    0.00  0.00   0.00      0 2.16   1.16      1.56      0.59
## 2           0    0.03  0.00   0.00      0 1.87   1.39      1.24      0.60
## 3           0    0.00  0.00   0.00      0 2.01   1.24      0.91      0.91
## 4           0    0.00  0.00   0.03      0 3.57   2.23      1.49      1.71
## 6           0    0.00  0.12   0.00      0 1.73   0.86      0.37      0.86
## 8           0    0.03  0.09   0.09      0 0.99   0.64      0.29      0.44
##   TklAtt3rd TklDri TklDriAtt TklDri% TklDriPast Press PresSucc Press%
## 1      0.00   1.16      1.81    63.8       0.66  13.6     3.53   26.0
## 2      0.03   0.39      0.82    48.1       0.42  13.6     4.89   35.9
## 3      0.18   0.69      2.15    32.2       1.46  23.4     6.53   27.9
## 4      0.37   1.80      4.97    36.2       3.17  28.0     7.90   28.2
## 6      0.49   0.25      1.36    18.2       1.11  23.5     4.81   20.5
## 8      0.26   0.29      0.67    43.5       0.38  13.6     4.02   29.6
##   PresDef3rd PresMid3rd PresAtt3rd Blocks BlkSh BlkShSv BlkPass  Int Tkl+Int
## 1       7.97       4.38       1.22   2.69  0.69    0.03    2.00 1.75    3.91
## 2       7.61       5.14       0.88   1.87  0.79    0.06    1.09 3.11    4.98
## 3       7.19      12.30       3.94   0.99  0.04    0.00    0.95 1.86    3.87
## 4       9.27      15.30       3.41   1.68  0.09    0.00    1.59 2.56    6.13
## 6       2.59      10.00      10.90   1.23  0.37    0.00    0.86 0.99    2.72
## 8       1.34       5.71       6.56   0.96  0.32    0.00    0.64 0.44    1.43
##    Clr Err Touches TouDefPen TouDef3rd TouMid3rd TouAtt3rd TouAttPen TouLive
## 1 2.19   0    58.0      5.06     23.30      23.8      15.0      0.91    47.8
## 2 3.20   0    57.3      8.28     32.80      25.7       2.9      0.85    54.5
## 3 0.55   0    70.4      2.01     22.70      41.8      10.9      0.62    69.9
## 4 0.34   0    61.6      0.67     13.70      40.3      11.6      0.46    60.9
## 6 0.86   0    33.0      1.11      3.46      15.6      15.6      3.83    31.9
## 8 0.90   0    32.4      0.96      2.54      16.3      15.2      5.86    30.4
##   DriSucc DriAtt DriSucc% DriPast DriMegs Carries CarTotDist CarPrgDist CarProg
## 1    1.03   2.44     42.3    1.09    0.19    33.9      199.4      121.7    5.44
## 2    0.48   0.66     72.7    0.48    0.03    35.7      204.7      115.5    2.75
## 3    0.99   1.53     64.3    1.09    0.07    53.5      246.5      106.3    2.85
## 4    1.28   1.98     64.6    1.34    0.09    45.7      171.9       86.4    2.87
## 6    0.74   2.22     33.3    0.86    0.12    19.0       74.7       40.2    2.59
## 8    1.08   2.22     48.7    1.14    0.09    18.0       76.8       39.4    2.45
##   Car3rd  CPA CarMis CarDis RecTarg  Rec Rec% RecProg CrdY CrdR 2CrdY  Fls  Fld
## 1   1.66 0.41   0.84   0.94    36.0 32.4 89.9    1.28 0.25 0.00  0.00 0.97 1.84
## 2   0.73 0.00   0.45   0.39    37.5 36.3 96.9    0.36 0.15 0.03  0.00 1.30 0.73
## 3   0.73 0.15   0.84   1.46    58.6 54.2 92.5    1.72 0.44 0.11  0.07 1.64 1.28
## 4   1.13 0.09   0.85   1.46    46.3 43.0 93.0    1.86 0.27 0.00  0.00 1.40 2.07
## 6   0.99 0.86   5.06   1.36    41.4 21.1 51.0    5.93 0.37 0.00  0.00 2.22 1.48
## 8   0.82 0.67   2.39   1.28    41.2 22.4 54.3    6.71 0.26 0.00  0.00 1.25 1.46
##    Off  Crs TklW PKwon PKcon   OG Recov AerWon AerLost AerWon%
## 1 0.03 1.41 1.16  0.00  0.06 0.03  5.53   0.47    1.59    22.7
## 2 0.00 0.06 1.39  0.00  0.03 0.00  6.77   2.02    1.36    59.8
## 3 0.00 0.36 1.24  0.00  0.00 0.00  8.76   0.88    0.88    50.0
## 4 0.03 0.79 2.23  0.00  0.00 0.00  8.87   0.43    0.43    50.0
## 6 1.85 0.25 0.86  0.00  0.00 0.00  4.81   2.72    4.94    35.5
## 8 0.50 0.70 0.64  0.03  0.03 0.00  3.67   2.39    2.89    45.3
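With 143 columns, the printed head is hard to scan. A more compact overview is a count of column types plus a list of the numeric columns, since k-means works on numeric variables only. This is a sketch on a small toy data frame built from values shown above, not the author's own step. On the real data the same calls would be applied to `stats_180`.

```r
# Toy data frame with a few columns copied from the first rows of the dataset.
toy <- data.frame(
  Player = c("Max Aarons", "Yunis Abdelhamid", "Salis Abdul Samed"),
  Pos    = c("DF", "DF", "MF"),
  Min    = c(2881L, 2983L, 2462L),
  Goals  = c(0.00, 0.06, 0.04),
  Shots  = c(0.41, 0.54, 0.66),
  stringsAsFactors = FALSE
)

# First class of every column, then a frequency table of those classes.
col_types   <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)

# Names of the columns that could enter a distance-based method.
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

type_counts
numeric_cols
```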
summary(stats_180)
##        Rk            Player             Nation              Pos           
##  Min.   :   1.0   Length:2350        Length:2350        Length:2350       
##  1st Qu.: 739.2   Class :character   Class :character   Class :character  
##  Median :1458.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1459.1                                                           
##  3rd Qu.:2170.8                                                           
##  Max.   :2921.0                                                           
##     Squad               Comp                Age             Born     
##  Length:2350        Length:2350        Min.   :17.00   Min.   :1981  
##  Class :character   Class :character   1st Qu.:24.00   1st Qu.:1992  
##  Mode  :character   Mode  :character   Median :26.00   Median :1995  
##                                        Mean   :26.76   Mean   :1995  
##                                        3rd Qu.:30.00   3rd Qu.:1998  
##                                        Max.   :40.00   Max.   :2004  
##        MP            Starts        Min              90s            Goals       
##  Min.   : 2.00   Min.   : 0   Min.   : 180.0   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.:15.00   1st Qu.: 8   1st Qu.: 764.2   1st Qu.: 8.50   1st Qu.:0.0000  
##  Median :24.00   Median :16   Median :1447.5   Median :16.10   Median :0.0500  
##  Mean   :22.68   Mean   :17   Mean   :1520.6   Mean   :16.89   Mean   :0.1214  
##  3rd Qu.:31.00   3rd Qu.:26   3rd Qu.:2221.2   3rd Qu.:24.70   3rd Qu.:0.1800  
##  Max.   :38.00   Max.   :38   Max.   :3420.0   Max.   :38.00   Max.   :1.4300  
##      Shots            SoT              SoT%             G/Sh        
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Min.   :0.00000  
##  1st Qu.:0.420   1st Qu.:0.0700   1st Qu.: 14.30   1st Qu.:0.00000  
##  Median :0.890   Median :0.2400   Median : 28.00   Median :0.05000  
##  Mean   :1.179   Mean   :0.3820   Mean   : 26.69   Mean   :0.07494  
##  3rd Qu.:1.830   3rd Qu.:0.5975   3rd Qu.: 37.90   3rd Qu.:0.13000  
##  Max.   :5.100   Max.   :2.3500   Max.   :100.00   Max.   :1.00000  
##      G/SoT           ShoDist         ShoFK             ShoPK         
##  Min.   :0.0000   Min.   : 0.0   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000   1st Qu.:12.0   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.1700   Median :16.1   Median :0.00000   Median :0.000000  
##  Mean   :0.2257   Mean   :15.4   Mean   :0.03797   Mean   :0.009553  
##  3rd Qu.:0.3800   3rd Qu.:20.1   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :2.0000   Max.   :71.2   Max.   :1.04000   Max.   :0.500000  
##      PKatt           PasTotCmp       PasTotAtt        PasTotCmp%   
##  Min.   :0.00000   Min.   : 7.11   Min.   : 11.30   Min.   :40.50  
##  1st Qu.:0.00000   1st Qu.:23.30   1st Qu.: 32.10   1st Qu.:72.00  
##  Median :0.00000   Median :32.75   Median : 41.95   Median :78.10  
##  Mean   :0.01221   Mean   :34.26   Mean   : 43.21   Mean   :77.45  
##  3rd Qu.:0.00000   3rd Qu.:43.08   3rd Qu.: 52.80   3rd Qu.:83.60  
##  Max.   :0.80000   Max.   :96.10   Max.   :100.60   Max.   :96.80  
##    PasTotDist     PasTotPrgDist      PasShoCmp        PasShoAtt    
##  Min.   : 103.7   Min.   :  11.4   Min.   : 0.500   Min.   : 0.50  
##  1st Qu.: 431.9   1st Qu.: 116.6   1st Qu.: 9.453   1st Qu.:11.60  
##  Median : 641.1   Median : 206.7   Median :13.500   Median :15.70  
##  Mean   : 661.9   Mean   : 217.9   Mean   :14.136   Mean   :16.22  
##  3rd Qu.: 835.8   3rd Qu.: 299.6   3rd Qu.:17.900   3rd Qu.:20.10  
##  Max.   :2062.9   Max.   :1019.6   Max.   :46.400   Max.   :49.10  
##    PasShoCmp%       PasMedCmp        PasMedAtt       PasMedCmp%    
##  Min.   : 57.90   Min.   : 1.300   Min.   : 3.00   Min.   : 27.30  
##  1st Qu.: 82.72   1st Qu.: 8.402   1st Qu.:10.50   1st Qu.: 76.70  
##  Median : 87.50   Median :13.600   Median :16.30   Median : 84.10  
##  Mean   : 86.73   Mean   :14.470   Mean   :16.92   Mean   : 83.03  
##  3rd Qu.: 91.30   3rd Qu.:18.800   3rd Qu.:21.88   3rd Qu.: 90.40  
##  Max.   :100.00   Max.   :46.200   Max.   :48.90   Max.   :100.00  
##    PasLonCmp        PasLonAtt        PasLonCmp%        Assists       
##  Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :0.00000  
##  1st Qu.: 2.410   1st Qu.: 4.420   1st Qu.: 50.00   1st Qu.:0.00000  
##  Median : 4.360   Median : 7.515   Median : 59.15   Median :0.05000  
##  Mean   : 4.957   Mean   : 8.288   Mean   : 59.53   Mean   :0.08705  
##  3rd Qu.: 6.990   3rd Qu.:11.100   3rd Qu.: 70.00   3rd Qu.:0.14000  
##  Max.   :22.800   Max.   :42.600   Max.   :100.00   Max.   :1.30000  
##      PasAss           Pas3rd            PPA             CrsPA      
##  Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.2725   1st Qu.: 1.180   1st Qu.:0.2225   1st Qu.:0.000  
##  Median :0.7300   Median : 2.155   Median :0.6100   Median :0.100  
##  Mean   :0.8339   Mean   : 2.402   Mean   :0.7257   Mean   :0.198  
##  3rd Qu.:1.2000   3rd Qu.: 3.237   3rd Qu.:1.0900   3rd Qu.:0.310  
##  Max.   :5.0000   Max.   :11.700   Max.   :3.7700   Max.   :2.000  
##     PasProg           PasAtt          PasLive         PasDead      
##  Min.   : 0.000   Min.   : 11.30   Min.   :10.10   Min.   : 0.000  
##  1st Qu.: 1.680   1st Qu.: 32.10   1st Qu.:28.10   1st Qu.: 1.312  
##  Median : 2.670   Median : 41.95   Median :37.50   Median : 2.520  
##  Mean   : 2.794   Mean   : 43.21   Mean   :38.98   Mean   : 4.239  
##  3rd Qu.: 3.800   3rd Qu.: 52.80   3rd Qu.:47.80   3rd Qu.: 7.140  
##  Max.   :10.300   Max.   :100.60   Max.   :98.10   Max.   :18.500  
##      PasFK             TB             PasPress            Sw       
##  Min.   :0.000   Min.   :0.00000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:0.290   1st Qu.:0.00000   1st Qu.: 5.030   1st Qu.:0.570  
##  Median :0.890   Median :0.00000   Median : 6.355   Median :1.000  
##  Mean   :1.099   Mean   :0.06671   Mean   : 6.444   Mean   :1.225  
##  3rd Qu.:1.640   3rd Qu.:0.10000   3rd Qu.: 7.810   3rd Qu.:1.640  
##  Max.   :8.000   Max.   :1.00000   Max.   :15.300   Max.   :6.880  
##      PasCrs            CK              CkIn            CkOut       
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.190   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.760   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :1.104   Mean   :0.4404   Mean   :0.1806   Mean   :0.1645  
##  3rd Qu.:1.790   3rd Qu.:0.2600   3rd Qu.:0.0600   3rd Qu.:0.0600  
##  Max.   :9.130   Max.   :5.9400   Max.   :2.9100   Max.   :3.3000  
##      CkStr           PasGround         PasLow          PasHigh      
##  Min.   :0.00000   Min.   : 3.67   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.00000   1st Qu.:19.43   1st Qu.: 3.712   1st Qu.: 5.210  
##  Median :0.00000   Median :26.90   Median : 4.960   Median : 7.870  
##  Mean   :0.01907   Mean   :29.13   Mean   : 5.644   Mean   : 8.446  
##  3rd Qu.:0.00000   3rd Qu.:36.40   3rd Qu.: 6.880   3rd Qu.:10.700  
##  Max.   :1.04000   Max.   :91.30   Max.   :18.300   Max.   :37.900  
##     PaswLeft        PaswRight        PaswHead           TI        
##  Min.   : 0.000   Min.   : 0.85   Min.   :0.000   Min.   : 0.000  
##  1st Qu.: 3.473   1st Qu.: 9.97   1st Qu.:0.980   1st Qu.: 0.080  
##  Median : 5.900   Median :24.90   Median :1.640   Median : 0.270  
##  Mean   :12.686   Mean   :25.17   Mean   :1.713   Mean   : 1.894  
##  3rd Qu.:16.875   3rd Qu.:35.60   3rd Qu.:2.370   3rd Qu.: 1.300  
##  Max.   :75.900   Max.   :91.60   Max.   :6.150   Max.   :15.700  
##    PaswOther          PasCmp          PasOff           PasOut      
##  Min.   :0.0000   Min.   : 7.11   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1100   1st Qu.:23.30   1st Qu.:0.0400   1st Qu.:0.4700  
##  Median :0.2200   Median :32.75   Median :0.1200   Median :0.7000  
##  Mean   :0.5636   Mean   :34.26   Mean   :0.1437   Mean   :0.7497  
##  3rd Qu.:0.4200   3rd Qu.:43.08   3rd Qu.:0.2100   3rd Qu.:0.9800  
##  Max.   :7.0900   Max.   :96.10   Max.   :1.0000   Max.   :2.6500  
##      PasInt        PasBlocks          SCA         ScaPassLive   
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.870   1st Qu.:0.600   1st Qu.:0.770   1st Qu.:0.580  
##  Median :1.300   Median :1.050   Median :1.680   Median :1.240  
##  Mean   :1.341   Mean   :1.096   Mean   :1.788   Mean   :1.284  
##  3rd Qu.:1.760   3rd Qu.:1.540   3rd Qu.:2.538   3rd Qu.:1.810  
##  Max.   :5.000   Max.   :4.500   Max.   :7.310   Max.   :5.600  
##   ScaPassDead        ScaDrib            ScaSh             ScaFld       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.06000   Median :0.04000  
##  Mean   :0.1616   Mean   :0.09674   Mean   :0.09749   Mean   :0.09969  
##  3rd Qu.:0.1500   3rd Qu.:0.14000   3rd Qu.:0.15000   3rd Qu.:0.15000  
##  Max.   :3.0000   Max.   :1.39000   Max.   :0.91000   Max.   :1.00000  
##      ScaDef             GCA          GcaPassLive      GcaPassDead     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.1500   Median :0.1000   Median :0.00000  
##  Mean   :0.04902   Mean   :0.1989   Mean   :0.1358   Mean   :0.01273  
##  3rd Qu.:0.08000   3rd Qu.:0.3000   3rd Qu.:0.2100   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.5400   Max.   :1.3000   Max.   :0.56000  
##     GcaDrib            GcaSh             GcaFld            GcaDef        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.01206   Mean   :0.01797   Mean   :0.01511   Mean   :0.005119  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :0.50000   Max.   :0.49000   Max.   :0.50000   Max.   :0.370000  
##       Tkl            TklWon        TklDef3rd        TklMid3rd     
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.030   1st Qu.:0.600   1st Qu.:0.3300   1st Qu.:0.3400  
##  Median :1.650   Median :0.975   Median :0.7400   Median :0.6100  
##  Mean   :1.674   Mean   :1.020   Mean   :0.7936   Mean   :0.6524  
##  3rd Qu.:2.310   3rd Qu.:1.400   3rd Qu.:1.1800   3rd Qu.:0.9200  
##  Max.   :5.880   Max.   :3.390   Max.   :4.1200   Max.   :2.6900  
##    TklAtt3rd          TklDri         TklDriAtt        TklDri%      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :  0.00  
##  1st Qu.:0.0700   1st Qu.:0.2700   1st Qu.:0.860   1st Qu.: 26.23  
##  Median :0.2000   Median :0.5600   Median :1.400   Median : 37.95  
##  Mean   :0.2277   Mean   :0.5959   Mean   :1.477   Mean   : 36.99  
##  3rd Qu.:0.3400   3rd Qu.:0.8600   3rd Qu.:2.020   3rd Qu.: 50.00  
##  Max.   :1.7600   Max.   :2.6100   Max.   :5.240   Max.   :100.00  
##    TklDriPast         Press          PresSucc          Press%      
##  Min.   :0.0000   Min.   : 0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:0.4725   1st Qu.:10.30   1st Qu.: 3.180   1st Qu.: 25.70  
##  Median :0.8100   Median :14.80   Median : 4.340   Median : 29.70  
##  Mean   :0.8812   Mean   :14.42   Mean   : 4.261   Mean   : 29.22  
##  3rd Qu.:1.2200   3rd Qu.:18.80   3rd Qu.: 5.490   3rd Qu.: 33.50  
##  Max.   :4.2900   Max.   :33.80   Max.   :11.500   Max.   :100.00  
##    PresDef3rd       PresMid3rd       PresAtt3rd         Blocks      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.: 2.712   1st Qu.: 3.933   1st Qu.: 1.000   1st Qu.:0.9325  
##  Median : 4.660   Median : 6.230   Median : 2.915   Median :1.4300  
##  Mean   : 4.545   Mean   : 6.418   Mean   : 3.459   Mean   :1.4061  
##  3rd Qu.: 6.250   3rd Qu.: 8.818   3rd Qu.: 5.395   3rd Qu.:1.8900  
##  Max.   :13.900   Max.   :19.200   Max.   :15.900   Max.   :5.2200  
##      BlkSh           BlkShSv            BlkPass          Int       
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.0600   1st Qu.:0.000000   1st Qu.:0.69   1st Qu.:0.700  
##  Median :0.2100   Median :0.000000   Median :1.08   Median :1.390  
##  Mean   :0.3159   Mean   :0.006277   Mean   :1.09   Mean   :1.379  
##  3rd Qu.:0.4700   3rd Qu.:0.000000   3rd Qu.:1.47   3rd Qu.:1.970  
##  Max.   :2.5800   Max.   :0.500000   Max.   :5.22   Max.   :6.360  
##     Tkl+Int            Clr             Err            Touches      
##  Min.   : 0.000   Min.   :0.000   Min.   :0.0000   Min.   : 20.60  
##  1st Qu.: 1.920   1st Qu.:0.500   1st Qu.:0.0000   1st Qu.: 42.70  
##  Median : 3.190   Median :1.170   Median :0.0000   Median : 53.10  
##  Mean   : 3.053   Mean   :1.683   Mean   :0.0208   Mean   : 54.21  
##  3rd Qu.: 4.180   3rd Qu.:2.458   3rd Qu.:0.0300   3rd Qu.: 64.10  
##  Max.   :10.500   Max.   :8.510   Max.   :1.0000   Max.   :109.60  
##    TouDefPen        TouDef3rd        TouMid3rd       TouAtt3rd     
##  Min.   : 0.000   Min.   : 0.740   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 1.120   1st Qu.: 6.982   1st Qu.:18.82   1st Qu.: 6.183  
##  Median : 2.590   Median :15.300   Median :25.30   Median :14.950  
##  Mean   : 5.472   Mean   :17.467   Mean   :25.84   Mean   :14.223  
##  3rd Qu.: 5.928   3rd Qu.:25.975   3rd Qu.:33.00   3rd Qu.:20.600  
##  Max.   :44.800   Max.   :59.200   Max.   :70.90   Max.   :47.000  
##    TouAttPen         TouLive         DriSucc           DriAtt     
##  Min.   : 0.000   Min.   : 15.3   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 0.710   1st Qu.: 40.0   1st Qu.:0.2500   1st Qu.:0.470  
##  Median : 1.490   Median : 49.2   Median :0.6400   Median :1.210  
##  Mean   : 2.231   Mean   : 50.1   Mean   :0.8075   Mean   :1.516  
##  3rd Qu.: 3.498   3rd Qu.: 59.3   3rd Qu.:1.1700   3rd Qu.:2.230  
##  Max.   :11.200   Max.   :105.9   Max.   :5.5800   Max.   :8.750  
##     DriSucc%         DriPast          DriMegs           Carries     
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.00000   Min.   : 8.67  
##  1st Qu.: 41.83   1st Qu.:0.2800   1st Qu.:0.00000   1st Qu.:25.30  
##  Median : 52.90   Median :0.7100   Median :0.00000   Median :32.90  
##  Mean   : 51.44   Mean   :0.8881   Mean   :0.07947   Mean   :34.04  
##  3rd Qu.: 65.67   3rd Qu.:1.2800   3rd Qu.:0.12000   3rd Qu.:41.60  
##  Max.   :100.00   Max.   :5.7700   Max.   :1.67000   Max.   :87.70  
##    CarTotDist      CarPrgDist        CarProg           Car3rd     
##  Min.   : 30.6   Min.   :  6.19   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:120.9   1st Qu.: 59.52   1st Qu.: 2.020   1st Qu.:0.450  
##  Median :167.8   Median : 87.70   Median : 3.375   Median :1.020  
##  Mean   :172.5   Mean   : 92.16   Mean   : 3.668   Mean   :1.128  
##  3rd Qu.:215.2   3rd Qu.:117.50   3rd Qu.: 4.980   3rd Qu.:1.637  
##  Max.   :500.7   Max.   :305.30   Max.   :15.300   Max.   :5.000  
##       CPA            CarMis          CarDis         RecTarg     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   : 6.50  
##  1st Qu.:0.000   1st Qu.:0.370   1st Qu.:0.370   1st Qu.:33.40  
##  Median :0.190   Median :1.000   Median :0.970   Median :41.50  
##  Mean   :0.386   Mean   :1.293   Mean   :1.122   Mean   :41.33  
##  3rd Qu.:0.550   3rd Qu.:2.030   3rd Qu.:1.670   3rd Qu.:49.10  
##  Max.   :4.760   Max.   :9.170   Max.   :6.220   Max.   :94.90  
##       Rec             Rec%           RecProg            CrdY       
##  Min.   : 6.25   Min.   : 38.20   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:26.20   1st Qu.: 77.70   1st Qu.: 0.430   1st Qu.:0.0800  
##  Median :33.10   Median : 89.90   Median : 2.205   Median :0.1800  
##  Mean   :34.86   Mean   : 85.07   Mean   : 3.098   Mean   :0.2054  
##  3rd Qu.:42.08   3rd Qu.: 96.30   3rd Qu.: 5.350   3rd Qu.:0.2900  
##  Max.   :90.70   Max.   :100.00   Max.   :13.200   Max.   :1.6000  
##       CrdR              2CrdY               Fls             Fld       
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.840   1st Qu.:0.630  
##  Median :0.000000   Median :0.000000   Median :1.295   Median :1.140  
##  Mean   :0.009855   Mean   :0.004272   Mean   :1.338   Mean   :1.252  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:1.780   3rd Qu.:1.730  
##  Max.   :0.510000   Max.   :0.260000   Max.   :4.780   Max.   :5.000  
##       Off             Crs             TklW           PKwon        
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.190   1st Qu.:0.600   1st Qu.:0.00000  
##  Median :0.050   Median :0.760   Median :0.975   Median :0.00000  
##  Mean   :0.181   Mean   :1.104   Mean   :1.020   Mean   :0.01097  
##  3rd Qu.:0.240   3rd Qu.:1.790   3rd Qu.:1.400   3rd Qu.:0.00000  
##  Max.   :2.070   Max.   :9.130   Max.   :3.390   Max.   :0.50000  
##      PKcon              OG               Recov            AerWon      
##  Min.   :0.0000   Min.   :0.000000   Min.   : 1.430   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.: 5.530   1st Qu.: 0.660  
##  Median :0.0000   Median :0.000000   Median : 7.500   Median : 1.310  
##  Mean   :0.0143   Mean   :0.004289   Mean   : 7.516   Mean   : 1.611  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.: 9.260   3rd Qu.: 2.270  
##  Max.   :0.5000   Max.   :0.500000   Max.   :15.700   Max.   :12.400  
##     AerLost         AerWon%      
##  Min.   :0.000   Min.   :  0.00  
##  1st Qu.:0.950   1st Qu.: 33.30  
##  Median :1.410   Median : 45.55  
##  Mean   :1.702   Mean   : 43.48  
##  3rd Qu.:2.087   3rd Qu.: 57.90  
##  Max.   :9.720   Max.   :100.00
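The summary above shows identical quartiles for some pairs of columns, for example PasTotCmp and PasCmp, PasTotAtt and PasAtt, TklWon and TklW, or PasCrs and Crs. That suggests, but does not prove, that these columns are duplicates. A minimal sketch of how such columns could be detected, using a toy data frame with values from the first three players:

```r
# Toy data frame: values copied from the first three rows of the dataset.
toy <- data.frame(
  PasTotCmp = c(34.0, 38.7, 55.9),
  PasTotAtt = c(45.0, 47.0, 61.0),
  PasCmp    = c(34.0, 38.7, 55.9),
  PasAtt    = c(45.0, 47.0, 61.0),
  check.names = FALSE
)

# duplicated() on the list of columns flags every column whose values
# are identical to those of an earlier column.
dup_cols <- names(toy)[duplicated(as.list(toy))]
dup_cols

# Dropping the flagged columns keeps the first copy of each pair.
toy_unique <- toy[, !duplicated(as.list(toy)), drop = FALSE]
```

Redundant columns effectively give the same information double weight in a distance-based method such as k-means, so checking for them before clustering is a reasonable precaution.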
str(stats_180)
## 'data.frame':    2350 obs. of  143 variables:
##  $ Rk           : int  1 2 3 4 6 8 9 10 11 13 ...
##  $ Player       : chr  "Max Aarons" "Yunis Abdelhamid" "Salis Abdul Samed" "Laurent Abergel" ...
##  $ Nation       : chr  "ENG" "MAR" "GHA" "FRA" ...
##  $ Pos          : chr  "DF" "DF" "MF" "MF" ...
##  $ Squad        : chr  "Norwich City" "Reims" "Clermont Foot" "Lorient" ...
##  $ Comp         : chr  "Premier League" "Ligue 1" "Ligue 1" "Ligue 1" ...
##  $ Age          : int  22 34 22 29 23 24 26 34 23 30 ...
##  $ Born         : int  2000 1987 2000 1993 1998 1997 1996 1988 1998 1991 ...
##  $ MP           : int  34 34 31 34 24 37 8 30 13 31 ...
##  $ Starts       : int  32 34 29 34 5 36 6 29 1 26 ...
##  $ Min          : int  2881 2983 2462 2956 726 3084 560 2536 259 2260 ...
##  $ 90s          : num  32 33.1 27.4 32.8 8.1 34.3 6.2 28.2 2.9 25.1 ...
##  $ Goals        : num  0 0.06 0.04 0 0 0.5 0 0.14 0 0.04 ...
##  $ Shots        : num  0.41 0.54 0.66 0.91 2.22 2.71 0.32 0.57 3.79 0.68 ...
##  $ SoT          : num  0.06 0.18 0.18 0.21 0.49 0.93 0 0.25 0.69 0.2 ...
##  $ SoT%         : num  15.4 33.3 27.8 23.3 22.2 34.4 0 43.8 18.2 29.4 ...
##  $ G/Sh         : num  0 0.11 0.06 0 0 0.15 0 0.25 0 0.06 ...
##  $ G/SoT        : num  0 0.33 0.2 0 0 0.44 0 0.57 0 0.2 ...
##  $ ShoDist      : num  20.5 18.7 20.3 22.6 15.6 12.2 7 9.2 15.8 22.1 ...
##  $ ShoFK        : num  0 0 0 0 0 0.06 0 0 0 0.04 ...
##  $ ShoPK        : num  0 0 0 0 0 0.09 0 0 0 0 ...
##  $ PKatt        : num  0 0 0 0 0 0.09 0 0 0 0 ...
##  $ PasTotCmp    : num  34 38.7 55.9 40.7 11.1 14.6 31.3 64.3 14.5 62.3 ...
##  $ PasTotAtt    : num  45 47 61 49.8 17.2 20.2 35.5 71.1 24.5 78.5 ...
##  $ PasTotCmp%   : num  75.5 82.4 91.7 81.6 64.7 72 88.2 90.3 59.2 79.4 ...
##  $ PasTotDist   : num  574 836 1033 781 160 ...
##  $ PasTotPrgDist: num  214.8 287.9 184.4 206 40.7 ...
##  $ PasShoCmp    : num  17.5 10.2 22.5 16.3 6.67 8.69 6.77 20.1 7.59 23.5 ...
##  $ PasShoAtt    : num  19.4 11.4 24.1 18.4 9.63 11.4 7.42 21.6 12.1 25.3 ...
##  $ PasShoCmp%   : num  90 89.9 93.5 88.6 69.2 76.2 91.3 93.1 62.9 93.1 ...
##  $ PasMedCmp    : num  13.1 22.4 25.8 17.3 3.09 4.05 18.5 33.3 5.86 27.2 ...
##  $ PasMedAtt    : num  17 25 27.2 19.6 4.69 5.86 20 35.3 9.31 32 ...
##  $ PasMedCmp%   : num  77 89.4 94.9 87.9 65.8 69.2 92.7 94.4 63 84.9 ...
##  $ PasLonCmp    : num  3.06 5.65 6.72 6.25 0.62 1.28 5.81 9.93 0.34 11 ...
##  $ PasLonAtt    : num  6.78 9.15 7.81 9.39 1.23 1.6 7.74 12.9 1.03 18.9 ...
##  $ PasLonCmp%   : num  45.2 61.7 86 66.6 50 80 75 77.1 33.3 58 ...
##  $ Assists      : num  0.06 0 0 0.06 0.12 0.12 0 0 0.34 0.12 ...
##  $ PasAss       : num  0.59 0.24 0.55 0.91 0.99 1.02 0 0.21 0.69 1.63 ...
##  $ Pas3rd       : num  1.56 2.45 2.81 3.87 0.86 1.08 1.29 4.33 1.03 4.1 ...
##  $ PPA          : num  1.13 0.18 0.47 0.58 0.74 0.82 0 0.21 0.69 1.67 ...
##  $ CrsPA        : num  0.25 0 0.04 0.18 0 0.12 0 0 0.69 0.92 ...
##  $ PasProg      : num  2.94 2.72 2.96 4.18 1.6 1.78 1.45 3.12 1.03 4.9 ...
##  $ PasAtt       : num  45 47 61 49.8 17.2 20.2 35.5 71.1 24.5 78.5 ...
##  $ PasLive      : num  34.4 44 60.3 49 16 18.3 34 68.6 23.8 65.3 ...
##  $ PasDead      : num  10.6 3.02 0.73 0.85 1.11 1.98 1.45 2.52 0.69 13.2 ...
##  $ PasFK        : num  0.84 2.45 0.58 0.64 0.12 0.17 0.81 2.2 0 2.39 ...
##  $ TB           : num  0.06 0 0.04 0.18 0.12 0.15 0 0.07 0 0.12 ...
##  $ PasPress     : num  5.41 5.68 8.03 9.48 5.19 5.92 1.61 5.96 8.97 6.77 ...
##  $ Sw           : num  0.59 1.66 0.8 1.49 0.25 0.32 1.13 2.66 0.34 4.22 ...
##  $ PasCrs       : num  1.41 0.06 0.36 0.79 0.25 0.7 0 0.04 2.07 4.94 ...
##  $ CK           : num  0 0 0 0.03 0 0 0 0 0 1.59 ...
##  $ CkIn         : num  0 0 0 0 0 0 0 0 0 0.04 ...
##  $ CkOut        : num  0 0 0 0 0 0 0 0 0 1.16 ...
##  $ CkStr        : num  0 0 0 0 0 0 0 0 0 0.04 ...
##  $ PasGround    : num  26.5 35.3 52.6 37.6 10.7 12.8 28.2 52.8 12.1 46.1 ...
##  $ PasLow       : num  9.59 3.78 4.71 5.64 3.95 4.37 2.42 5.6 6.21 12.4 ...
##  $ PasHigh      : num  8.94 7.95 3.72 6.65 2.47 3.03 4.84 12.7 6.21 20.1 ...
##  $ PaswLeft     : num  4.91 31.7 4.82 4.48 3.33 1.84 29.2 54.1 4.14 58.5 ...
##  $ PaswRight    : num  29 12.1 53.1 43.9 9.38 14.1 3.55 12.2 12.1 7.69 ...
##  $ PaswHead     : num  0.91 1.48 1.9 0.73 2.35 1.92 1.77 2.77 1.72 1.87 ...
##  $ TI           : num  9.72 0.42 0.15 0.15 0 0.03 0.32 0.04 0.69 9.2 ...
##  $ PaswOther    : num  0.06 0.12 0.29 0.15 0.37 0.7 0.16 0.28 0.34 0.2 ...
##  $ PasCmp       : num  34 38.7 55.9 40.7 11.1 14.6 31.3 64.3 14.5 62.3 ...
##  $ PasOff       : num  0.22 0.15 0.07 0.21 0.25 0 0.16 0.04 0.34 0.24 ...
##  $ PasOut       : num  0.88 0.97 0.58 0.55 0.62 0.29 0.16 0.78 0.69 1.16 ...
##  $ PasInt       : num  1.63 1.24 1.24 1.83 0.62 1.02 0.32 0.85 1.38 2.47 ...
##  $ PasBlocks    : num  1.75 0.88 0.84 1.68 1.11 0.7 0.16 0.43 1.72 1.63 ...
##  $ SCA          : num  1.19 0.63 1.46 2.01 2.47 2.33 0.48 0.64 1.72 2.63 ...
##  $ ScaPassLive  : num  0.84 0.42 1.09 1.49 1.48 1.63 0.16 0.5 1.03 1.67 ...
##  $ ScaPassDead  : num  0.06 0 0 0.06 0 0.03 0 0.04 0 0.64 ...
##  $ ScaDrib      : num  0.09 0.09 0 0.03 0 0.03 0 0 0.34 0.04 ...
##  $ ScaSh        : num  0.13 0.03 0.15 0.03 0.12 0.38 0.32 0.04 0.34 0.12 ...
##  $ ScaFld       : num  0.06 0 0.15 0.21 0.25 0.17 0 0.07 0 0.16 ...
##  $ ScaDef       : num  0 0.09 0.07 0.18 0.62 0.09 0 0 0 0 ...
##  $ GCA          : num  0.16 0.03 0.04 0.15 0.25 0.35 0.16 0.07 0.34 0.12 ...
##  $ GcaPassLive  : num  0.16 0 0.04 0.12 0.12 0.15 0 0.04 0 0.04 ...
##  $ GcaPassDead  : num  0 0 0 0 0 0 0 0 0 0.08 ...
##  $ GcaDrib      : num  0 0.03 0 0 0 0.03 0 0 0 0 ...
##  $ GcaSh        : num  0 0 0 0 0.12 0.09 0.16 0 0.34 0 ...
##  $ GcaFld       : num  0 0 0 0.03 0 0.09 0 0.04 0 0 ...
##  $ GcaDef       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tkl          : num  2.16 1.87 2.01 3.57 1.73 0.99 1.13 0.96 0.69 2.31 ...
##  $ TklWon       : num  1.16 1.39 1.24 2.23 0.86 0.64 0.97 0.43 0.69 1.55 ...
##  $ TklDef3rd    : num  1.56 1.24 0.91 1.49 0.37 0.29 0.81 0.64 0 1.31 ...
##  $ TklMid3rd    : num  0.59 0.6 0.91 1.71 0.86 0.44 0.16 0.32 0.34 0.6 ...
##  $ TklAtt3rd    : num  0 0.03 0.18 0.37 0.49 0.26 0.16 0 0.34 0.4 ...
##  $ TklDri       : num  1.16 0.39 0.69 1.8 0.25 0.29 0.48 0.32 0.69 0.84 ...
##  $ TklDriAtt    : num  1.81 0.82 2.15 4.97 1.36 0.67 0.65 0.71 1.03 1.39 ...
##  $ TklDri%      : num  63.8 48.1 32.2 36.2 18.2 43.5 75 45 66.7 60 ...
##  $ TklDriPast   : num  0.66 0.42 1.46 3.17 1.11 0.38 0.16 0.39 0.34 0.56 ...
##  $ Press        : num  13.6 13.6 23.4 28 23.5 13.6 5.65 5.85 23.8 11.7 ...
##  $ PresSucc     : num  3.53 4.89 6.53 7.9 4.81 4.02 2.1 1.7 6.9 3.98 ...
##  $ Press%       : num  26 35.9 27.9 28.2 20.5 29.6 37.1 29.1 29 34 ...
##  $ PresDef3rd   : num  7.97 7.61 7.19 9.27 2.59 1.34 2.26 3.05 1.72 5.3 ...
##  $ PresMid3rd   : num  4.38 5.14 12.3 15.3 10 5.71 3.23 2.41 10 4.74 ...
##  $ PresAtt3rd   : num  1.22 0.88 3.94 3.41 10.9 6.56 0.16 0.39 12.1 1.67 ...
##  $ Blocks       : num  2.69 1.87 0.99 1.68 1.23 0.96 2.1 1.7 1.03 1.43 ...
##  $ BlkSh        : num  0.69 0.79 0.04 0.09 0.37 0.32 0.81 1.06 0 0.16 ...
##  $ BlkShSv      : num  0.03 0.06 0 0 0 0 0.32 0 0 0 ...
##   [list output truncated]

As you can see, there is too much information. Analyzing a dataset with so many variables is very impractical. But there are some overall insights we can obtain:

An important part of any clustering analysis is making sure that no NAs occur:

colSums(is.na(stats_180))
##            Rk        Player        Nation           Pos         Squad 
##             0             0             0             0             0 
##          Comp           Age          Born            MP        Starts 
##             0             0             0             0             0 
##           Min           90s         Goals         Shots           SoT 
##             0             0             0             0             0 
##          SoT%          G/Sh         G/SoT       ShoDist         ShoFK 
##             0             0             0             0             0 
##         ShoPK         PKatt     PasTotCmp     PasTotAtt    PasTotCmp% 
##             0             0             0             0             0 
##    PasTotDist PasTotPrgDist     PasShoCmp     PasShoAtt    PasShoCmp% 
##             0             0             0             0             0 
##     PasMedCmp     PasMedAtt    PasMedCmp%     PasLonCmp     PasLonAtt 
##             0             0             0             0             0 
##    PasLonCmp%       Assists        PasAss        Pas3rd           PPA 
##             0             0             0             0             0 
##         CrsPA       PasProg        PasAtt       PasLive       PasDead 
##             0             0             0             0             0 
##         PasFK            TB      PasPress            Sw        PasCrs 
##             0             0             0             0             0 
##            CK          CkIn         CkOut         CkStr     PasGround 
##             0             0             0             0             0 
##        PasLow       PasHigh      PaswLeft     PaswRight      PaswHead 
##             0             0             0             0             0 
##            TI     PaswOther        PasCmp        PasOff        PasOut 
##             0             0             0             0             0 
##        PasInt     PasBlocks           SCA   ScaPassLive   ScaPassDead 
##             0             0             0             0             0 
##       ScaDrib         ScaSh        ScaFld        ScaDef           GCA 
##             0             0             0             0             0 
##   GcaPassLive   GcaPassDead       GcaDrib         GcaSh        GcaFld 
##             0             0             0             0             0 
##        GcaDef           Tkl        TklWon     TklDef3rd     TklMid3rd 
##             0             0             0             0             0 
##     TklAtt3rd        TklDri     TklDriAtt       TklDri%    TklDriPast 
##             0             0             0             0             0 
##         Press      PresSucc        Press%    PresDef3rd    PresMid3rd 
##             0             0             0             0             0 
##    PresAtt3rd        Blocks         BlkSh       BlkShSv       BlkPass 
##             0             0             0             0             0 
##           Int       Tkl+Int           Clr           Err       Touches 
##             0             0             0             0             0 
##     TouDefPen     TouDef3rd     TouMid3rd     TouAtt3rd     TouAttPen 
##             0             0             0             0             0 
##       TouLive       DriSucc        DriAtt      DriSucc%       DriPast 
##             0             0             0             0             0 
##       DriMegs       Carries    CarTotDist    CarPrgDist       CarProg 
##             0             0             0             0             0 
##        Car3rd           CPA        CarMis        CarDis       RecTarg 
##             0             0             0             0             0 
##           Rec          Rec%       RecProg          CrdY          CrdR 
##             0             0             0             0             0 
##         2CrdY           Fls           Fld           Off           Crs 
##             0             0             0             0             0 
##          TklW         PKwon         PKcon            OG         Recov 
##             0             0             0             0             0 
##        AerWon       AerLost       AerWon% 
##             0             0             0

As we can see, there are no NA values. The author of the Kaggle dataset has probably already handled them.
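For a table this wide, the same check can be condensed into a single value plus the list of affected columns. A minimal sketch on a small made-up data frame (the toy object and its values are hypothetical, not part of the dataset):

```r
# Toy data frame standing in for stats_180 (hypothetical values)
toy <- data.frame(Goals = c(0.1, NA, 0.3), Shots = c(1.2, 0.8, 2.1))

# TRUE if at least one missing value is present anywhere in the table
has_na <- anyNA(toy)

# Names of the columns that contain missing values
na_cols <- names(which(colSums(is.na(toy)) > 0))

print(has_na)
print(na_cols)
```

On the real data, anyNA(stats_180) returning FALSE would say the same thing as the long table of zeros above.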

Handling number of variables

Our dataset has nearly 150 columns. That is an impressive number, but plenty of them are too detailed or provide no valuable information. In order to choose the best ones I studied every column. What’s more, I have been a football fan for more than 12 years (the majority of my life!). Thanks to that experience I was able to choose the most valuable variables.

After plenty of testing (including calculating the Hopkins statistic for each candidate set of columns) I decided to limit the number of variables from 143 to 8. Part of a Data Scientist’s job is using previous experience and knowledge to better understand and solve problems. So I did exactly that here!

less_vars <- c('Shots', 'PasTotAtt', 'Assists', 'Tkl', 'Blocks',  'Fls', 'Off', 'AerWon%')

I’ve tried to choose variables that are as diverse and valuable as possible. Let’s decode them:

less_stats <- stats_180[,less_vars]

By creating a new data frame with fewer variables, I could easily continue my analysis.

Hopkins statistic

In order to check how clusterable my dataset is, I’ll compute the Hopkins statistic. It compares nearest-neighbour distances measured from randomly sampled points of the dataset with nearest-neighbour distances measured from synthetic points generated uniformly in the feature space. In the usual convention, a value close to 1 indicates a strong clustering tendency, while a value near 0.5 suggests the data is uniformly distributed and unlikely to form meaningful clusters.

hopkins(less_stats, n=nrow(less_stats)-1)
## Warning in hopkins(less_stats, n = nrow(less_stats) - 1): Package `clustertend`
## is deprecated.  Use package `hopkins` instead.

## $H
## [1] 0.2225858

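At first glance the value of the Hopkins statistic looks pretty low. Read in the convention described above, it would suggest that our data is uniformly distributed or random. However, the hopkins function from clustertend reports the statistic the other way round: it puts the nearest-neighbour distances of the real points in the numerator, so values well below 0.5 indicate a clustering tendency and values around 0.5 indicate uniform data. Our 0.22 therefore corresponds to about 0.78 in the usual convention, which points to a reasonable, though not extreme, clustering tendency. With this many observations and variables, a more extreme value would be uncommon.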
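To make the definition concrete, the statistic can be computed by hand. The sketch below is a simplified illustration, not the code of any package: it uses raw Euclidean distances (no dimension exponent), made-up data, and hypothetical helper names (nearest_dist, hopkins_sketch). It returns the value in both conventions found in practice, so it is worth checking in the documentation which one a given package reports.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Distance from each row of `pts` to its nearest row of `X`.
# If `idx` is given, row i of `pts` is row idx[i] of X and is excluded
# from the search so that a point is not matched with itself.
nearest_dist <- function(pts, X, idx = NULL) {
  sapply(seq_len(nrow(pts)), function(i) {
    d <- sqrt(rowSums((X - matrix(pts[i, ], nrow(X), ncol(X), byrow = TRUE))^2))
    if (!is.null(idx)) d <- d[-idx[i]]
    min(d)
  })
}

hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  idx <- sample(nrow(X), m)                          # m real observations
  w <- nearest_dist(X[idx, , drop = FALSE], X, idx)  # real -> nearest real
  # m synthetic points drawn uniformly within the range of each column
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- nearest_dist(U, X)                            # synthetic -> nearest real
  c(near_one_is_clustered  = sum(u) / (sum(u) + sum(w)),
    near_zero_is_clustered = sum(w) / (sum(u) + sum(w)))
}

# Two well-separated blobs versus uniform noise (made-up data)
blobs <- rbind(matrix(rnorm(200, 0, 0.3), ncol = 2),
               matrix(rnorm(200, 5, 0.3), ncol = 2))
noise <- matrix(runif(400), ncol = 2)

h_blobs <- hopkins_sketch(blobs, 30)
h_noise <- hopkins_sketch(noise, 30)
print(round(h_blobs, 2))
print(round(h_noise, 2))
```

The two returned numbers always sum to 1, which is why the same dataset can look "bad" or "good" depending on the convention used to read it.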

Data Visualisation

In order to present some of the statistics on charts, I’ll create a separate data frame for that purpose.

foot_vars <- c("Min", "90s", "Goals", "PasTotAtt", "AerWon%")
player_stats <- stats_180[,foot_vars]

Minutes played

The first thing we want to visualize is playing time. It’s important to check that our results won’t be biased by footballers who played very little.

We can observe that playing time is more or less evenly distributed. The biggest differences are visible at the beginning and end of the plot, which is pretty intuitive. Those differences aren’t big enough to influence our analysis strongly.

Goals

The next statistic we will dig into is goals scored. In order to visualize it better, I will multiply “goals per 90 minutes” by “90s played”. That way I am able to present the top goalscorers.

From this chart we see that the largest group of players didn’t score a single goal (nearly 1000 footballers). This is fairly intuitive because goalkeepers and defenders rarely celebrate goals.

The plot has an extremely long right tail. Those isolated values aren’t errors. They are simply superstar forwards like Lewandowski or Haaland.
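The conversion from a per-90 rate back to a season total described above is a single multiplication. A minimal sketch on three made-up players (the values and the GoalsTotal column name are hypothetical; only the Goals and 90s column names come from the dataset):

```r
# Hypothetical per-90 figures for three players (not taken from the dataset)
toy <- data.frame(
  Player = c("A", "B", "C"),
  Goals  = c(0.50, 0.00, 0.25),   # goals per 90 minutes
  `90s`  = c(30, 12.5, 8),        # minutes played divided by 90
  check.names = FALSE             # keep the non-syntactic column name "90s"
)

# Season total = rate per 90 minutes * number of 90-minute units played
toy$GoalsTotal <- round(toy$Goals * toy$`90s`)

# Share of players without a single goal
share_scoreless <- mean(toy$GoalsTotal == 0)

print(toy)
print(share_scoreless)
```

Rounding is used because the per-90 rates are stored with two decimals, so the product is rarely an exact integer.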

Passes

Having talked about scoring goals, let’s now check how many passes players attempt per 90 minutes. This plot should be closer to a normal distribution because passing isn’t a trait of midfielders only.

As predicted, the distribution of pass attempts is closer to normal. It also has a longer right tail, but it isn’t as extreme as in the case of goals scored. Once again, “the outliers” are usually world-class players. Interestingly, some defenders have some of the highest values (like Sergio Ramos or Jordi Alba). Still, the most frequent passers are midfielders, for example Toni Kroos and Marco Verratti.

Aerial duels

We have analyzed the domain of forwards and midfielders. Now it’s time for defenders! The last statistic we will analyze is the percentage of aerial duels won. Stereotypically, the tallest players in each team are defenders. Their role is to protect the goal, often by winning aerial duels in the penalty area.

With every plot we are a step closer to the shape of a normal distribution! This time we have a “problem” at the other end: nearly 200 players didn’t win a single aerial duel! This may sound ridiculous, but once you discover that some players are only 150-160 cm tall, it becomes more believable. Such footballers have little chance of winning an aerial duel against a 190 cm opposing defender. As for the players with 100%, they are almost all defenders.

Correlation Analysis

A crucial part of an analysis such as clustering is checking correlations between variables. If they are too high, they will negatively impact the quality of the output. In our case the situation is a little easier: at the beginning I reduced the number of variables from 143 to only 8, and they all describe different characteristics of players.

In order to prepare a visually appealing heatmap, I have to calculate the correlations between my variables. Having done this, I’ll convert the correlation matrix into a molten (long-format) data frame using the melt function from the reshape2 package.

stats_cor <- cor(less_stats)
cor_long <- melt(stats_cor)

With the data ready, I am able to create a correlation heatmap. It’ll visualize the relationships between variables in a very intuitive and clean way.

As we can see on the heatmap, no extreme correlations occur. The two strongest correlations both involve offsides (Off). The first is a positive correlation with shots (Shots). This can be explained by the fact that most shots are taken by forwards, who play furthest up the pitch and are the most likely to be caught offside.

The variable negatively correlated with offsides is pass attempts (PasTotAtt). This can also be explained by player positions. Most passes, as described in the Data Visualisation section, are made by midfielders and defenders. Those players stay further back and are rarely caught in an offside position.

We can conclude that no strong correlations occur, so they won’t disturb the results of our analysis.
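Besides reading the heatmap, the strongest pairs can be listed directly from the correlation matrix. A minimal sketch in base R on made-up data, where Off is deliberately constructed to move with Shots; the 0.8 cut-off is an assumed rule of thumb, not a value used in this report:

```r
set.seed(1)  # arbitrary seed for the made-up data

# Made-up stand-in for less_stats: Off is built to move with Shots
n <- 200
Shots <- rnorm(n)
toy <- data.frame(Shots = Shots,
                  PasTotAtt = rnorm(n),
                  Off = 0.6 * Shots + rnorm(n, sd = 0.8))

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)

# Flag pairs above the assumed threshold of 0.8
too_high <- cor_pairs[abs(cor_pairs$r) > 0.8, ]
print(nrow(too_high))
```

Applied to stats_cor, the same steps would return the Off/Shots and Off/PasTotAtt pairs at the top of the list if the reading of the heatmap above is right.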

CLUSTERING ANALYSIS

Pre-clustering details

Scaling the data

Scaling the data is necessary. Otherwise a variable such as “AerWon%” would have a stronger impact on the results of the analysis due to its format (range from 0 to 100). The remaining variables are calculated per 90 minutes.

stats_scaled <- scale(less_stats)
head(stats_scaled, 5)
##        Shots  PasTotAtt    Assists        Tkl     Blocks         Fls        Off
## 1 -0.7884370  0.1163052 -0.2260469 0.51349324  1.7739078 -0.48631207 -0.5174874
## 2 -0.6551024  0.2465772 -0.7275041 0.20733175  0.6409632 -0.05024222 -0.6202879
## 3 -0.5320244  1.1584811 -0.7275041 0.35513385 -0.5748797  0.39904187 -0.6202879
## 4 -0.2756119  0.4289579 -0.2260469 2.00207151  0.3784517  0.08190016 -0.5174874
## 6  1.0679899 -1.6944754  0.2754102 0.05952965 -0.2432862  1.16546767  5.7190751
##      AerWon%
## 1 -1.0492802
## 2  0.8244920
## 3  0.3295333
## 4  0.3295333
## 6 -0.4028035
summary(stats_scaled)
##      Shots           PasTotAtt           Assists             Tkl          
##  Min.   :-1.2090   Min.   :-2.07878   Min.   :-0.7275   Min.   :-1.76688  
##  1st Qu.:-0.7782   1st Qu.:-0.72395   1st Qu.:-0.7275   1st Qu.:-0.67948  
##  Median :-0.2961   Median :-0.08236   Median :-0.3096   Median :-0.02493  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6680   3rd Qu.: 0.62437   3rd Qu.: 0.4426   3rd Qu.: 0.67185  
##  Max.   : 4.0219   Max.   : 3.73787   Max.   :10.1374   Max.   : 4.44081  
##      Blocks              Fls                Off             AerWon%       
##  Min.   :-1.94270   Min.   :-1.76809   Min.   :-0.6203   Min.   :-2.1958  
##  1st Qu.:-0.65432   1st Qu.:-0.65810   1st Qu.:-0.6203   1st Qu.:-0.5139  
##  Median : 0.03304   Median :-0.05685   Median :-0.4490   Median : 0.1048  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.66860   3rd Qu.: 0.58404   3rd Qu.: 0.2021   3rd Qu.: 0.7285  
##  Max.   : 5.26946   Max.   : 4.54831   Max.   : 6.4729   Max.   : 2.8548

After scaling, every variable has mean 0 and standard deviation 1, so all of them contribute to the distances on a comparable scale. That is exactly what we need for clustering.
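What scale() does by default can be written out in one line. A minimal sketch on two made-up columns (the toy values are not from the dataset) comparing the built-in function with the explicit formula:

```r
# Two made-up columns on very different scales
toy <- data.frame(Shots = c(0.5, 1.0, 2.5, 4.0),
                  `AerWon%` = c(20, 50, 65, 100),
                  check.names = FALSE)

scaled_auto <- scale(toy)

# The same transformation written out: subtract the column mean,
# then divide by the column standard deviation
scaled_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

print(round(scaled_auto, 3))
print(all.equal(as.vector(scaled_manual), as.vector(scaled_auto)))
```

Without this step, Euclidean distances would be dominated by whichever column has the largest numeric range.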

Optimal number of clusters

In order to find the optimal number of clusters, we will use the NbClust and Optimal_Clusters_KMeans functions.

First comes NbClust:

opt_clusters <- NbClust(stats_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'complete',
                        index = 'silhouette')
opt_clusters$All.index
##      2      3      4      5      6      7      8      9     10 
## 0.6339 0.0603 0.0639 0.0672 0.0370 0.0374 0.0321 0.0705 0.0872
opt_clusters$Best.nc
## Number_clusters     Value_Index 
##          2.0000          0.6339

According to NbClust, the best number of clusters is 2. The difference is substantial: its silhouette index (0.63) is at least seven times higher than for any other option. Keep in mind that method = 'complete' makes NbClust evaluate hierarchical clustering with complete linkage rather than k-means.

Next we will use the Optimal_Clusters_KMeans function. It will be run 3 times: once with the default criterion (variance_explained), then with silhouette, and finally with AIC. Each call has the maximum number of clusters set to 10.

The decision about the number of clusters is not easy. NbClust very strongly suggested only 2, but in my opinion such a small number won’t be especially informative or insightful. That’s why I’ve chosen to conduct my analysis with 3 clusters. All three Optimal_Clusters_KMeans plots suggest that number. Silhouette shows the highest value for 3 clusters. On the variance_explained plot I’m looking for the well-known “elbow point” ;) Drops in values are significant up to 3, then they become visibly smaller. When it comes to AIC, the lower the value, the better the fit. Given the chart and the other information, 3 is the optimal choice here as well.

To sum up this section: the final number of clusters is 3.
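The elbow and silhouette criteria discussed above can also be computed directly with kmeans and the silhouette function from the cluster package. The sketch below is an illustration on made-up data with three groups, not the code behind the plots in this report; the toy, results and best_k names are hypothetical.

```r
library(cluster)  # for silhouette()

set.seed(123)  # arbitrary seed so the sketch is repeatable

# Made-up data with three groups (stand-in for stats_scaled)
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
             cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5)))
d <- dist(toy)

ks <- 2:6
results <- data.frame(k = ks, wss = NA_real_, avg_sil = NA_real_)
for (i in seq_along(ks)) {
  km <- kmeans(toy, centers = ks[i], nstart = 25)
  results$wss[i] <- km$tot.withinss                        # elbow criterion
  results$avg_sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

print(round(results, 3))
best_k <- results$k[which.max(results$avg_sil)]
print(best_k)
```

The wss column is what an elbow plot displays: one looks for the k after which the drop becomes visibly smaller, while avg_sil is simply maximised.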

Clustering (finally)

As the method for our clustering analysis I’m choosing k-means. Why? It is:

That being said, let’s cluster some data!

cluster_km <- kmeans(stats_scaled, 3)
fviz_cluster(cluster_km, geom = "point", stats_scaled) + ggtitle("Number of cluster - 3")

On the graph we see 3 groups. Cluster 1 stands out from the rest: almost all of its observations are tightly concentrated far from the others. There is no clear gap between clusters 2 and 3. The blue one (3) seems to have lower variance (high density and no strong “outliers”). The green cluster (2) looks the most spread out, mainly because of some extreme observations which extend its area on the plot. My guess is that these players are superstars whose exceptional statistics make them look like outliers.

To check how well each observation fits into its assigned cluster, we will calculate and visualize the Silhouette width. It ranges from -1 to 1, and we can interpret its values as follows:

Knowing the theory behind the Silhouette width, we can calculate and visualize it.
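One practical note: kmeans starts from random centres, so both the assignment and the cluster numbers (1, 2, 3) can change between runs unless a seed is fixed, and the call shown above does not display one. A minimal sketch on made-up data, assuming an arbitrary seed and nstart = 25 (neither value comes from this report):

```r
set.seed(2022)  # arbitrary seed for the made-up data

# Made-up stand-in for stats_scaled: three shifted groups of 60 points
toy <- scale(rbind(matrix(rnorm(120, 0), ncol = 2),
                   matrix(rnorm(120, 3), ncol = 2),
                   matrix(rnorm(120, 6), ncol = 2)))

# nstart = 25 tries 25 random initialisations and keeps the best solution
set.seed(2022)
km_a <- kmeans(toy, centers = 3, nstart = 25)
set.seed(2022)
km_b <- kmeans(toy, centers = 3, nstart = 25)

# Same seed, same input: identical assignment and cluster numbering
print(identical(km_a$cluster, km_b$cluster))
print(km_a$size)
```

Fixing the seed matters here because the interpretation later refers to clusters by number.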

sil<-silhouette(cluster_km$cluster, dist(stats_scaled))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  173          0.87
## 2       2  723          0.18
## 3       3 1454          0.36

The first thing we can comment on is cluster size. Without any doubt the biggest one is Cluster 3. It includes nearly 62% of all observations. Its average Silhouette width is 0.36, which is a moderate value. At the other extreme is Cluster 1: it covers just over 7% of players but has an incredibly high Silhouette width of 0.87. Such a value indicates that this is our best cluster in terms of fit. The last cluster, number 2, is not that good. It groups nearly 31% of players with an average width of 0.18. What’s more, that cluster has some observations below 0, which means that they lie closer to a neighbouring cluster than to their own and are hard to assign.

Additional quality measures
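The Silhouette width itself is simple enough to compute by hand, which helps when reading the numbers above. A minimal sketch in base R on five made-up points; silhouette_width is a hypothetical helper written for illustration, while the report itself relies on the silhouette function from the cluster package:

```r
# Silhouette width for observation i:
#   a = mean distance to the other members of its own cluster
#   b = smallest mean distance to the members of any other cluster
#   s = (b - a) / max(a, b)
silhouette_width <- function(d, cluster) {
  d <- as.matrix(d)
  sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                      # exclude the point itself
    if (!any(own)) return(0)             # singleton cluster: width set to 0
    a <- mean(d[i, own])
    b <- min(sapply(setdiff(unique(cluster), cluster[i]),
                    function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

# Five made-up points on a line, split into two clusters
x <- c(0, 1, 2, 10, 11)
cl <- c(1, 1, 1, 2, 2)
s <- silhouette_width(dist(x), cl)
print(round(s, 3))

# Average width per cluster, the quantity reported as ave.sil.width
print(round(tapply(s, cl, mean), 3))
```

A negative width appears whenever a is larger than b, that is, when a point is on average closer to another cluster than to its own.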

Calinski-Harabasz

In short, it evaluates the compactness and separation of clusters. A higher CH index value indicates better-defined clusters, as it reflects high between-cluster variance and low within-cluster variance. We will calculate the CH index for our choice (3 clusters) and additionally for two more options: 2 and 4 clusters.

## [1] "Calinski-Harabasz index for 3 clusters"

## [1] 823.42

## [1] "Calinski-Harabasz index for 2 clusters"

## [1] 688.7

## [1] "Calinski-Harabasz index for 4 clusters"

## [1] 702.73

The highest value of the CH index was obtained for 3 clusters. This means that, among the options compared, our original choice provides the most compact and well-separated clustering structure.
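The code behind these three numbers is not shown, so the following is only a sketch of how the index can be obtained from a kmeans result, using the standard formula CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)). The ch_index helper and the toy data are hypothetical.

```r
# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

set.seed(7)  # arbitrary seed for the made-up data
toy <- rbind(matrix(rnorm(100, 0, 0.5), ncol = 2),
             matrix(rnorm(100, 4, 0.5), ncol = 2),
             cbind(rnorm(50, 0, 0.5), rnorm(50, 4, 0.5)))

# Compare 2, 3 and 4 clusters, as in the report
ch <- sapply(2:4, function(k) {
  ch_index(kmeans(toy, centers = k, nstart = 25), n = nrow(toy))
})
names(ch) <- paste0("k=", 2:4)
print(round(ch, 1))
```

Because the index divides by k - 1 and n - k, it penalises adding clusters that reduce the within-cluster sum of squares only slightly.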

Shadow statistics

The shadow statistic is used to evaluate the quality of clustering. It is similar in concept to the Silhouette width: it provides insight into how well an individual data point fits into its assigned cluster compared to the other clusters.

Its interpretation is analogous to that of the Silhouette:

for_shadow <- cclust(stats_scaled, 3, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'

## Also defined by 'flexclust'

## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'

## Also defined by 'flexclust'
shadow(for_shadow)
##         1         2         3 
## 0.1568998 0.6779121 0.7601752
plot(shadow(for_shadow))

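So here are the results. In flexclust, the shadow value of a point is twice its distance to the closest centroid divided by the sum of its distances to the closest and second-closest centroids. Values near 0 therefore mean that a point sits close to its own centroid, while values near 1 mean that it is almost equidistant from two centroids. Read this way, Cluster 1 has very low shadow values (about 0.16 on average), which marks it as the most compact and best separated group. In Cluster 3 most points have high shadow values (around 0.7 to 0.8), so many of its players lie close to the border with a neighbouring cluster. Cluster 2 shows a wide range of values, with an average of about 0.68.

Why does this graph look so different from the Silhouette one?
The main reason is that the scale is reversed: a high Silhouette width and a low shadow value both indicate a good fit. Seen this way, the two measures agree that Cluster 1 is the best-defined group, assuming cclust numbered the clusters the same way as kmeans did, which is not guaranteed. The methodologies also differ. The Silhouette width compares a point’s average distance to the members of its own cluster with its average distance to the members of the closest other cluster. The shadow statistic, on the other hand, compares a point’s distance to its own cluster’s centroid with its distance to the nearest other centroid.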
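The centroid-based comparison can be written out in a few lines. The sketch below assumes the definition given in the flexclust documentation, shadow value = 2 * d1 / (d1 + d2) with d1 and d2 the distances to the closest and second-closest centroid; shadow_values is a hypothetical helper and the data are made up.

```r
# Shadow value of a point: 2 * d1 / (d1 + d2), where d1 and d2 are its
# distances to the closest and second-closest centroid
shadow_values <- function(X, centers) {
  X <- as.matrix(X)
  apply(X, 1, function(p) {
    d <- sort(sqrt(colSums((t(centers) - p)^2)))
    2 * d[1] / (d[1] + d[2])
  })
}

# Made-up example with two centroids on a line, at 0 and at 10
centers <- matrix(c(0, 10), ncol = 1)
X <- matrix(c(0, 1, 5, 9), ncol = 1)
sv <- shadow_values(X, centers)
print(round(sv, 2))
```

Under this definition the point lying exactly on a centroid gets 0 and the point halfway between the two centroids gets 1, so smaller values mean a cleaner assignment.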

Clusters summary and interpretation

Looking at how observations are distributed across clusters without analyzing the cluster centers would be of little value. Let’s see the features of our clusters.

## [1] "Cluster Centers:"

##        Shots  PasTotAtt    Assists        Tkl     Blocks        Fls        Off
## 1 -1.1811484 -0.5707101 -0.7033491 -1.7381393 -1.9113965 -1.7163056 -0.6183072
## 2  1.1615894 -0.7829139  0.6185003 -0.5077606 -0.5071927  0.1994758  0.9994477
## 3 -0.4370636  0.4572074 -0.2238627  0.4592909  0.4796231  0.1050205 -0.4234068
##      AerWon%
## 1 -2.1771111
## 2 -0.4027337
## 3  0.4592962

Knowing the centers for every statistic allows us to give the clusters meaningful names that describe the whole dataset well. I’ll also make use of my domain knowledge in naming them.
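The centers above are expressed in standard deviations, which is convenient for comparing clusters but hard to relate to the pitch. A minimal sketch of converting them back to the original units, using the attributes that scale() stores on its result (the toy data and all object names are hypothetical):

```r
# Made-up stand-in for less_stats with two variables
toy <- data.frame(Shots = c(0.2, 0.4, 3.0, 3.4, 0.3, 3.2),
                  Tkl   = c(2.5, 2.9, 0.4, 0.6, 2.7, 0.5))
toy_scaled <- scale(toy)

set.seed(10)  # arbitrary seed
km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# scale() stores the column means and standard deviations as attributes
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")

# Undo the scaling: original = scaled * sd + mean (column by column)
centers_orig <- sweep(sweep(km$centers, 2, sigma, "*"), 2, mu, "+")
print(round(centers_orig, 2))

# Cross-check against plain group means of the unscaled data
group_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
print(group_means)
```

Both tables contain the same numbers, because a k-means center is just the mean of its members and the scaling is linear.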

Cluster 1 - Goalkeepers
(n = 173, avg. sil. = 0.87)

A very homogeneous group. Almost all of their statistics have centers at the lowest level. This can be explained by two factors. The first is their actual role during the match. A goalkeeper’s mission is to protect the team from conceding goals. They do not shoot, assist, block or get caught offside. They mainly stay in the penalty area and guard the goal. The second concerns aerial duels won. One might expect players who are often close to 2 m tall to win nearly 100% of aerial balls. The catch is in the methodology: this statistic describes aerial duels won with the HEAD. Goalkeepers mostly catch the ball with their hands, so it doesn’t count.

Cluster 2 - Offensive Players
(n = 723, avg. sil. = 0.18)

This cluster is not as well fitted as the previous one. It’s characterized by the highest number of shots and assists, two metrics directly connected with scoring goals. Players from this cluster are also often caught offside, which is due to their advanced positioning on the pitch. What’s a little surprising is that they tend to attempt the fewest passes, even fewer than goalkeepers. That may show that this group of players doesn’t need that many passes to do their job.

Cluster 3 - Defensive Players
(n = 1454, avg. sil. = 0.36)

The largest cluster, with a moderate fit. I’ve named it defensive because most statistics have the opposite sign compared to Cluster 2. It contains a large group of players: not only defenders, but also plenty of midfielders. They tend to shoot less and pass more (while having fewer assists). Clearly defensive statistics like blocks and tackles (Tkl) are also highest in this cluster. Those players also win the most aerial (head) duels, which could be influenced by the height of defenders.

Conclusions

In this report we have gone through every stage of a clustering analysis. We started with loading the data and choosing the most appropriate variables. Then came some visualizations to better understand the whole dataset. In the end, the crème de la crème: the clustering itself.

For someone interested in football the results may be pretty intuitive. This analysis also confirms that domain knowledge agrees with independent data. Using such interesting tools to describe my hobby was an awesome experience. I’ve enjoyed it a lot :)