Cezary Kuźmowicz
Football is the most popular sport on Earth. Millions of people around the globe play it on a daily basis, and most countries have their own national leagues. But the best of the best play in Europe. In football terminology, a concept often used when sharing charts and statistics is the “Top 5 European Leagues”, meaning the five best national competitions in Europe.
This group consists of the English Premier League, Spanish LaLiga, Italian Serie A, German Bundesliga and French Ligue 1. In this report I will conduct a clustering analysis using player statistics from the 2021/22 season.
To achieve satisfactory results, many tests and visualizations will be presented. As the clustering method I have chosen k-means, mainly because of the medium size of the dataset and the method’s simple interpretation.
First, we have to install and load the packages needed for the whole analysis:
## Package `clustertend` is deprecated. Use package `hopkins` instead.
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
My dataset comes from Kaggle. It was prepared from data published by FBref, a website that gathers a large amount of statistics for many sports.
raw_stats <- read.csv("/Users/czarek/Downloads/2021-2022 Football Player Stats.csv", sep = ";", dec = ".",
                      check.names = FALSE)
The raw dataset includes nearly 3000 observations (individual players), each described by 143 variables! It’s important to add that most performance statistics are expressed “per 90 minutes” (totals such as matches played and minutes, and the percentage columns, are the exceptions). This means the author has already normalised the data for playing time, so that is one step fewer for us!
The first step of our data preparation is to exclude footballers who played fewer than 180 minutes over the whole season. This ensures that we don’t analyze, for example, a player who scored 4 goals in the only match he played and therefore has an extreme per-90 value.
This operation removed nearly 600 observations, leaving 2350 players. That will definitely improve the overview of our dataset and the quality of the analysis.
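The filtering chunk is not shown in the rendered report. A minimal sketch of the step, assuming the minutes column is called `Min` (as in the overview that follows) and using a small invented data frame in place of `raw_stats`; the names `toy_stats` and `stats_180` are hypothetical:

```r
# Toy stand-in for raw_stats; the values are invented for illustration only.
toy_stats <- data.frame(
  Player = c("A", "B", "C", "D"),
  Min    = c(45, 180, 2900, 179),
  Goals  = c(4.00, 0.50, 0.10, 0.00),  # per-90 values
  check.names = FALSE
)

# Keep players with at least 180 minutes (two full matches) over the season.
stats_180 <- toy_stats[toy_stats$Min >= 180, ]

# Number of observations dropped by the filter.
n_removed <- nrow(toy_stats) - nrow(stats_180)
print(stats_180)
print(n_removed)
```

The threshold is inclusive here because the summary of the filtered data reports a minimum of 180 minutes.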
Before removing some of the variables, let’s take a look at an overview of the data:
## Rk Player Nation Pos Squad Comp Age Born MP
## 1 1 Max Aarons ENG DF Norwich City Premier League 22 2000 34
## 2 2 Yunis Abdelhamid MAR DF Reims Ligue 1 34 1987 34
## 3 3 Salis Abdul Samed GHA MF Clermont Foot Ligue 1 22 2000 31
## 4 4 Laurent Abergel FRA MF Lorient Ligue 1 29 1993 34
## 6 6 Dickson Abiama NGA FW Greuther Fürth Bundesliga 23 1998 24
## 8 8 Tammy Abraham ENG FW Roma Serie A 24 1997 37
## Starts Min 90s Goals Shots SoT SoT% G/Sh G/SoT ShoDist ShoFK ShoPK PKatt
## 1 32 2881 32.0 0.00 0.41 0.06 15.4 0.00 0.00 20.5 0.00 0.00 0.00
## 2 34 2983 33.1 0.06 0.54 0.18 33.3 0.11 0.33 18.7 0.00 0.00 0.00
## 3 29 2462 27.4 0.04 0.66 0.18 27.8 0.06 0.20 20.3 0.00 0.00 0.00
## 4 34 2956 32.8 0.00 0.91 0.21 23.3 0.00 0.00 22.6 0.00 0.00 0.00
## 6 5 726 8.1 0.00 2.22 0.49 22.2 0.00 0.00 15.6 0.00 0.00 0.00
## 8 36 3084 34.3 0.50 2.71 0.93 34.4 0.15 0.44 12.2 0.06 0.09 0.09
## PasTotCmp PasTotAtt PasTotCmp% PasTotDist PasTotPrgDist PasShoCmp PasShoAtt
## 1 34.0 45.0 75.5 574.1 214.8 17.50 19.40
## 2 38.7 47.0 82.4 835.8 287.9 10.20 11.40
## 3 55.9 61.0 91.7 1033.3 184.4 22.50 24.10
## 4 40.7 49.8 81.6 780.8 206.0 16.30 18.40
## 6 11.1 17.2 64.7 160.0 40.7 6.67 9.63
## 8 14.6 20.2 72.0 224.9 51.2 8.69 11.40
## PasShoCmp% PasMedCmp PasMedAtt PasMedCmp% PasLonCmp PasLonAtt PasLonCmp%
## 1 90.0 13.10 17.00 77.0 3.06 6.78 45.2
## 2 89.9 22.40 25.00 89.4 5.65 9.15 61.7
## 3 93.5 25.80 27.20 94.9 6.72 7.81 86.0
## 4 88.6 17.30 19.60 87.9 6.25 9.39 66.6
## 6 69.2 3.09 4.69 65.8 0.62 1.23 50.0
## 8 76.2 4.05 5.86 69.2 1.28 1.60 80.0
## Assists PasAss Pas3rd PPA CrsPA PasProg PasAtt PasLive PasDead PasFK TB
## 1 0.06 0.59 1.56 1.13 0.25 2.94 45.0 34.4 10.60 0.84 0.06
## 2 0.00 0.24 2.45 0.18 0.00 2.72 47.0 44.0 3.02 2.45 0.00
## 3 0.00 0.55 2.81 0.47 0.04 2.96 61.0 60.3 0.73 0.58 0.04
## 4 0.06 0.91 3.87 0.58 0.18 4.18 49.8 49.0 0.85 0.64 0.18
## 6 0.12 0.99 0.86 0.74 0.00 1.60 17.2 16.0 1.11 0.12 0.12
## 8 0.12 1.02 1.08 0.82 0.12 1.78 20.2 18.3 1.98 0.17 0.15
## PasPress Sw PasCrs CK CkIn CkOut CkStr PasGround PasLow PasHigh PaswLeft
## 1 5.41 0.59 1.41 0.00 0 0 0 26.5 9.59 8.94 4.91
## 2 5.68 1.66 0.06 0.00 0 0 0 35.3 3.78 7.95 31.70
## 3 8.03 0.80 0.36 0.00 0 0 0 52.6 4.71 3.72 4.82
## 4 9.48 1.49 0.79 0.03 0 0 0 37.6 5.64 6.65 4.48
## 6 5.19 0.25 0.25 0.00 0 0 0 10.7 3.95 2.47 3.33
## 8 5.92 0.32 0.70 0.00 0 0 0 12.8 4.37 3.03 1.84
## PaswRight PaswHead TI PaswOther PasCmp PasOff PasOut PasInt PasBlocks SCA
## 1 29.00 0.91 9.72 0.06 34.0 0.22 0.88 1.63 1.75 1.19
## 2 12.10 1.48 0.42 0.12 38.7 0.15 0.97 1.24 0.88 0.63
## 3 53.10 1.90 0.15 0.29 55.9 0.07 0.58 1.24 0.84 1.46
## 4 43.90 0.73 0.15 0.15 40.7 0.21 0.55 1.83 1.68 2.01
## 6 9.38 2.35 0.00 0.37 11.1 0.25 0.62 0.62 1.11 2.47
## 8 14.10 1.92 0.03 0.70 14.6 0.00 0.29 1.02 0.70 2.33
## ScaPassLive ScaPassDead ScaDrib ScaSh ScaFld ScaDef GCA GcaPassLive
## 1 0.84 0.06 0.09 0.13 0.06 0.00 0.16 0.16
## 2 0.42 0.00 0.09 0.03 0.00 0.09 0.03 0.00
## 3 1.09 0.00 0.00 0.15 0.15 0.07 0.04 0.04
## 4 1.49 0.06 0.03 0.03 0.21 0.18 0.15 0.12
## 6 1.48 0.00 0.00 0.12 0.25 0.62 0.25 0.12
## 8 1.63 0.03 0.03 0.38 0.17 0.09 0.35 0.15
## GcaPassDead GcaDrib GcaSh GcaFld GcaDef Tkl TklWon TklDef3rd TklMid3rd
## 1 0 0.00 0.00 0.00 0 2.16 1.16 1.56 0.59
## 2 0 0.03 0.00 0.00 0 1.87 1.39 1.24 0.60
## 3 0 0.00 0.00 0.00 0 2.01 1.24 0.91 0.91
## 4 0 0.00 0.00 0.03 0 3.57 2.23 1.49 1.71
## 6 0 0.00 0.12 0.00 0 1.73 0.86 0.37 0.86
## 8 0 0.03 0.09 0.09 0 0.99 0.64 0.29 0.44
## TklAtt3rd TklDri TklDriAtt TklDri% TklDriPast Press PresSucc Press%
## 1 0.00 1.16 1.81 63.8 0.66 13.6 3.53 26.0
## 2 0.03 0.39 0.82 48.1 0.42 13.6 4.89 35.9
## 3 0.18 0.69 2.15 32.2 1.46 23.4 6.53 27.9
## 4 0.37 1.80 4.97 36.2 3.17 28.0 7.90 28.2
## 6 0.49 0.25 1.36 18.2 1.11 23.5 4.81 20.5
## 8 0.26 0.29 0.67 43.5 0.38 13.6 4.02 29.6
## PresDef3rd PresMid3rd PresAtt3rd Blocks BlkSh BlkShSv BlkPass Int Tkl+Int
## 1 7.97 4.38 1.22 2.69 0.69 0.03 2.00 1.75 3.91
## 2 7.61 5.14 0.88 1.87 0.79 0.06 1.09 3.11 4.98
## 3 7.19 12.30 3.94 0.99 0.04 0.00 0.95 1.86 3.87
## 4 9.27 15.30 3.41 1.68 0.09 0.00 1.59 2.56 6.13
## 6 2.59 10.00 10.90 1.23 0.37 0.00 0.86 0.99 2.72
## 8 1.34 5.71 6.56 0.96 0.32 0.00 0.64 0.44 1.43
## Clr Err Touches TouDefPen TouDef3rd TouMid3rd TouAtt3rd TouAttPen TouLive
## 1 2.19 0 58.0 5.06 23.30 23.8 15.0 0.91 47.8
## 2 3.20 0 57.3 8.28 32.80 25.7 2.9 0.85 54.5
## 3 0.55 0 70.4 2.01 22.70 41.8 10.9 0.62 69.9
## 4 0.34 0 61.6 0.67 13.70 40.3 11.6 0.46 60.9
## 6 0.86 0 33.0 1.11 3.46 15.6 15.6 3.83 31.9
## 8 0.90 0 32.4 0.96 2.54 16.3 15.2 5.86 30.4
## DriSucc DriAtt DriSucc% DriPast DriMegs Carries CarTotDist CarPrgDist CarProg
## 1 1.03 2.44 42.3 1.09 0.19 33.9 199.4 121.7 5.44
## 2 0.48 0.66 72.7 0.48 0.03 35.7 204.7 115.5 2.75
## 3 0.99 1.53 64.3 1.09 0.07 53.5 246.5 106.3 2.85
## 4 1.28 1.98 64.6 1.34 0.09 45.7 171.9 86.4 2.87
## 6 0.74 2.22 33.3 0.86 0.12 19.0 74.7 40.2 2.59
## 8 1.08 2.22 48.7 1.14 0.09 18.0 76.8 39.4 2.45
## Car3rd CPA CarMis CarDis RecTarg Rec Rec% RecProg CrdY CrdR 2CrdY Fls Fld
## 1 1.66 0.41 0.84 0.94 36.0 32.4 89.9 1.28 0.25 0.00 0.00 0.97 1.84
## 2 0.73 0.00 0.45 0.39 37.5 36.3 96.9 0.36 0.15 0.03 0.00 1.30 0.73
## 3 0.73 0.15 0.84 1.46 58.6 54.2 92.5 1.72 0.44 0.11 0.07 1.64 1.28
## 4 1.13 0.09 0.85 1.46 46.3 43.0 93.0 1.86 0.27 0.00 0.00 1.40 2.07
## 6 0.99 0.86 5.06 1.36 41.4 21.1 51.0 5.93 0.37 0.00 0.00 2.22 1.48
## 8 0.82 0.67 2.39 1.28 41.2 22.4 54.3 6.71 0.26 0.00 0.00 1.25 1.46
## Off Crs TklW PKwon PKcon OG Recov AerWon AerLost AerWon%
## 1 0.03 1.41 1.16 0.00 0.06 0.03 5.53 0.47 1.59 22.7
## 2 0.00 0.06 1.39 0.00 0.03 0.00 6.77 2.02 1.36 59.8
## 3 0.00 0.36 1.24 0.00 0.00 0.00 8.76 0.88 0.88 50.0
## 4 0.03 0.79 2.23 0.00 0.00 0.00 8.87 0.43 0.43 50.0
## 6 1.85 0.25 0.86 0.00 0.00 0.00 4.81 2.72 4.94 35.5
## 8 0.50 0.70 0.64 0.03 0.03 0.00 3.67 2.39 2.89 45.3
## Rk Player Nation Pos
## Min. : 1.0 Length:2350 Length:2350 Length:2350
## 1st Qu.: 739.2 Class :character Class :character Class :character
## Median :1458.5 Mode :character Mode :character Mode :character
## Mean :1459.1
## 3rd Qu.:2170.8
## Max. :2921.0
## Squad Comp Age Born
## Length:2350 Length:2350 Min. :17.00 Min. :1981
## Class :character Class :character 1st Qu.:24.00 1st Qu.:1992
## Mode :character Mode :character Median :26.00 Median :1995
## Mean :26.76 Mean :1995
## 3rd Qu.:30.00 3rd Qu.:1998
## Max. :40.00 Max. :2004
## MP Starts Min 90s Goals
## Min. : 2.00 Min. : 0 Min. : 180.0 Min. : 2.00 Min. :0.0000
## 1st Qu.:15.00 1st Qu.: 8 1st Qu.: 764.2 1st Qu.: 8.50 1st Qu.:0.0000
## Median :24.00 Median :16 Median :1447.5 Median :16.10 Median :0.0500
## Mean :22.68 Mean :17 Mean :1520.6 Mean :16.89 Mean :0.1214
## 3rd Qu.:31.00 3rd Qu.:26 3rd Qu.:2221.2 3rd Qu.:24.70 3rd Qu.:0.1800
## Max. :38.00 Max. :38 Max. :3420.0 Max. :38.00 Max. :1.4300
## Shots SoT SoT% G/Sh
## Min. :0.000 Min. :0.0000 Min. : 0.00 Min. :0.00000
## 1st Qu.:0.420 1st Qu.:0.0700 1st Qu.: 14.30 1st Qu.:0.00000
## Median :0.890 Median :0.2400 Median : 28.00 Median :0.05000
## Mean :1.179 Mean :0.3820 Mean : 26.69 Mean :0.07494
## 3rd Qu.:1.830 3rd Qu.:0.5975 3rd Qu.: 37.90 3rd Qu.:0.13000
## Max. :5.100 Max. :2.3500 Max. :100.00 Max. :1.00000
## G/SoT ShoDist ShoFK ShoPK
## Min. :0.0000 Min. : 0.0 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:12.0 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.1700 Median :16.1 Median :0.00000 Median :0.000000
## Mean :0.2257 Mean :15.4 Mean :0.03797 Mean :0.009553
## 3rd Qu.:0.3800 3rd Qu.:20.1 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :2.0000 Max. :71.2 Max. :1.04000 Max. :0.500000
## PKatt PasTotCmp PasTotAtt PasTotCmp%
## Min. :0.00000 Min. : 7.11 Min. : 11.30 Min. :40.50
## 1st Qu.:0.00000 1st Qu.:23.30 1st Qu.: 32.10 1st Qu.:72.00
## Median :0.00000 Median :32.75 Median : 41.95 Median :78.10
## Mean :0.01221 Mean :34.26 Mean : 43.21 Mean :77.45
## 3rd Qu.:0.00000 3rd Qu.:43.08 3rd Qu.: 52.80 3rd Qu.:83.60
## Max. :0.80000 Max. :96.10 Max. :100.60 Max. :96.80
## PasTotDist PasTotPrgDist PasShoCmp PasShoAtt
## Min. : 103.7 Min. : 11.4 Min. : 0.500 Min. : 0.50
## 1st Qu.: 431.9 1st Qu.: 116.6 1st Qu.: 9.453 1st Qu.:11.60
## Median : 641.1 Median : 206.7 Median :13.500 Median :15.70
## Mean : 661.9 Mean : 217.9 Mean :14.136 Mean :16.22
## 3rd Qu.: 835.8 3rd Qu.: 299.6 3rd Qu.:17.900 3rd Qu.:20.10
## Max. :2062.9 Max. :1019.6 Max. :46.400 Max. :49.10
## PasShoCmp% PasMedCmp PasMedAtt PasMedCmp%
## Min. : 57.90 Min. : 1.300 Min. : 3.00 Min. : 27.30
## 1st Qu.: 82.72 1st Qu.: 8.402 1st Qu.:10.50 1st Qu.: 76.70
## Median : 87.50 Median :13.600 Median :16.30 Median : 84.10
## Mean : 86.73 Mean :14.470 Mean :16.92 Mean : 83.03
## 3rd Qu.: 91.30 3rd Qu.:18.800 3rd Qu.:21.88 3rd Qu.: 90.40
## Max. :100.00 Max. :46.200 Max. :48.90 Max. :100.00
## PasLonCmp PasLonAtt PasLonCmp% Assists
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. :0.00000
## 1st Qu.: 2.410 1st Qu.: 4.420 1st Qu.: 50.00 1st Qu.:0.00000
## Median : 4.360 Median : 7.515 Median : 59.15 Median :0.05000
## Mean : 4.957 Mean : 8.288 Mean : 59.53 Mean :0.08705
## 3rd Qu.: 6.990 3rd Qu.:11.100 3rd Qu.: 70.00 3rd Qu.:0.14000
## Max. :22.800 Max. :42.600 Max. :100.00 Max. :1.30000
## PasAss Pas3rd PPA CrsPA
## Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.2725 1st Qu.: 1.180 1st Qu.:0.2225 1st Qu.:0.000
## Median :0.7300 Median : 2.155 Median :0.6100 Median :0.100
## Mean :0.8339 Mean : 2.402 Mean :0.7257 Mean :0.198
## 3rd Qu.:1.2000 3rd Qu.: 3.237 3rd Qu.:1.0900 3rd Qu.:0.310
## Max. :5.0000 Max. :11.700 Max. :3.7700 Max. :2.000
## PasProg PasAtt PasLive PasDead
## Min. : 0.000 Min. : 11.30 Min. :10.10 Min. : 0.000
## 1st Qu.: 1.680 1st Qu.: 32.10 1st Qu.:28.10 1st Qu.: 1.312
## Median : 2.670 Median : 41.95 Median :37.50 Median : 2.520
## Mean : 2.794 Mean : 43.21 Mean :38.98 Mean : 4.239
## 3rd Qu.: 3.800 3rd Qu.: 52.80 3rd Qu.:47.80 3rd Qu.: 7.140
## Max. :10.300 Max. :100.60 Max. :98.10 Max. :18.500
## PasFK TB PasPress Sw
## Min. :0.000 Min. :0.00000 Min. : 0.000 Min. :0.000
## 1st Qu.:0.290 1st Qu.:0.00000 1st Qu.: 5.030 1st Qu.:0.570
## Median :0.890 Median :0.00000 Median : 6.355 Median :1.000
## Mean :1.099 Mean :0.06671 Mean : 6.444 Mean :1.225
## 3rd Qu.:1.640 3rd Qu.:0.10000 3rd Qu.: 7.810 3rd Qu.:1.640
## Max. :8.000 Max. :1.00000 Max. :15.300 Max. :6.880
## PasCrs CK CkIn CkOut
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.190 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.760 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :1.104 Mean :0.4404 Mean :0.1806 Mean :0.1645
## 3rd Qu.:1.790 3rd Qu.:0.2600 3rd Qu.:0.0600 3rd Qu.:0.0600
## Max. :9.130 Max. :5.9400 Max. :2.9100 Max. :3.3000
## CkStr PasGround PasLow PasHigh
## Min. :0.00000 Min. : 3.67 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:19.43 1st Qu.: 3.712 1st Qu.: 5.210
## Median :0.00000 Median :26.90 Median : 4.960 Median : 7.870
## Mean :0.01907 Mean :29.13 Mean : 5.644 Mean : 8.446
## 3rd Qu.:0.00000 3rd Qu.:36.40 3rd Qu.: 6.880 3rd Qu.:10.700
## Max. :1.04000 Max. :91.30 Max. :18.300 Max. :37.900
## PaswLeft PaswRight PaswHead TI
## Min. : 0.000 Min. : 0.85 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.473 1st Qu.: 9.97 1st Qu.:0.980 1st Qu.: 0.080
## Median : 5.900 Median :24.90 Median :1.640 Median : 0.270
## Mean :12.686 Mean :25.17 Mean :1.713 Mean : 1.894
## 3rd Qu.:16.875 3rd Qu.:35.60 3rd Qu.:2.370 3rd Qu.: 1.300
## Max. :75.900 Max. :91.60 Max. :6.150 Max. :15.700
## PaswOther PasCmp PasOff PasOut
## Min. :0.0000 Min. : 7.11 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1100 1st Qu.:23.30 1st Qu.:0.0400 1st Qu.:0.4700
## Median :0.2200 Median :32.75 Median :0.1200 Median :0.7000
## Mean :0.5636 Mean :34.26 Mean :0.1437 Mean :0.7497
## 3rd Qu.:0.4200 3rd Qu.:43.08 3rd Qu.:0.2100 3rd Qu.:0.9800
## Max. :7.0900 Max. :96.10 Max. :1.0000 Max. :2.6500
## PasInt PasBlocks SCA ScaPassLive
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.870 1st Qu.:0.600 1st Qu.:0.770 1st Qu.:0.580
## Median :1.300 Median :1.050 Median :1.680 Median :1.240
## Mean :1.341 Mean :1.096 Mean :1.788 Mean :1.284
## 3rd Qu.:1.760 3rd Qu.:1.540 3rd Qu.:2.538 3rd Qu.:1.810
## Max. :5.000 Max. :4.500 Max. :7.310 Max. :5.600
## ScaPassDead ScaDrib ScaSh ScaFld
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.06000 Median :0.04000
## Mean :0.1616 Mean :0.09674 Mean :0.09749 Mean :0.09969
## 3rd Qu.:0.1500 3rd Qu.:0.14000 3rd Qu.:0.15000 3rd Qu.:0.15000
## Max. :3.0000 Max. :1.39000 Max. :0.91000 Max. :1.00000
## ScaDef GCA GcaPassLive GcaPassDead
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.1500 Median :0.1000 Median :0.00000
## Mean :0.04902 Mean :0.1989 Mean :0.1358 Mean :0.01273
## 3rd Qu.:0.08000 3rd Qu.:0.3000 3rd Qu.:0.2100 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.5400 Max. :1.3000 Max. :0.56000
## GcaDrib GcaSh GcaFld GcaDef
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.01206 Mean :0.01797 Mean :0.01511 Mean :0.005119
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :0.50000 Max. :0.49000 Max. :0.50000 Max. :0.370000
## Tkl TklWon TklDef3rd TklMid3rd
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.030 1st Qu.:0.600 1st Qu.:0.3300 1st Qu.:0.3400
## Median :1.650 Median :0.975 Median :0.7400 Median :0.6100
## Mean :1.674 Mean :1.020 Mean :0.7936 Mean :0.6524
## 3rd Qu.:2.310 3rd Qu.:1.400 3rd Qu.:1.1800 3rd Qu.:0.9200
## Max. :5.880 Max. :3.390 Max. :4.1200 Max. :2.6900
## TklAtt3rd TklDri TklDriAtt TklDri%
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. : 0.00
## 1st Qu.:0.0700 1st Qu.:0.2700 1st Qu.:0.860 1st Qu.: 26.23
## Median :0.2000 Median :0.5600 Median :1.400 Median : 37.95
## Mean :0.2277 Mean :0.5959 Mean :1.477 Mean : 36.99
## 3rd Qu.:0.3400 3rd Qu.:0.8600 3rd Qu.:2.020 3rd Qu.: 50.00
## Max. :1.7600 Max. :2.6100 Max. :5.240 Max. :100.00
## TklDriPast Press PresSucc Press%
## Min. :0.0000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.4725 1st Qu.:10.30 1st Qu.: 3.180 1st Qu.: 25.70
## Median :0.8100 Median :14.80 Median : 4.340 Median : 29.70
## Mean :0.8812 Mean :14.42 Mean : 4.261 Mean : 29.22
## 3rd Qu.:1.2200 3rd Qu.:18.80 3rd Qu.: 5.490 3rd Qu.: 33.50
## Max. :4.2900 Max. :33.80 Max. :11.500 Max. :100.00
## PresDef3rd PresMid3rd PresAtt3rd Blocks
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 2.712 1st Qu.: 3.933 1st Qu.: 1.000 1st Qu.:0.9325
## Median : 4.660 Median : 6.230 Median : 2.915 Median :1.4300
## Mean : 4.545 Mean : 6.418 Mean : 3.459 Mean :1.4061
## 3rd Qu.: 6.250 3rd Qu.: 8.818 3rd Qu.: 5.395 3rd Qu.:1.8900
## Max. :13.900 Max. :19.200 Max. :15.900 Max. :5.2200
## BlkSh BlkShSv BlkPass Int
## Min. :0.0000 Min. :0.000000 Min. :0.00 Min. :0.000
## 1st Qu.:0.0600 1st Qu.:0.000000 1st Qu.:0.69 1st Qu.:0.700
## Median :0.2100 Median :0.000000 Median :1.08 Median :1.390
## Mean :0.3159 Mean :0.006277 Mean :1.09 Mean :1.379
## 3rd Qu.:0.4700 3rd Qu.:0.000000 3rd Qu.:1.47 3rd Qu.:1.970
## Max. :2.5800 Max. :0.500000 Max. :5.22 Max. :6.360
## Tkl+Int Clr Err Touches
## Min. : 0.000 Min. :0.000 Min. :0.0000 Min. : 20.60
## 1st Qu.: 1.920 1st Qu.:0.500 1st Qu.:0.0000 1st Qu.: 42.70
## Median : 3.190 Median :1.170 Median :0.0000 Median : 53.10
## Mean : 3.053 Mean :1.683 Mean :0.0208 Mean : 54.21
## 3rd Qu.: 4.180 3rd Qu.:2.458 3rd Qu.:0.0300 3rd Qu.: 64.10
## Max. :10.500 Max. :8.510 Max. :1.0000 Max. :109.60
## TouDefPen TouDef3rd TouMid3rd TouAtt3rd
## Min. : 0.000 Min. : 0.740 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1.120 1st Qu.: 6.982 1st Qu.:18.82 1st Qu.: 6.183
## Median : 2.590 Median :15.300 Median :25.30 Median :14.950
## Mean : 5.472 Mean :17.467 Mean :25.84 Mean :14.223
## 3rd Qu.: 5.928 3rd Qu.:25.975 3rd Qu.:33.00 3rd Qu.:20.600
## Max. :44.800 Max. :59.200 Max. :70.90 Max. :47.000
## TouAttPen TouLive DriSucc DriAtt
## Min. : 0.000 Min. : 15.3 Min. :0.0000 Min. :0.000
## 1st Qu.: 0.710 1st Qu.: 40.0 1st Qu.:0.2500 1st Qu.:0.470
## Median : 1.490 Median : 49.2 Median :0.6400 Median :1.210
## Mean : 2.231 Mean : 50.1 Mean :0.8075 Mean :1.516
## 3rd Qu.: 3.498 3rd Qu.: 59.3 3rd Qu.:1.1700 3rd Qu.:2.230
## Max. :11.200 Max. :105.9 Max. :5.5800 Max. :8.750
## DriSucc% DriPast DriMegs Carries
## Min. : 0.00 Min. :0.0000 Min. :0.00000 Min. : 8.67
## 1st Qu.: 41.83 1st Qu.:0.2800 1st Qu.:0.00000 1st Qu.:25.30
## Median : 52.90 Median :0.7100 Median :0.00000 Median :32.90
## Mean : 51.44 Mean :0.8881 Mean :0.07947 Mean :34.04
## 3rd Qu.: 65.67 3rd Qu.:1.2800 3rd Qu.:0.12000 3rd Qu.:41.60
## Max. :100.00 Max. :5.7700 Max. :1.67000 Max. :87.70
## CarTotDist CarPrgDist CarProg Car3rd
## Min. : 30.6 Min. : 6.19 Min. : 0.000 Min. :0.000
## 1st Qu.:120.9 1st Qu.: 59.52 1st Qu.: 2.020 1st Qu.:0.450
## Median :167.8 Median : 87.70 Median : 3.375 Median :1.020
## Mean :172.5 Mean : 92.16 Mean : 3.668 Mean :1.128
## 3rd Qu.:215.2 3rd Qu.:117.50 3rd Qu.: 4.980 3rd Qu.:1.637
## Max. :500.7 Max. :305.30 Max. :15.300 Max. :5.000
## CPA CarMis CarDis RecTarg
## Min. :0.000 Min. :0.000 Min. :0.000 Min. : 6.50
## 1st Qu.:0.000 1st Qu.:0.370 1st Qu.:0.370 1st Qu.:33.40
## Median :0.190 Median :1.000 Median :0.970 Median :41.50
## Mean :0.386 Mean :1.293 Mean :1.122 Mean :41.33
## 3rd Qu.:0.550 3rd Qu.:2.030 3rd Qu.:1.670 3rd Qu.:49.10
## Max. :4.760 Max. :9.170 Max. :6.220 Max. :94.90
## Rec Rec% RecProg CrdY
## Min. : 6.25 Min. : 38.20 Min. : 0.000 Min. :0.0000
## 1st Qu.:26.20 1st Qu.: 77.70 1st Qu.: 0.430 1st Qu.:0.0800
## Median :33.10 Median : 89.90 Median : 2.205 Median :0.1800
## Mean :34.86 Mean : 85.07 Mean : 3.098 Mean :0.2054
## 3rd Qu.:42.08 3rd Qu.: 96.30 3rd Qu.: 5.350 3rd Qu.:0.2900
## Max. :90.70 Max. :100.00 Max. :13.200 Max. :1.6000
## CrdR 2CrdY Fls Fld
## Min. :0.000000 Min. :0.000000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.840 1st Qu.:0.630
## Median :0.000000 Median :0.000000 Median :1.295 Median :1.140
## Mean :0.009855 Mean :0.004272 Mean :1.338 Mean :1.252
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:1.780 3rd Qu.:1.730
## Max. :0.510000 Max. :0.260000 Max. :4.780 Max. :5.000
## Off Crs TklW PKwon
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.190 1st Qu.:0.600 1st Qu.:0.00000
## Median :0.050 Median :0.760 Median :0.975 Median :0.00000
## Mean :0.181 Mean :1.104 Mean :1.020 Mean :0.01097
## 3rd Qu.:0.240 3rd Qu.:1.790 3rd Qu.:1.400 3rd Qu.:0.00000
## Max. :2.070 Max. :9.130 Max. :3.390 Max. :0.50000
## PKcon OG Recov AerWon
## Min. :0.0000 Min. :0.000000 Min. : 1.430 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.: 5.530 1st Qu.: 0.660
## Median :0.0000 Median :0.000000 Median : 7.500 Median : 1.310
## Mean :0.0143 Mean :0.004289 Mean : 7.516 Mean : 1.611
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.: 9.260 3rd Qu.: 2.270
## Max. :0.5000 Max. :0.500000 Max. :15.700 Max. :12.400
## AerLost AerWon%
## Min. :0.000 Min. : 0.00
## 1st Qu.:0.950 1st Qu.: 33.30
## Median :1.410 Median : 45.55
## Mean :1.702 Mean : 43.48
## 3rd Qu.:2.087 3rd Qu.: 57.90
## Max. :9.720 Max. :100.00
## 'data.frame': 2350 obs. of 143 variables:
## $ Rk : int 1 2 3 4 6 8 9 10 11 13 ...
## $ Player : chr "Max Aarons" "Yunis Abdelhamid" "Salis Abdul Samed" "Laurent Abergel" ...
## $ Nation : chr "ENG" "MAR" "GHA" "FRA" ...
## $ Pos : chr "DF" "DF" "MF" "MF" ...
## $ Squad : chr "Norwich City" "Reims" "Clermont Foot" "Lorient" ...
## $ Comp : chr "Premier League" "Ligue 1" "Ligue 1" "Ligue 1" ...
## $ Age : int 22 34 22 29 23 24 26 34 23 30 ...
## $ Born : int 2000 1987 2000 1993 1998 1997 1996 1988 1998 1991 ...
## $ MP : int 34 34 31 34 24 37 8 30 13 31 ...
## $ Starts : int 32 34 29 34 5 36 6 29 1 26 ...
## $ Min : int 2881 2983 2462 2956 726 3084 560 2536 259 2260 ...
## $ 90s : num 32 33.1 27.4 32.8 8.1 34.3 6.2 28.2 2.9 25.1 ...
## $ Goals : num 0 0.06 0.04 0 0 0.5 0 0.14 0 0.04 ...
## $ Shots : num 0.41 0.54 0.66 0.91 2.22 2.71 0.32 0.57 3.79 0.68 ...
## $ SoT : num 0.06 0.18 0.18 0.21 0.49 0.93 0 0.25 0.69 0.2 ...
## $ SoT% : num 15.4 33.3 27.8 23.3 22.2 34.4 0 43.8 18.2 29.4 ...
## $ G/Sh : num 0 0.11 0.06 0 0 0.15 0 0.25 0 0.06 ...
## $ G/SoT : num 0 0.33 0.2 0 0 0.44 0 0.57 0 0.2 ...
## $ ShoDist : num 20.5 18.7 20.3 22.6 15.6 12.2 7 9.2 15.8 22.1 ...
## $ ShoFK : num 0 0 0 0 0 0.06 0 0 0 0.04 ...
## $ ShoPK : num 0 0 0 0 0 0.09 0 0 0 0 ...
## $ PKatt : num 0 0 0 0 0 0.09 0 0 0 0 ...
## $ PasTotCmp : num 34 38.7 55.9 40.7 11.1 14.6 31.3 64.3 14.5 62.3 ...
## $ PasTotAtt : num 45 47 61 49.8 17.2 20.2 35.5 71.1 24.5 78.5 ...
## $ PasTotCmp% : num 75.5 82.4 91.7 81.6 64.7 72 88.2 90.3 59.2 79.4 ...
## $ PasTotDist : num 574 836 1033 781 160 ...
## $ PasTotPrgDist: num 214.8 287.9 184.4 206 40.7 ...
## $ PasShoCmp : num 17.5 10.2 22.5 16.3 6.67 8.69 6.77 20.1 7.59 23.5 ...
## $ PasShoAtt : num 19.4 11.4 24.1 18.4 9.63 11.4 7.42 21.6 12.1 25.3 ...
## $ PasShoCmp% : num 90 89.9 93.5 88.6 69.2 76.2 91.3 93.1 62.9 93.1 ...
## $ PasMedCmp : num 13.1 22.4 25.8 17.3 3.09 4.05 18.5 33.3 5.86 27.2 ...
## $ PasMedAtt : num 17 25 27.2 19.6 4.69 5.86 20 35.3 9.31 32 ...
## $ PasMedCmp% : num 77 89.4 94.9 87.9 65.8 69.2 92.7 94.4 63 84.9 ...
## $ PasLonCmp : num 3.06 5.65 6.72 6.25 0.62 1.28 5.81 9.93 0.34 11 ...
## $ PasLonAtt : num 6.78 9.15 7.81 9.39 1.23 1.6 7.74 12.9 1.03 18.9 ...
## $ PasLonCmp% : num 45.2 61.7 86 66.6 50 80 75 77.1 33.3 58 ...
## $ Assists : num 0.06 0 0 0.06 0.12 0.12 0 0 0.34 0.12 ...
## $ PasAss : num 0.59 0.24 0.55 0.91 0.99 1.02 0 0.21 0.69 1.63 ...
## $ Pas3rd : num 1.56 2.45 2.81 3.87 0.86 1.08 1.29 4.33 1.03 4.1 ...
## $ PPA : num 1.13 0.18 0.47 0.58 0.74 0.82 0 0.21 0.69 1.67 ...
## $ CrsPA : num 0.25 0 0.04 0.18 0 0.12 0 0 0.69 0.92 ...
## $ PasProg : num 2.94 2.72 2.96 4.18 1.6 1.78 1.45 3.12 1.03 4.9 ...
## $ PasAtt : num 45 47 61 49.8 17.2 20.2 35.5 71.1 24.5 78.5 ...
## $ PasLive : num 34.4 44 60.3 49 16 18.3 34 68.6 23.8 65.3 ...
## $ PasDead : num 10.6 3.02 0.73 0.85 1.11 1.98 1.45 2.52 0.69 13.2 ...
## $ PasFK : num 0.84 2.45 0.58 0.64 0.12 0.17 0.81 2.2 0 2.39 ...
## $ TB : num 0.06 0 0.04 0.18 0.12 0.15 0 0.07 0 0.12 ...
## $ PasPress : num 5.41 5.68 8.03 9.48 5.19 5.92 1.61 5.96 8.97 6.77 ...
## $ Sw : num 0.59 1.66 0.8 1.49 0.25 0.32 1.13 2.66 0.34 4.22 ...
## $ PasCrs : num 1.41 0.06 0.36 0.79 0.25 0.7 0 0.04 2.07 4.94 ...
## $ CK : num 0 0 0 0.03 0 0 0 0 0 1.59 ...
## $ CkIn : num 0 0 0 0 0 0 0 0 0 0.04 ...
## $ CkOut : num 0 0 0 0 0 0 0 0 0 1.16 ...
## $ CkStr : num 0 0 0 0 0 0 0 0 0 0.04 ...
## $ PasGround : num 26.5 35.3 52.6 37.6 10.7 12.8 28.2 52.8 12.1 46.1 ...
## $ PasLow : num 9.59 3.78 4.71 5.64 3.95 4.37 2.42 5.6 6.21 12.4 ...
## $ PasHigh : num 8.94 7.95 3.72 6.65 2.47 3.03 4.84 12.7 6.21 20.1 ...
## $ PaswLeft : num 4.91 31.7 4.82 4.48 3.33 1.84 29.2 54.1 4.14 58.5 ...
## $ PaswRight : num 29 12.1 53.1 43.9 9.38 14.1 3.55 12.2 12.1 7.69 ...
## $ PaswHead : num 0.91 1.48 1.9 0.73 2.35 1.92 1.77 2.77 1.72 1.87 ...
## $ TI : num 9.72 0.42 0.15 0.15 0 0.03 0.32 0.04 0.69 9.2 ...
## $ PaswOther : num 0.06 0.12 0.29 0.15 0.37 0.7 0.16 0.28 0.34 0.2 ...
## $ PasCmp : num 34 38.7 55.9 40.7 11.1 14.6 31.3 64.3 14.5 62.3 ...
## $ PasOff : num 0.22 0.15 0.07 0.21 0.25 0 0.16 0.04 0.34 0.24 ...
## $ PasOut : num 0.88 0.97 0.58 0.55 0.62 0.29 0.16 0.78 0.69 1.16 ...
## $ PasInt : num 1.63 1.24 1.24 1.83 0.62 1.02 0.32 0.85 1.38 2.47 ...
## $ PasBlocks : num 1.75 0.88 0.84 1.68 1.11 0.7 0.16 0.43 1.72 1.63 ...
## $ SCA : num 1.19 0.63 1.46 2.01 2.47 2.33 0.48 0.64 1.72 2.63 ...
## $ ScaPassLive : num 0.84 0.42 1.09 1.49 1.48 1.63 0.16 0.5 1.03 1.67 ...
## $ ScaPassDead : num 0.06 0 0 0.06 0 0.03 0 0.04 0 0.64 ...
## $ ScaDrib : num 0.09 0.09 0 0.03 0 0.03 0 0 0.34 0.04 ...
## $ ScaSh : num 0.13 0.03 0.15 0.03 0.12 0.38 0.32 0.04 0.34 0.12 ...
## $ ScaFld : num 0.06 0 0.15 0.21 0.25 0.17 0 0.07 0 0.16 ...
## $ ScaDef : num 0 0.09 0.07 0.18 0.62 0.09 0 0 0 0 ...
## $ GCA : num 0.16 0.03 0.04 0.15 0.25 0.35 0.16 0.07 0.34 0.12 ...
## $ GcaPassLive : num 0.16 0 0.04 0.12 0.12 0.15 0 0.04 0 0.04 ...
## $ GcaPassDead : num 0 0 0 0 0 0 0 0 0 0.08 ...
## $ GcaDrib : num 0 0.03 0 0 0 0.03 0 0 0 0 ...
## $ GcaSh : num 0 0 0 0 0.12 0.09 0.16 0 0.34 0 ...
## $ GcaFld : num 0 0 0 0.03 0 0.09 0 0.04 0 0 ...
## $ GcaDef : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Tkl : num 2.16 1.87 2.01 3.57 1.73 0.99 1.13 0.96 0.69 2.31 ...
## $ TklWon : num 1.16 1.39 1.24 2.23 0.86 0.64 0.97 0.43 0.69 1.55 ...
## $ TklDef3rd : num 1.56 1.24 0.91 1.49 0.37 0.29 0.81 0.64 0 1.31 ...
## $ TklMid3rd : num 0.59 0.6 0.91 1.71 0.86 0.44 0.16 0.32 0.34 0.6 ...
## $ TklAtt3rd : num 0 0.03 0.18 0.37 0.49 0.26 0.16 0 0.34 0.4 ...
## $ TklDri : num 1.16 0.39 0.69 1.8 0.25 0.29 0.48 0.32 0.69 0.84 ...
## $ TklDriAtt : num 1.81 0.82 2.15 4.97 1.36 0.67 0.65 0.71 1.03 1.39 ...
## $ TklDri% : num 63.8 48.1 32.2 36.2 18.2 43.5 75 45 66.7 60 ...
## $ TklDriPast : num 0.66 0.42 1.46 3.17 1.11 0.38 0.16 0.39 0.34 0.56 ...
## $ Press : num 13.6 13.6 23.4 28 23.5 13.6 5.65 5.85 23.8 11.7 ...
## $ PresSucc : num 3.53 4.89 6.53 7.9 4.81 4.02 2.1 1.7 6.9 3.98 ...
## $ Press% : num 26 35.9 27.9 28.2 20.5 29.6 37.1 29.1 29 34 ...
## $ PresDef3rd : num 7.97 7.61 7.19 9.27 2.59 1.34 2.26 3.05 1.72 5.3 ...
## $ PresMid3rd : num 4.38 5.14 12.3 15.3 10 5.71 3.23 2.41 10 4.74 ...
## $ PresAtt3rd : num 1.22 0.88 3.94 3.41 10.9 6.56 0.16 0.39 12.1 1.67 ...
## $ Blocks : num 2.69 1.87 0.99 1.68 1.23 0.96 2.1 1.7 1.03 1.43 ...
## $ BlkSh : num 0.69 0.79 0.04 0.09 0.37 0.32 0.81 1.06 0 0.16 ...
## $ BlkShSv : num 0.03 0.06 0 0 0 0 0.32 0 0 0 ...
## [list output truncated]
As you can see, there is too much information. Analyzing a dataset with this many variables is very impractical. Still, there are some overall insights we can obtain:
Many variables are right-skewed.
In almost every statistic the maximum value is a strong outlier. It isn’t an error; those are just world-class players.
Apart from the five character columns (Player, Nation, Pos, Squad, Comp), every variable is numeric (integer or double); Rk is only a row identifier.
An important part of clustering is making sure that no NA values occur, since k-means cannot compute distances for observations with missing values:
## Rk Player Nation Pos Squad
## 0 0 0 0 0
## Comp Age Born MP Starts
## 0 0 0 0 0
## Min 90s Goals Shots SoT
## 0 0 0 0 0
## SoT% G/Sh G/SoT ShoDist ShoFK
## 0 0 0 0 0
## ShoPK PKatt PasTotCmp PasTotAtt PasTotCmp%
## 0 0 0 0 0
## PasTotDist PasTotPrgDist PasShoCmp PasShoAtt PasShoCmp%
## 0 0 0 0 0
## PasMedCmp PasMedAtt PasMedCmp% PasLonCmp PasLonAtt
## 0 0 0 0 0
## PasLonCmp% Assists PasAss Pas3rd PPA
## 0 0 0 0 0
## CrsPA PasProg PasAtt PasLive PasDead
## 0 0 0 0 0
## PasFK TB PasPress Sw PasCrs
## 0 0 0 0 0
## CK CkIn CkOut CkStr PasGround
## 0 0 0 0 0
## PasLow PasHigh PaswLeft PaswRight PaswHead
## 0 0 0 0 0
## TI PaswOther PasCmp PasOff PasOut
## 0 0 0 0 0
## PasInt PasBlocks SCA ScaPassLive ScaPassDead
## 0 0 0 0 0
## ScaDrib ScaSh ScaFld ScaDef GCA
## 0 0 0 0 0
## GcaPassLive GcaPassDead GcaDrib GcaSh GcaFld
## 0 0 0 0 0
## GcaDef Tkl TklWon TklDef3rd TklMid3rd
## 0 0 0 0 0
## TklAtt3rd TklDri TklDriAtt TklDri% TklDriPast
## 0 0 0 0 0
## Press PresSucc Press% PresDef3rd PresMid3rd
## 0 0 0 0 0
## PresAtt3rd Blocks BlkSh BlkShSv BlkPass
## 0 0 0 0 0
## Int Tkl+Int Clr Err Touches
## 0 0 0 0 0
## TouDefPen TouDef3rd TouMid3rd TouAtt3rd TouAttPen
## 0 0 0 0 0
## TouLive DriSucc DriAtt DriSucc% DriPast
## 0 0 0 0 0
## DriMegs Carries CarTotDist CarPrgDist CarProg
## 0 0 0 0 0
## Car3rd CPA CarMis CarDis RecTarg
## 0 0 0 0 0
## Rec Rec% RecProg CrdY CrdR
## 0 0 0 0 0
## 2CrdY Fls Fld Off Crs
## 0 0 0 0 0
## TklW PKwon PKcon OG Recov
## 0 0 0 0 0
## AerWon AerLost AerWon%
## 0 0 0
As we can see, there are no NA values. The author of the Kaggle dataset has probably already handled them.
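The per-column counts above have the shape of what `colSums(is.na(...))` returns. The original chunk is not shown, so the following is only a sketch on an invented data frame (`toy` is a hypothetical name) with one deliberately missing value:

```r
# Toy data frame with one missing value, invented for illustration.
toy <- data.frame(Goals = c(0.1, NA, 0.3), Shots = c(1.2, 0.8, 2.0))

# Count of NA values per column; all zeros means the data is complete.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single yes/no answer for the whole data frame.
has_na <- anyNA(toy)
print(has_na)
```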
Our dataset has nearly 150 columns. That is an impressive amount, but plenty of them are too detailed or provide no valuable information. To choose the best ones I studied every column. What’s more, I have been a football fan for more than 12 years (the majority of my life!). Thanks to that experience I was able to pick the most valuable variables.
After plenty of testing (including calculating the Hopkins statistic for each set of columns) I decided to limit the number of variables from 143 to 8. Part of a Data Scientist’s job is using previous experience and knowledge to better understand and solve problems. So did I here!
I’ve tried to choose variables that are as diverse and valuable as possible. Let’s decode them:
By creating a new data frame with fewer variables, I could easily continue my analysis.
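The selection step could look roughly like the sketch below. The eight variables actually chosen are not listed in this part of the report, so the column set `chosen` is purely hypothetical, the data frame `toy` is invented, and the standardisation step is an assumption (a common choice before distance-based clustering, not necessarily what was done here):

```r
# Toy stand-in for the filtered data; values are invented.
toy <- data.frame(
  Player    = c("A", "B", "C"),
  Goals     = c(0.50, 0.00, 0.10),
  PasTotAtt = c(20.2, 45.0, 61.0),
  `AerWon%` = c(45.3, 22.7, 50.0),
  Touches   = c(32.4, 58.0, 70.4),
  check.names = FALSE
)

# Hypothetical subset; the report's actual eight variables are not shown.
chosen <- c("Goals", "PasTotAtt", "AerWon%")
less_toy <- toy[, chosen]

# Optional (assumption): standardise each column to mean 0 and sd 1 so that
# no single variable dominates the Euclidean distances used by k-means.
less_toy_scaled <- scale(less_toy)
print(colnames(less_toy_scaled))
```

`check.names = FALSE` keeps column names such as `AerWon%` intact, which matches how the raw file was read.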
To check how clusterable the dataset is, I’ll compute the Hopkins statistic. It compares nearest-neighbour distances for randomly sampled points of the dataset with those for synthetic points generated uniformly in the feature space. On the conventional scale, a value close to 1 indicates a strong clustering tendency, while a value near 0.5 suggests the data is uniformly distributed and unlikely to form meaningful clusters. Note that the `hopkins()` function from `clustertend` used here reports the statistic on the reversed scale (effectively 1 minus the conventional value), so with this function values well below 0.5 point to clusterable data.
## Warning in hopkins(less_stats, n = nrow(less_stats) - 1): Package `clustertend`
## is deprecated. Use package `hopkins` instead.
## $H
## [1] 0.2225858
The value of the Hopkins statistic is about 0.22, well below 0.5. On the reversed scale used by `clustertend` this corresponds to roughly 0.78 on the conventional scale, so the data shows a noticeable clustering tendency rather than a uniform or random structure. The result is not extreme, which is not surprising: obtaining a very strong Hopkins statistic from such a complex dataset, with this many observations and variables, is not common.
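To make the two scales concrete, here is a small base-R sketch of the statistic on the conventional scale, together with its complement. It is an illustration under stated assumptions, not the code of either package: `hopkins_stat` is a hypothetical helper, it uses plain Euclidean nearest-neighbour distances (some definitions raise the distances to the power of the number of dimensions), and the test data is simulated.

```r
# Hopkins statistic, conventional scale: near 1 = clustered, near 0.5 = random.
hopkins_stat <- function(X, m) {
  X <- as.matrix(X)
  idx <- sample(nrow(X), m)  # m real points, sampled without replacement
  # m synthetic points drawn uniformly inside the bounding box of the data
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  U <- matrix(U, nrow = m)
  # Euclidean distance from point p to its nearest neighbour among rows of pts
  nearest <- function(p, pts) sqrt(min(colSums((t(pts) - p)^2)))
  u <- sapply(seq_len(m), function(i) nearest(U[i, ], X))               # synthetic -> data
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))  # data -> other data
  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well-separated blobs versus uniform noise (simulated data).
clustered <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.3), ncol = 2)
)
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_stat(clustered, m = 50)
h_uniform   <- hopkins_stat(uniform, m = 50)

print(h_clustered)      # conventional scale
print(1 - h_clustered)  # reversed scale, the convention clustertend reports
print(h_uniform)
```

Because the statistic depends on random sampling, repeated runs without a fixed seed give slightly different values.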
To present some of the statistics on charts, I’ll create another variable for that purpose.
The first thing we want to visualize is playing time. It’s important to check that our results won’t be biased by footballers who played very little.
We can observe that playing time is more or less evenly distributed. The biggest differences are visible at the beginning and end of the plot, which is pretty intuitive. Those differences aren’t big enough to strongly influence our analysis.
The next statistic we will dive into is goals scored. To visualize it better, I will multiply “goals/90min” by “played 90s”. That way I am able to show total goals and highlight the most prolific scorers.
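The conversion from a per-90 rate back to a season total can be sketched as follows. The column names `Goals` (goals per 90 minutes) and `90s` (minutes played divided by 90), as well as the three toy players, are assumptions for illustration and not taken from the dataset.

```r
# Toy data; the column names `Goals` and `90s` are assumed, not confirmed.
players <- data.frame(
  Player = c("Forward A", "Midfielder B", "Defender C"),
  Goals  = c(0.80, 0.25, 0.00),   # goals per 90 minutes
  `90s`  = c(30, 20, 25),         # number of full 90-minute units played
  check.names = FALSE             # keep the non-syntactic name `90s`
)

# Season total = rate per 90 minutes * number of 90-minute units played.
players$GoalsTotal <- round(players$Goals * players$`90s`)
print(players)
```

Rounding is used here because the per-90 rates in such datasets are typically stored with limited precision, so the product may not be an exact integer.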
From this chart we see that the largest group of players didn’t score a single goal (nearly 1000 footballers). This is fairly intuitive, because goalkeepers and defenders rarely celebrate goals.
The plot has an extremely long right tail. Those isolated values aren’t errors; they are simply superstar forwards like Lewandowski or Haaland.
Having looked at goals, let’s now check how many passes players attempt per 90 minutes. This plot should be closer to a normal distribution, because passing isn’t a trait of midfielders only.
As predicted, the distribution of pass attempts is closer to normal. It also has a longer right tail, but not as extreme as in the case of goals scored. Once again, “the outliers” are usually world-class players. Interestingly, some defenders have among the highest values (like Sergio Ramos or Jordi Alba). In the end, the most frequent passers are midfielders, for example Toni Kroos and Marco Verratti.
We have analyzed the domain of forwards and midfielders. Now it’s time for defenders! The last statistic we will analyze is the percentage of aerial duels won. Stereotypically, the tallest players in each team are defenders. Their role is to protect the goal, often by winning aerial duels in the penalty area.
With every plot we get a step closer to the shape of a normal distribution! This time we have a “problem” on the other side: nearly 200 players didn’t win a single aerial duel! This may sound ridiculous, but once you learn that some players are only 150-160 cm tall, it seems more plausible. Such footballers have little chance of winning an aerial ball against a 190 cm opposing defender. As for the players with 100%, almost all of them are defenders.
A crucial part of an analysis such as clustering is checking the correlations between variables. If they are too high, they will negatively impact the quality of the output. In our case the situation is a little easier, because at the beginning I reduced the number of variables from 143 to only 8. Overall, they describe different characteristics of players.
To prepare a visually appealing heatmap, I first have to calculate the correlations between my variables. Having done this, I’ll convert the data into a molten data frame using the melt function from the reshape2 package.
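The correlation and reshaping step can be sketched on toy data as below. To keep the example free of extra packages it uses base R's `as.data.frame(as.table(...))` as a stand-in for `reshape2::melt`; the result has the same long layout (two variable columns plus a value column). The toy data frame and its random values are assumptions for illustration only.

```r
set.seed(1)
# Toy stand-in for the selected variables (random values, not real data).
toy <- data.frame(Shots = rnorm(50), Off = rnorm(50), PasTotAtt = rnorm(50))
toy$Off <- toy$Off + 0.5 * toy$Shots   # induce a positive Shots/Off correlation

# Pairwise Pearson correlations, rounded for display on a heatmap.
cor_matrix <- round(cor(toy), 2)

# Long ("molten") format: one row per pair of variables.
cor_long <- as.data.frame(as.table(cor_matrix))
names(cor_long) <- c("Var1", "Var2", "value")
print(head(cor_long))
```

Each row of `cor_long` then maps directly to one tile of a heatmap, with `Var1` and `Var2` as the axes and `value` as the fill.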
With the data ready, I am able to create a correlation heatmap. It visualizes connections between variables in a very intuitive and clean way.
As we can see on the heatmap, no extreme correlations occur. The two strongest correlations both involve offsides (Off). The first is a positive correlation with shots (Shots). This could be explained by the fact that most shots are taken by forwards, who play up front and are the most exposed to being caught offside.
The variable negatively correlated with offsides is pass attempts (PasTotAtt). This could also be explained by player positions. Most passes, as described in the Data Visualization section, are made by midfielders and defenders. Those players stay further back and are rarely exposed to being in an offside position.
We can conclude that no strong correlations occur, so they won’t distort the results of our analysis.
Scaling the data is necessary: a variable such as “AerWon%” would have a stronger impact on the results of the analysis due to its format (a range from 0 to 100). The remaining variables are calculated per 90 minutes.
## Shots PasTotAtt Assists Tkl Blocks Fls Off
## 1 -0.7884370 0.1163052 -0.2260469 0.51349324 1.7739078 -0.48631207 -0.5174874
## 2 -0.6551024 0.2465772 -0.7275041 0.20733175 0.6409632 -0.05024222 -0.6202879
## 3 -0.5320244 1.1584811 -0.7275041 0.35513385 -0.5748797 0.39904187 -0.6202879
## 4 -0.2756119 0.4289579 -0.2260469 2.00207151 0.3784517 0.08190016 -0.5174874
## 6 1.0679899 -1.6944754 0.2754102 0.05952965 -0.2432862 1.16546767 5.7190751
## AerWon%
## 1 -1.0492802
## 2 0.8244920
## 3 0.3295333
## 4 0.3295333
## 6 -0.4028035
## Shots PasTotAtt Assists Tkl
## Min. :-1.2090 Min. :-2.07878 Min. :-0.7275 Min. :-1.76688
## 1st Qu.:-0.7782 1st Qu.:-0.72395 1st Qu.:-0.7275 1st Qu.:-0.67948
## Median :-0.2961 Median :-0.08236 Median :-0.3096 Median :-0.02493
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6680 3rd Qu.: 0.62437 3rd Qu.: 0.4426 3rd Qu.: 0.67185
## Max. : 4.0219 Max. : 3.73787 Max. :10.1374 Max. : 4.44081
## Blocks Fls Off AerWon%
## Min. :-1.94270 Min. :-1.76809 Min. :-0.6203 Min. :-2.1958
## 1st Qu.:-0.65432 1st Qu.:-0.65810 1st Qu.:-0.6203 1st Qu.:-0.5139
## Median : 0.03304 Median :-0.05685 Median :-0.4490 Median : 0.1048
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.66860 3rd Qu.: 0.58404 3rd Qu.: 0.2021 3rd Qu.: 0.7285
## Max. : 5.26946 Max. : 4.54831 Max. : 6.4729 Max. : 2.8548
After scaling, every variable has mean 0 and is expressed in units of its own standard deviation, so the variables are comparable in terms of variance. That makes the data well suited for clustering.
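What scaling does can be sketched with base R's `scale()` on a toy data frame. The four rows of values are made up for illustration; only the column names mirror the variables used above.

```r
# Toy data: one per-90 variable and one percentage variable (made-up values).
toy <- data.frame(
  Shots     = c(0.5, 1.2, 3.1, 0.0),
  `AerWon%` = c(20, 55, 80, 100),
  check.names = FALSE
)

# scale() subtracts each column's mean and divides by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))

col_means <- colMeans(toy_scaled)
col_sds <- sapply(toy_scaled, sd)
print(round(col_means, 10))
print(col_sds)
```

Before scaling, Euclidean distances between players would be dominated by the 0-100 percentage column; afterwards both columns contribute on the same scale.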
To find the optimal number of clusters, we will use the NbClust and Optimal_Clusters_KMeans functions.
First is NbClust:
opt_clusters <- NbClust(stats_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'complete',
index = 'silhouette')
opt_clusters$All.index
## 2 3 4 5 6 7 8 9 10
## 0.6339 0.0603 0.0639 0.0672 0.0370 0.0374 0.0321 0.0705 0.0872
## Number_clusters Value_Index
## 2.0000 0.6339
According to NbClust, the best number of clusters is 2. The difference is substantial: its silhouette index (0.63) is more than seven times higher than that of any other option.
This time we will use the Optimal_Clusters_KMeans function. It will be run three times: once with the default criterion (variance_explained), then with silhouette and finally with AIC. In each call the maximum number of clusters is set to 10.
The decision about the number of clusters is not easy. NbClust very strongly suggested only 2, but in my opinion such a small number won’t be especially informative or insightful. That’s why I’ve chosen to conduct my analysis on 3 clusters. All three Optimal_Clusters_KMeans plots suggest that number. Silhouette shows its highest value for 3 clusters. On the variance_explained plot I’m looking for the well-known “elbow point” ;) The drops in values are significant up to 3, then they’re visibly smaller. When it comes to AIC, the lower the value, the better the fit. Taking the chart together with the other information, 3 is the optimal choice here as well.
To sum up this section - final number of clusters is 3.
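The "elbow" reasoning can be sketched in base R without the ClusterR package. The example below uses synthetic data with three well-separated blobs (an assumption made so that the elbow is easy to see) and tracks the total within-cluster sum of squares as the number of clusters grows.

```r
set.seed(2022)
# Synthetic data: three blobs of 50 points each (illustrative only).
blob <- function(cx, cy, n = 50) cbind(rnorm(n, cx, 0.5), rnorm(n, cy, 0.5))
toy <- rbind(blob(0, 0), blob(6, 0), blob(3, 6))

# Total within-cluster sum of squares for k = 1..6.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Improvement gained by adding one more cluster: drops[k] = wss[k] - wss[k + 1].
drops <- -diff(wss)
print(round(wss, 1))
print(round(drops, 1))
```

On such data the drops are large up to 3 clusters and small afterwards, which is the elbow. On real data the bend is usually much softer, so the choice still involves judgment.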
As the method for our clustering analysis I’m choosing k-means. Why? It is:
That being said, let’s cluster some data!
cluster_km <- kmeans(stats_scaled, 3)
fviz_cluster(cluster_km, data = stats_scaled, geom = "point") + ggtitle("Number of clusters - 3")
On the graph we see 3 groups. Cluster 1 stands out from the rest: almost all of its observations are tightly concentrated far from the others. There is no clear gap between clusters 2 and 3. The blue one (3) seems to have lower variance (high density and no strong “outliers”). The green cluster (2) looks the most spread out, mainly because of some extreme observations which extend its area on the plot. My guess is that these players could be superstars whose exceptional statistics look like outliers.
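Because k-means starts from randomly chosen centers, the cluster numbering (and occasionally the assignments) can change between runs. A minimal sketch of making a run reproducible is shown below. The seed value, the `nstart` setting and the toy data are assumptions for illustration and were not necessarily used in the analysis above.

```r
set.seed(7)
# Toy data: three blobs of 30 points in two dimensions (illustrative only).
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2))

# Fix the seed right before clustering; nstart tries several random starts
# and keeps the solution with the lowest total within-cluster sum of squares.
run_kmeans <- function(seed) {
  set.seed(seed)
  kmeans(toy, centers = 3, nstart = 25)
}

fit_a <- run_kmeans(2022)
fit_b <- run_kmeans(2022)
print(table(fit_a$cluster, fit_b$cluster))
```

With the same seed both fits give identical labels, so statements such as "Cluster 1 is the tight group" stay valid when the report is re-rendered.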
To check how well each observation fits into its assigned cluster, we will calculate and visualize the Silhouette width. It ranges from -1 to 1, and we can interpret its values in the following way:
Knowing the theory behind the Silhouette width, we can calculate and visualize it.
## cluster size ave.sil.width
## 1 1 173 0.87
## 2 2 723 0.18
## 3 3 1454 0.36
The first thing we can comment on is cluster size. Without any doubt the biggest one is Cluster 3. It includes nearly 62% of all observations. Its average Silhouette width is 0.36, which is a moderate value. At the other extreme is Cluster 1: it covers just over 7% of players but has an incredibly high Silhouette width of 0.87. Such a value indicates that this is our best cluster in terms of how well observations fit. The last cluster, number 2, is not that good. It groups nearly 31% of players while having an average width of 0.18. What’s more, that cluster has some observations below 0, which means that, on average, they are closer to a neighbouring cluster than to their own and are not easy to assign.
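For readers who want to see what is behind a single Silhouette width, here is a small sketch that computes it by hand and compares it with `cluster::silhouette`. It assumes the cluster package is installed (it normally ships with R). The toy data and fixed labels are made up for illustration.

```r
library(cluster)

set.seed(42)
# Toy data: three blobs of 15 points, with known labels (illustrative only).
toy <- rbind(matrix(rnorm(30, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(30, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(30, mean = 6, sd = 0.3), ncol = 2))
labels <- rep(1:3, each = 15)
d <- as.matrix(dist(toy))

# s(i) = (b - a) / max(a, b), where
#   a = mean distance from i to the other members of its own cluster,
#   b = smallest mean distance from i to the members of any other cluster.
manual_sil <- function(i) {
  own <- labels == labels[i]
  a <- sum(d[i, own]) / (sum(own) - 1)              # d[i, i] is 0
  b <- min(tapply(d[i, !own], labels[!own], mean))
  (b - a) / max(a, b)
}

sil_manual <- sapply(seq_len(nrow(toy)), manual_sil)
sil_pkg <- as.numeric(silhouette(labels, dist(toy))[, "sil_width"])
print(round(tapply(sil_manual, labels, mean), 2))
```

A negative value arises whenever `a` exceeds `b`, that is, when a point is on average closer to another cluster than to its own.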
In short, the Calinski-Harabasz (CH) index evaluates the compactness and separation of clusters. A higher CH index value indicates better-defined clusters, as it reflects high between-cluster variance and low within-cluster variance. We will calculate the CH index for our choice (3 clusters) and additionally for two more options: 2 and 4 clusters.
## [1] "Calinski-Harabasz index for 3 clusters"
## [1] 823.42
## [1] "Calinski-Harabasz index for 2 clusters"
## [1] 688.7
## [1] "Calinski-Harabasz index for 4 clusters"
## [1] 702.73
The highest CH index value was obtained for 3 clusters. This means that our original choice provides a more compact and better-separated clustering structure than the alternatives. It is the best of the three options.
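The CH index can be derived directly from a `kmeans` fit as the ratio of between-cluster to within-cluster variance, each divided by its degrees of freedom. The sketch below uses a hypothetical helper `ch_index` and synthetic three-blob data; it is not necessarily the function that produced the numbers above.

```r
# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(fit, n) {
  k <- nrow(fit$centers)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(2022)
# Synthetic data: three blobs of 50 points each (illustrative only).
blob <- function(cx, cy, n = 50) cbind(rnorm(n, cx, 0.5), rnorm(n, cy, 0.5))
toy <- rbind(blob(0, 0), blob(6, 0), blob(3, 6))

ch <- sapply(2:4, function(k) {
  ch_index(kmeans(toy, centers = k, nstart = 20), n = nrow(toy))
})
names(ch) <- 2:4
print(round(ch, 1))
```

Since the data was generated with three groups, the index peaks at 3 here. The division by `k - 1` and `n - k` is what stops the index from simply growing with every additional cluster.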
The shadow statistic is used to evaluate the quality of clustering. It is similar in concept to the Silhouette score: it provides insight into how well an individual data point fits into its assigned cluster compared to other clusters.
Its interpretation is closely related to that of Silhouette:
## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'
## Also defined by 'flexclust'
## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'
## Also defined by 'flexclust'
## 1 2 3
## 0.1568998 0.6779121 0.7601752
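So here are the results. Cluster 1 has very low shadow values, while most points in Cluster 3 have high values (around 0.7-0.8) and Cluster 2 shows a wide range with a mean of about 0.68. A caveat on reading them: in the definition given in the flexclust documentation, the shadow value is twice the distance to the closest centroid divided by the sum of the distances to the closest and second-closest centroids. Under that definition, values near 0 mean a point sits close to its own centroid, and values near 1 mean it is almost equidistant from two centroids. Read this way, Cluster 1 is the best-separated cluster, while Clusters 2 and 3 contain many points lying between centroids.
Why does this graph look so different from the Silhouette one?
The main reason is the difference in methodology. The Silhouette score compares the average distance from a point to the other members of its own cluster with the average distance to the members of the closest other cluster. The shadow statistic, on the other hand, compares a point’s distance to its own cluster’s centroid with its distance to the nearest other centroid. If the shadow scale is indeed reversed as described above, the two measures agree on the main point: Cluster 1 is by far the most clearly separated.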
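The centroid-based comparison can be sketched in base R as follows. The formula assumes the definition from the flexclust documentation (twice the distance to the closest centroid, divided by the sum of the distances to the closest and second-closest centroids). The toy data is made up, and the code is an illustration rather than a replacement for `flexclust::shadow`.

```r
set.seed(11)
# Toy data: two well-separated blobs of 40 points (illustrative only).
toy <- rbind(matrix(rnorm(80, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(80, mean = 4, sd = 0.5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)

# Euclidean distance from every point to every centroid (n x k matrix).
dist_to_centers <- sapply(seq_len(nrow(km$centers)), function(k) {
  sqrt(rowSums(sweep(toy, 2, km$centers[k, ])^2))
})

# For each point: distance to the closest and second-closest centroid.
sorted <- t(apply(dist_to_centers, 1, sort))
shadow_values <- 2 * sorted[, 1] / (sorted[, 1] + sorted[, 2])

# Average shadow value per cluster.
shadow_by_cluster <- tapply(shadow_values, km$cluster, mean)
print(round(shadow_by_cluster, 3))
```

By construction the values lie between 0 and 1. A point sitting exactly on its centroid gets 0, and a point exactly halfway between two centroids gets 1.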
Looking at how observations are distributed across clusters without analyzing the cluster centers is of little value. Let’s see the features of our clusters.
## [1] "Cluster Centers:"
## Shots PasTotAtt Assists Tkl Blocks Fls Off
## 1 -1.1811484 -0.5707101 -0.7033491 -1.7381393 -1.9113965 -1.7163056 -0.6183072
## 2 1.1615894 -0.7829139 0.6185003 -0.5077606 -0.5071927 0.1994758 0.9994477
## 3 -0.4370636 0.4572074 -0.2238627 0.4592909 0.4796231 0.1050205 -0.4234068
## AerWon%
## 1 -2.1771111
## 2 -0.4027337
## 3 0.4592962
Knowing the center of every statistic allows us to give the clusters insightful names that describe the whole dataset well. I’ll also make use of my domain knowledge in naming them.
Cluster 1 - Goalkeepers
(n = 173, avg. sil. = 0.87)
A very homogeneous group. The centers of almost all their statistics are at the lowest level. This can be explained by two factors. The first is their actual role during a match. A goalkeeper’s mission is to protect the team from conceding goals. They do not shoot, assist, block or get caught offside. They mainly stay in the penalty area and guard the goal. The second concerns an important statistic: aerial duels won. It may seem obvious that 2 m tall players should win nearly 100% of aerial balls. The catch is in the methodology: the statistic describes aerial HEAD duels won. Goalkeepers mostly catch the ball with their hands, so it doesn’t count.
Cluster 2 - Offensive Players
(n = 723, avg. sil. = 0.18)
This cluster is not as well fitted as the previous one. It’s characterized by the highest number of shots and assists, two metrics directly connected with scoring goals. Players from this cluster are also often caught offside, which is due to their advanced positioning on the pitch. What’s a little surprising is that they tend to make the fewest passes, even fewer than goalkeepers. That may show that this group of players doesn’t need that many passes to do its job.
Cluster 3 - Defensive Players
(n = 1454, avg. sil. = 0.36)
The largest cluster, with a moderate fit. I’ve named it defensive because most statistics have reversed values compared to Cluster 2. It contains a large group of players: not only defenders, but also plenty of midfielders. They tend to shoot less and pass more (while having fewer assists). Clearly defensive statistics like blocks and tackles (Tkl) are also highest in this cluster. These players also win the most aerial (head) duels, which could be influenced by the height of defenders.
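The comparisons made in the three descriptions above can be checked mechanically against the printed cluster centers. The sketch below retypes the centers from the output into a matrix and asks, for each variable, which cluster has the highest and the lowest center. The object names are chosen for illustration.

```r
# Cluster centers copied from the printed output (rows = clusters 1..3).
centers <- rbind(
  c(-1.1811484, -0.5707101, -0.7033491, -1.7381393, -1.9113965, -1.7163056, -0.6183072, -2.1771111),
  c( 1.1615894, -0.7829139,  0.6185003, -0.5077606, -0.5071927,  0.1994758,  0.9994477, -0.4027337),
  c(-0.4370636,  0.4572074, -0.2238627,  0.4592909,  0.4796231,  0.1050205, -0.4234068,  0.4592962)
)
dimnames(centers) <- list(
  1:3,
  c("Shots", "PasTotAtt", "Assists", "Tkl", "Blocks", "Fls", "Off", "AerWon%")
)

# For every variable: the cluster with the highest and the lowest center.
highest <- apply(centers, 2, which.max)
lowest <- apply(centers, 2, which.min)
print(rbind(highest, lowest))
```

The result matches the narrative: Cluster 2 leads in shots, assists, fouls and offsides, Cluster 3 leads in passes, tackles, blocks and aerial duels won, and Cluster 1 is lowest in everything except pass attempts, where Cluster 2 is lowest.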
In this report we have gone through every stage of a clustering analysis. We started with loading the data and choosing the most appropriate variables. Then came some visualizations to better understand the whole dataset. In the end, the crème de la crème: the clustering itself.
For someone interested in football, the results could be pretty intuitive. This analysis also confirms that domain knowledge agrees with independent data. Using such interesting tools to describe my hobby was an awesome experience. I’ve enjoyed it a lot :)