Cezary Kuźmowicz
We are meeting (I hope) once again! This project can be treated as the next episode of my clustering analysis. Here, I’ll again use statistics of football players from the TOP 5 European leagues.
I’ll reuse some of the text from the clustering part. I hope you will still enjoy this! :)
Football is the most popular sport on Earth. Millions of people around the globe play it on a daily basis. Most countries have their own national leagues, but the best of the best play in Europe. In football nomenclature, graphs and statistics often refer to the “TOP 5 European Leagues”, meaning the five best national competitions in Europe.
This group consists of the English Premier League, Spanish La Liga, Italian Serie A, German Bundesliga and French Ligue 1. In this report I will conduct a dimension reduction analysis using player statistics from the 2021/22 season.
To achieve satisfactory results, many tests and visualizations will be presented. As the dimension reduction method I’ve chosen PCA, mainly because it is easier to interpret (at least for me).
## Loading required package: ggplot2
## Loading required package: lattice
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
My dataset comes from Kaggle. It was prepared based on data from fbref, a website gathering a huge amount of information on plenty of sports.
raw_stats <- read.csv("/Users/czarek/Downloads/2021-2022 Football Player Stats.csv", sep = ";", dec = ".",
check.names = FALSE)
The raw dataset includes nearly 3000 observations (individual players), each described by 143 variables! It’s important to add that every statistic is calculated “per 90 minutes”. This means that the author already unified the data, so one step less for us!
The first step of our data preparation will be excluding footballers who played fewer than 180 minutes over the whole season. That operation ensures that we don’t analyze a player who scored 4 goals in the only match he played.
This operation got rid of nearly 600 observations. That will definitely improve the overview of our dataset and the quality of the analysis.
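The filtering code itself is not shown in this report, so here is only a minimal sketch of how such a step could look. The toy data frame and the column names `Player` and `Min` are my assumptions for illustration, not taken from the real file.

```r
# Toy data standing in for the raw table; `Player` and `Min` are assumed column names.
raw_toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  Min    = c(90, 180, 2400, 45),   # minutes played in the season
  Goals  = c(4.0, 0.5, 0.3, 0.0),  # per-90 values
  stringsAsFactors = FALSE
)

min_minutes <- 180

# Keep only players with at least `min_minutes` played
stats_180_toy <- raw_toy[raw_toy$Min >= min_minutes, ]

n_removed <- nrow(raw_toy) - nrow(stats_180_toy)
print(stats_180_toy)
cat("Removed", n_removed, "of", nrow(raw_toy), "players\n")
```

Player A shows why the threshold matters: a single 90-minute appearance with 4 goals would otherwise look like a world-class per-90 scoring rate.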
In this case, I’ve also dropped some columns based on my own knowledge. The dataset contained variables that described one statistic in almost the same way. For example, for total passes we had: PasTotAtt (total attempts), PasTotCmp (completed attempts), PasCmp% (% of completed attempts). That’s why I decided to keep only the versions with total attempts (I applied this to every similar case).
to_stay <- c("Goals", "Shots", "SoT", "ShoDist", "ShoPK", "PasTotAtt", "PasTotPrgDist", "PasShoAtt",
"PasMedAtt","PasLonAtt", "Assists", "Pas3rd", "PPA", "CrsPA", "PasProg", "PasAtt", "PasFK", "TB",
"PasPress","Sw", "PasCrs", "CK", "PasGround", "PasHigh", "PaswHead", "PasOff", "PasInt", "PasBlocks",
"SCA", "ScaDrib", "ScaSh", "ScaFld", "GCA", "GcaPassLive", "GcaDrib", "GcaSh", "GcaFld", "GcaDef",
"Tkl", "TklDef3rd","TklMid3rd", "TklAtt3rd", "TklDriAtt", "TklDriPast","Press", "PresDef3rd","PresMid3rd",
"PresAtt3rd", "Blocks", "BlkSh", "BlkShSv", "BlkPass", "Int", "Tkl+Int", "Clr", "Err", "Touches",
"TouDefPen", "TouDef3rd", "TouMid3rd", "TouAtt3rd", "TouAttPen", "TouLive","DriAtt", "DriPast","DriMegs",
"Carries", "CarTotDist", "CarPrgDist", "CarProg", "Car3rd", "CPA", "CarMis", "CarDis", "Rec", "RecProg",
"CrdY", "CrdR", "Fls", "Fld", "Off", "Crs", "PKwon", "PKcon", "OG", "Recov", "AerWon%")
only_stats <- stats_180[,to_stay]
That’s how I got rid of over 50 variables. Thanks to this, the results of the dimension reduction should be more relevant.
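One extra check that could be worth doing at this point is looking for columns that are exact copies of each other. In the standardized summary further down, PasTotAtt and PasAtt report identical quantiles, and so do PasCrs and Crs, which suggests (but does not prove) that such copies survived the selection. Below is a small sketch on made-up numbers; the toy values are hypothetical.

```r
# Toy data: two columns hold the same values under different names (hypothetical numbers)
toy <- data.frame(
  PasTotAtt = c(50.1, 32.4, 61.0, 45.5),
  PasAtt    = c(50.1, 32.4, 61.0, 45.5),
  Shots     = c(1.2, 0.4, 2.9, 0.0)
)

# duplicated() on the list of columns flags every later copy of an earlier column
dup_flag <- duplicated(as.list(toy))
dup_cols <- names(toy)[dup_flag]
print(dup_cols)

# Drop the copies, keeping the first occurrence of each column
toy_unique <- toy[, !dup_flag, drop = FALSE]
print(names(toy_unique))
```

Exact copies add no information to a PCA and make the covariance matrix singular, so removing them before the decomposition is a cheap safeguard.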
To begin, I’ll standardize my data. The preProcess function finds the mean and SD of every variable, and then predict applies the transformation.
preproc1 <- preProcess(only_stats, method = c("center", "scale"))
only_stats_predicted <- predict(preproc1, only_stats)
summary(only_stats_predicted)
## Goals Shots SoT ShoDist
## Min. :-0.7031 Min. :-1.2090 Min. :-0.9602 Min. :-2.16358
## 1st Qu.:-0.7031 1st Qu.:-0.7782 1st Qu.:-0.7842 1st Qu.:-0.47746
## Median :-0.4136 Median :-0.2961 Median :-0.3569 Median : 0.09863
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3392 3rd Qu.: 0.6680 3rd Qu.: 0.5417 3rd Qu.: 0.66066
## Max. : 7.5779 Max. : 4.0219 Max. : 4.9467 Max. : 7.84071
## ShoPK PasTotAtt PasTotPrgDist PasShoAtt
## Min. :-0.2521 Min. :-2.07878 Min. :-1.56473 Min. :-2.25758
## 1st Qu.:-0.2521 1st Qu.:-0.72395 1st Qu.:-0.76748 1st Qu.:-0.66343
## Median :-0.2521 Median :-0.08236 Median :-0.08466 Median :-0.07459
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.2521 3rd Qu.: 0.62437 3rd Qu.: 0.61919 3rd Qu.: 0.55732
## Max. :12.9450 Max. : 3.73787 Max. : 6.07584 Max. : 4.72222
## PasMedAtt PasLonAtt Assists Pas3rd
## Min. :-1.71214 Min. :-1.5624 Min. :-0.7275 Min. :-1.4693
## 1st Qu.:-0.78991 1st Qu.:-0.7292 1st Qu.:-0.7275 1st Qu.:-0.7475
## Median :-0.07672 Median :-0.1458 Median :-0.3096 Median :-0.1512
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.60880 3rd Qu.: 0.5300 3rd Qu.: 0.4426 3rd Qu.: 0.5110
## Max. : 3.93189 Max. : 6.4677 Max. :10.1374 Max. : 5.6872
## PPA CrsPA PasProg PasAtt
## Min. :-1.1817 Min. :-0.7702 Min. :-1.70669 Min. :-2.07878
## 1st Qu.:-0.8194 1st Qu.:-0.7702 1st Qu.:-0.68055 1st Qu.:-0.72395
## Median :-0.1884 Median :-0.3812 Median :-0.07586 Median :-0.08236
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.5932 3rd Qu.: 0.4358 3rd Qu.: 0.61435 3rd Qu.: 0.62437
## Max. : 4.9570 Max. : 7.0108 Max. : 4.58454 Max. : 3.73787
## PasFK TB PasPress Sw
## Min. :-1.0972 Min. :-0.6094 Min. :-2.8370 Min. :-1.3242
## 1st Qu.:-0.8078 1st Qu.:-0.6094 1st Qu.:-0.6227 1st Qu.:-0.7081
## Median :-0.2090 Median :-0.6094 Median :-0.0394 Median :-0.2433
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5395 3rd Qu.: 0.3041 3rd Qu.: 0.6011 3rd Qu.: 0.4485
## Max. : 6.8866 Max. : 8.5258 Max. : 3.8983 Max. : 6.1126
## PasCrs CK PasGround PasHigh
## Min. :-0.9739 Min. :-0.4626 Min. :-1.9080 Min. :-1.9102
## 1st Qu.:-0.8063 1st Qu.:-0.4626 1st Qu.:-0.7271 1st Qu.:-0.7319
## Median :-0.3035 Median :-0.4626 Median :-0.1668 Median :-0.1303
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6052 3rd Qu.:-0.1895 3rd Qu.: 0.5452 3rd Qu.: 0.5098
## Max. : 7.0807 Max. : 5.7770 Max. : 4.6603 Max. : 6.6615
## PaswHead PasOff PasInt PasBlocks
## Min. :-1.70141 Min. :-1.0673 Min. :-1.97775 Min. :-1.65784
## 1st Qu.:-0.72788 1st Qu.:-0.7701 1st Qu.:-0.69450 1st Qu.:-0.75013
## Median :-0.07223 Median :-0.1758 Median :-0.06024 Median :-0.06935
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.65296 3rd Qu.: 0.4928 3rd Qu.: 0.61826 3rd Qu.: 0.67195
## Max. : 4.40803 Max. : 6.3616 Max. : 5.39728 Max. : 5.14999
## SCA ScaDrib ScaSh ScaFld
## Min. :-1.44614 Min. :-0.6253 Min. :-0.7931 Min. :-0.7001
## 1st Qu.:-0.82346 1st Qu.:-0.6253 1st Qu.:-0.7931 1st Qu.:-0.7001
## Median :-0.08756 Median :-0.6253 Median :-0.3050 Median :-0.4192
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.60588 3rd Qu.: 0.2796 3rd Qu.: 0.4273 3rd Qu.: 0.3534
## Max. : 4.46528 Max. : 8.3598 Max. : 6.6106 Max. : 6.3235
## GCA GcaPassLive GcaDrib GcaSh
## Min. :-0.9659 Min. :-0.8854 Min. :-0.3096 Min. :-0.3962
## 1st Qu.:-0.9659 1st Qu.:-0.8854 1st Qu.:-0.3096 1st Qu.:-0.3962
## Median :-0.2375 Median :-0.2332 Median :-0.3096 Median :-0.3962
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4909 3rd Qu.: 0.4841 3rd Qu.:-0.3096 3rd Qu.:-0.3962
## Max. : 6.5127 Max. : 7.5925 Max. :12.5305 Max. :10.4093
## GcaFld GcaDef Tkl TklDef3rd
## Min. :-0.358 Min. :-0.214 Min. :-1.76688 Min. :-1.34760
## 1st Qu.:-0.358 1st Qu.:-0.214 1st Qu.:-0.67948 1st Qu.:-0.78725
## Median :-0.358 Median :-0.214 Median :-0.02493 Median :-0.09106
## Mean : 0.000 Mean : 0.000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.358 3rd Qu.:-0.214 3rd Qu.: 0.67185 3rd Qu.: 0.65607
## Max. :11.489 Max. :15.255 Max. : 4.44081 Max. : 5.64828
## TklMid3rd TklAtt3rd TklDriAtt TklDriPast
## Min. :-1.47357 Min. :-1.1448 Min. :-1.67002 Min. :-1.5186
## 1st Qu.:-0.70559 1st Qu.:-0.7929 1st Qu.:-0.69776 1st Qu.:-0.7044
## Median :-0.09573 Median :-0.1394 Median :-0.08728 Median :-0.1228
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.60448 3rd Qu.: 0.5644 3rd Qu.: 0.61365 3rd Qu.: 0.5838
## Max. : 4.60245 Max. : 7.7031 Max. : 4.25396 Max. : 5.8743
## Press PresDef3rd PresMid3rd PresAtt3rd
## Min. :-2.19371 Min. :-1.77376 Min. :-1.82796 Min. :-1.2066
## 1st Qu.:-0.62706 1st Qu.:-0.71508 1st Qu.:-0.70791 1st Qu.:-0.8578
## Median : 0.05739 Median : 0.04503 Median :-0.05353 Median :-0.1899
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.66580 3rd Qu.: 0.66560 3rd Qu.: 0.68344 3rd Qu.: 0.6752
## Max. : 2.94732 Max. : 3.65138 Max. : 3.64059 Max. : 4.3393
## Blocks BlkSh BlkShSv BlkPass
## Min. :-1.94270 Min. :-0.9242 Min. :-0.216 Min. :-1.80805
## 1st Qu.:-0.65432 1st Qu.:-0.7487 1st Qu.:-0.216 1st Qu.:-0.66381
## Median : 0.03304 Median :-0.3098 Median :-0.216 Median :-0.01706
## Mean : 0.00000 Mean : 0.0000 Mean : 0.000 Mean : 0.00000
## 3rd Qu.: 0.66860 3rd Qu.: 0.4510 3rd Qu.:-0.216 3rd Qu.: 0.62968
## Max. : 5.26946 Max. : 6.6249 Max. :16.994 Max. : 6.84840
## Int Tkl+Int Clr Err
## Min. :-1.58550 Min. :-1.87289 Min. :-1.0713 Min. :-0.4005
## 1st Qu.:-0.78056 1st Qu.:-0.69486 1st Qu.:-0.7531 1st Qu.:-0.4005
## Median : 0.01288 Median : 0.08436 Median :-0.3266 Median :-0.4005
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67983 3rd Qu.: 0.69179 3rd Qu.: 0.4929 3rd Qu.: 0.1773
## Max. : 5.72796 Max. : 4.56948 Max. : 4.3453 Max. :18.8571
## Touches TouDefPen TouDef3rd TouMid3rd
## Min. :-2.20332 Min. :-0.68082 Min. :-1.4029 Min. :-2.15054
## 1st Qu.:-0.75442 1st Qu.:-0.54146 1st Qu.:-0.8793 1st Qu.:-0.58386
## Median :-0.07258 Median :-0.35855 Median :-0.1817 Median :-0.04498
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.64859 3rd Qu.: 0.05672 3rd Qu.: 0.7135 3rd Qu.: 0.59584
## Max. : 3.63163 Max. : 4.89352 Max. : 3.5001 Max. : 3.75002
## TouAtt3rd TouAttPen TouLive DriAtt
## Min. :-1.52449 Min. :-1.1154 Min. :-2.31058 Min. :-1.1240
## 1st Qu.:-0.86183 1st Qu.:-0.7605 1st Qu.:-0.67057 1st Qu.:-0.7755
## Median : 0.07789 Median :-0.3706 Median :-0.05971 Median :-0.2267
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.68347 3rd Qu.: 0.6330 3rd Qu.: 0.61090 3rd Qu.: 0.5296
## Max. : 3.51309 Max. : 4.4835 Max. : 3.70502 Max. : 5.3643
## DriPast DriMegs Carries CarTotDist
## Min. :-1.1105 Min. :-0.6217 Min. :-2.06355 Min. :-1.97513
## 1st Qu.:-0.7604 1st Qu.:-0.6217 1st Qu.:-0.71087 1st Qu.:-0.71822
## Median :-0.2227 Median :-0.6217 Median :-0.09269 Median :-0.06664
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.4901 3rd Qu.: 0.3171 3rd Qu.: 0.61497 3rd Qu.: 0.59400
## Max. : 6.1046 Max. :12.4441 Max. : 4.36473 Max. : 4.56649
## CarPrgDist CarProg Car3rd CPA
## Min. :-1.9518 Min. :-1.5466 Min. :-1.3179 Min. :-0.7464
## 1st Qu.:-0.7410 1st Qu.:-0.6949 1st Qu.:-0.7920 1st Qu.:-0.7464
## Median :-0.1013 Median :-0.1237 Median :-0.1260 Median :-0.3790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5752 3rd Qu.: 0.5530 3rd Qu.: 0.5956 3rd Qu.: 0.3172
## Max. : 4.8386 Max. : 4.9038 Max. : 4.5247 Max. : 8.4583
## CarMis CarDis Rec RecProg
## Min. :-1.1537 Min. :-1.2325 Min. :-2.2469 Min. :-1.0569
## 1st Qu.:-0.8235 1st Qu.:-0.8262 1st Qu.:-0.6802 1st Qu.:-0.9102
## Median :-0.2614 Median :-0.1674 Median :-0.1384 Median :-0.3047
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6576 3rd Qu.: 0.6013 3rd Qu.: 0.5664 3rd Qu.: 0.7681
## Max. : 7.0282 Max. : 5.5974 Max. : 4.3849 Max. : 3.4459
## CrdY CrdR Fls Fld
## Min. :-1.1845 Min. :-0.2937 Min. :-1.76809 Min. :-1.4838
## 1st Qu.:-0.7231 1st Qu.:-0.2937 1st Qu.:-0.65810 1st Qu.:-0.7370
## Median :-0.1464 Median :-0.2937 Median :-0.05685 Median :-0.1324
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.4880 3rd Qu.:-0.2937 3rd Qu.: 0.58404 3rd Qu.: 0.5670
## Max. : 8.0429 Max. :14.9060 Max. : 4.54831 Max. : 4.4434
## Off Crs PKwon PKcon
## Min. :-0.6203 Min. :-0.9739 Min. :-0.2877 Min. :-0.3678
## 1st Qu.:-0.6203 1st Qu.:-0.8063 1st Qu.:-0.2877 1st Qu.:-0.3678
## Median :-0.4490 Median :-0.3035 Median :-0.2877 Median :-0.3678
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2021 3rd Qu.: 0.6052 3rd Qu.:-0.2877 3rd Qu.:-0.3678
## Max. : 6.4729 Max. : 7.0807 Max. :12.8239 Max. :12.4946
## OG Recov AerWon%
## Min. :-0.1901 Min. :-2.429940 Min. :-2.1958
## 1st Qu.:-0.1901 1st Qu.:-0.792977 1st Qu.:-0.5139
## Median :-0.1901 Median :-0.006436 Median : 0.1048
## Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.:-0.1901 3rd Qu.: 0.696261 3rd Qu.: 0.7285
## Max. :21.9727 Max. : 3.267491 Max. : 2.8548
The results are clear: every variable has a mean equal to 0. The dataset is well prepared for further analysis.
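The summary confirms the centering, but not directly the scaling. As a sketch, the same kind of standardization can be reproduced with base R's `scale()` (used here instead of caret's `preProcess`, purely for illustration on random toy data) and both properties checked numerically.

```r
set.seed(1)

# Toy per-90 style data (random numbers, not the real dataset)
toy <- data.frame(
  Goals   = rexp(200, rate = 3),
  Touches = rnorm(200, mean = 60, sd = 15)
)

# Center each column to mean 0 and scale it to standard deviation 1
toy_std <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

col_means <- colMeans(toy_std)
col_sds   <- vapply(toy_std, sd, numeric(1))

print(round(col_means, 10))
print(round(col_sds, 10))
```

After this step every variable contributes one unit of variance, so the total variance equals the number of variables.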
In the next step I will compute the covariance matrix. With it, I can perform an eigendecomposition. The eigenvalues show how much variance each component explains, and so tell me the importance of every component.
only_stats_predicted.cov <- cov(only_stats_predicted)
only_stats_predicted.eigen <- eigen(only_stats_predicted.cov)
only_stats_predicted.eigen$values
## [1] 2.085804e+01 1.835735e+01 7.257873e+00 4.019115e+00 2.694859e+00
## [6] 2.417349e+00 1.992602e+00 1.644692e+00 1.520041e+00 1.326265e+00
## [11] 1.187489e+00 1.113714e+00 1.031181e+00 1.007011e+00 9.727987e-01
## [16] 9.481628e-01 9.178450e-01 9.036441e-01 8.492774e-01 8.415536e-01
## [21] 7.920640e-01 7.692663e-01 7.486301e-01 7.263357e-01 6.713695e-01
## [26] 6.196703e-01 5.940074e-01 5.659104e-01 5.394744e-01 5.002340e-01
## [31] 4.843110e-01 4.360038e-01 4.316396e-01 4.217109e-01 4.108663e-01
## [36] 3.782525e-01 3.607542e-01 3.558669e-01 3.479115e-01 3.336735e-01
## [41] 3.219335e-01 2.993852e-01 2.849657e-01 2.687422e-01 2.591075e-01
## [46] 2.557401e-01 2.365938e-01 2.270151e-01 2.031520e-01 1.956902e-01
## [51] 1.912948e-01 1.822992e-01 1.655536e-01 1.593010e-01 1.583929e-01
## [56] 1.376131e-01 1.284020e-01 1.157858e-01 9.947336e-02 9.000246e-02
## [61] 8.909214e-02 8.000108e-02 6.987183e-02 6.882059e-02 5.423179e-02
## [66] 4.694702e-02 4.140641e-02 3.669594e-02 3.490325e-02 3.306607e-02
## [71] 3.100752e-02 2.226825e-02 1.949984e-02 1.440471e-02 1.316622e-02
## [76] 7.669267e-03 4.668464e-03 1.712842e-03 6.101561e-04 4.247715e-04
## [81] 2.120239e-04 2.184301e-05 1.912316e-05 1.091479e-05 5.230782e-06
## [86] 1.646285e-15 3.565051e-16
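From the eigenvalues we can see that only the first 14 components have a value above 1, i.e. explain more variance than a single standardized variable does. Needing that many components is a sign that our dataset is not of great quality for dimension reduction.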
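To make the link between eigenvalues and explained variance concrete, here is a minimal sketch on random toy data (the six toy variables are my invention). For standardized data the eigenvalues sum to the number of variables, so dividing an eigenvalue by that number gives the share of variance. The last lines apply this to the first eigenvalue printed above.

```r
set.seed(42)
n <- 300

# Two hidden "styles" generating five correlated toy variables, plus one pure-noise column
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(f1 + rnorm(n, sd = 0.3), f1 + rnorm(n, sd = 0.3), f1 + rnorm(n, sd = 0.3),
           f2 + rnorm(n, sd = 0.3), f2 + rnorm(n, sd = 0.3), rnorm(n))

Z  <- scale(X)                # standardize, as in the report
ev <- eigen(cov(Z))$values    # eigenvalues, sorted in decreasing order

prop     <- ev / sum(ev)      # share of variance per component
n_kaiser <- sum(ev > 1)       # components with an eigenvalue above 1

print(round(ev, 3))
print(round(prop, 3))
print(n_kaiser)

# Same arithmetic for the report: first eigenvalue over 87 standardized variables
pc1_share <- 2.085804e+01 / 87
print(round(pc1_share, 4))
```

The value obtained for the first component agrees with the proportion of variance that `prcomp` reports for PC1 later on.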
We can finally perform the PCA. We don’t have to center or scale the data because that has been done before.
stats <- only_stats_predicted # for easier references :)
stats.pca1 <- prcomp(stats, center = FALSE, scale. = FALSE)
summary(stats.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.5671 4.2845 2.69404 2.0048 1.64160 1.55478 1.4116
## Proportion of Variance 0.2397 0.2110 0.08342 0.0462 0.03098 0.02779 0.0229
## Cumulative Proportion 0.2397 0.4507 0.53418 0.5804 0.61135 0.63913 0.6620
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.2825 1.23290 1.15164 1.08972 1.0553 1.01547 1.00350
## Proportion of Variance 0.0189 0.01747 0.01524 0.01365 0.0128 0.01185 0.01157
## Cumulative Proportion 0.6809 0.69841 0.71366 0.72731 0.7401 0.75196 0.76354
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.98631 0.9737 0.95804 0.95060 0.92156 0.91736 0.8900
## Proportion of Variance 0.01118 0.0109 0.01055 0.01039 0.00976 0.00967 0.0091
## Cumulative Proportion 0.77472 0.7856 0.79617 0.80655 0.81631 0.82599 0.8351
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.87708 0.8652 0.85225 0.81937 0.78719 0.77072 0.7523
## Proportion of Variance 0.00884 0.0086 0.00835 0.00772 0.00712 0.00683 0.0065
## Cumulative Proportion 0.84393 0.8525 0.86089 0.86860 0.87573 0.88255 0.8891
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.7345 0.70727 0.69592 0.66031 0.65699 0.64939 0.64099
## Proportion of Variance 0.0062 0.00575 0.00557 0.00501 0.00496 0.00485 0.00472
## Cumulative Proportion 0.8953 0.90101 0.90658 0.91159 0.91655 0.92140 0.92612
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.61502 0.60063 0.59655 0.5898 0.57764 0.5674 0.54716
## Proportion of Variance 0.00435 0.00415 0.00409 0.0040 0.00384 0.0037 0.00344
## Cumulative Proportion 0.93047 0.93461 0.93870 0.9427 0.94654 0.9502 0.95368
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.53382 0.51840 0.50903 0.50571 0.48641 0.47646 0.45072
## Proportion of Variance 0.00328 0.00309 0.00298 0.00294 0.00272 0.00261 0.00234
## Cumulative Proportion 0.95696 0.96004 0.96302 0.96596 0.96868 0.97129 0.97363
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.44237 0.4374 0.4270 0.4069 0.39913 0.39799 0.37096
## Proportion of Variance 0.00225 0.0022 0.0021 0.0019 0.00183 0.00182 0.00158
## Cumulative Proportion 0.97588 0.9781 0.9802 0.9821 0.98390 0.98572 0.98731
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.35833 0.34027 0.31539 0.30000 0.29848 0.28284 0.2643
## Proportion of Variance 0.00148 0.00133 0.00114 0.00103 0.00102 0.00092 0.0008
## Cumulative Proportion 0.98878 0.99011 0.99126 0.99229 0.99331 0.99423 0.9950
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.26234 0.23288 0.21667 0.20349 0.19156 0.1868 0.18184
## Proportion of Variance 0.00079 0.00062 0.00054 0.00048 0.00042 0.0004 0.00038
## Cumulative Proportion 0.99583 0.99645 0.99699 0.99747 0.99789 0.9983 0.99867
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.17609 0.14923 0.13964 0.12002 0.11474 0.08757 0.06833
## Proportion of Variance 0.00036 0.00026 0.00022 0.00017 0.00015 0.00009 0.00005
## Cumulative Proportion 0.99903 0.99928 0.99951 0.99967 0.99982 0.99991 0.99997
## PC78 PC79 PC80 PC81 PC82 PC83
## Standard deviation 0.04139 0.02470 0.02061 0.01456 0.004674 0.004373
## Proportion of Variance 0.00002 0.00001 0.00000 0.00000 0.000000 0.000000
## Cumulative Proportion 0.99999 0.99999 1.00000 1.00000 1.000000 1.000000
## PC84 PC85 PC86 PC87
## Standard deviation 0.003304 0.002287 2.028e-15 3.247e-16
## Proportion of Variance 0.000000 0.000000 0.000e+00 0.000e+00
## Cumulative Proportion 1.000000 1.000000 1.000e+00 1.000e+00
Here are the results! The first insight is that the first 8 components account for 68% of the total variance. If we would like to have more than 80%, we should use 18 components. It’s also important to notice that the first 3 components describe over 50% of the variance.
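Reading these thresholds off the table by eye works, but it can also be done programmatically. The sketch below uses a small helper, `n_for()` (a name I made up), first on the cumulative proportions copied from the summary above and then on a toy `prcomp` fit to show where such numbers come from.

```r
# Smallest number of components whose cumulative share reaches the threshold
n_for <- function(cum_share, threshold) which(cum_share >= threshold)[1]

# Cumulative proportions of PC1..PC18, copied from the summary above
reported <- c(0.2397, 0.4507, 0.53418, 0.5804, 0.61135, 0.63913, 0.6620,
              0.6809, 0.69841, 0.71366, 0.72731, 0.7401, 0.75196, 0.76354,
              0.77472, 0.7856, 0.79617, 0.80655)
k50 <- n_for(reported, 0.50)
k80 <- n_for(reported, 0.80)
print(c(k50 = k50, k80 = k80))

# The same quantities computed from a prcomp object (random toy data)
set.seed(7)
X <- matrix(rnorm(200 * 8), ncol = 8)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.5)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)
print(round(cum_share, 3))
```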
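Next we can visualize the relationships between all variables. Knowing the theory behind that plot, the interpretation is quite intuitive: variables pointing in a similar direction are positively correlated, while variables on opposite sides of the plot are negatively correlated.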
At first glance we have 5 main groups of variables. The one in the top right corner is more attack-oriented. We can find goals, offsides or dribbles there. Negatively correlated with them is the group in the bottom left corner. It contains more defensive attributes: clearances, blocked shots or long passes.
I won’t describe every group in such detail. I just want to mention one lone variable: TouDefPen. This statistic shows touches in a player’s own penalty area. It is the only stat that is 100% characteristic of goalkeepers.
Let’s dig into the quality of our components. To do so, a plot showing the percentage of explained variance will be provided. It will help us better understand the results of our PCA.
As said previously, the first 3 components describe the majority of the variance. On the plot this is even more visible. The first two components are far more important than the others. From the 4th dimension onward, no component is responsible for more than 5% of the variance. Quite poor…
A question could arise: which variables contribute the most to each component? Or in other words, how can we interpret each dimension?
To learn this we will analyze the first component. I won’t do it for more components, since something more interesting will be performed later ;)
Let’s show Dimension 1:
## TouAtt3rd SCA RecProg TouDef3rd DriAtt
## 0.1995550 0.1898025 0.1879124 -0.1874853 0.1765267
## PresAtt3rd CarDis TouAttPen CarMis Shots
## 0.1757626 0.1747973 0.1745943 0.1703377 0.1698483
## DriPast TouDefPen CPA PPA CarProg
## 0.1676889 -0.1620910 0.1598353 0.1563879 0.1551204
## Car3rd GCA SoT PasTotPrgDist PasLonAtt
## 0.1549210 0.1546843 0.1529966 -0.1440599 -0.1430160
## Press ScaDrib Fld GcaPassLive PasBlocks
## 0.1419094 0.1372202 0.1365483 0.1325316 0.1283341
From the 25 most influential variables we can see that this dimension is more attack-oriented. Touches in the attacking 1/3 of the field is the most important factor, while touches in the defensive 1/3 point in the opposite direction. Other highly important variables are shot-creating actions, received progressive passes and shots. Overall, we could name this component “Attacking”.
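The code behind the listing above is not shown, so here is only a sketch of one way to get such a ranking: take the first column of the `prcomp` rotation matrix and sort it by absolute value. The toy variables borrow names from the dataset, but the values are random.

```r
set.seed(3)
n <- 250
attack <- rnorm(n)  # hidden "attacking" tendency driving the toy variables

toy <- data.frame(
  Shots     =  attack + rnorm(n, sd = 0.4),
  TouAtt3rd =  attack + rnorm(n, sd = 0.4),
  TouDef3rd = -attack + rnorm(n, sd = 0.4),
  CrdY      =  rnorm(n)   # unrelated to the hidden tendency
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings of the first component, ordered by absolute size
pc1 <- pca$rotation[, 1]
top <- pc1[order(abs(pc1), decreasing = TRUE)][1:3]
print(round(top, 3))
```

The overall sign of a component is arbitrary, so what matters is which variables share a sign and which oppose each other, exactly as with TouAtt3rd and TouDef3rd above.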
One way to better interpret certain components is to perform a rotation. This gives us a simpler structure and lets us dig deeper into the meaning of each dimension. We will rotate the TOP 3 components because the next ones don’t explain much variance. The cut-off point will be set to 0.4 to better visualize significant loadings.
## Warning in log(det(m.inv.r)): NaNs produced
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
## Warning in principal(stats, nfactors = 3, rotate = "varimax"): The matrix is
## not positive semi-definite, scores found from Structure loadings
##
## Loadings:
## RC1 RC2 RC3
## Goals 0.610
## Shots 0.823
## SoT 0.764
## PasTotPrgDist -0.683 0.491
## PasLonAtt -0.645 -0.428
## Assists 0.561
## PPA 0.650 0.509
## PasFK -0.511
## PasCrs 0.520
## SCA 0.826
## ScaDrib 0.654
## ScaSh 0.544
## ScaFld 0.581
## GCA 0.704
## GcaPassLive 0.585
## PresAtt3rd 0.838
## BlkSh -0.542
## Clr -0.625
## TouDef3rd -0.876
## TouAtt3rd 0.862
## TouAttPen 0.864
## DriAtt 0.788
## DriPast 0.735
## Car3rd 0.616 0.543
## CPA 0.769
## CarMis 0.826
## CarDis 0.800
## RecProg 0.917
## Fld 0.578
## Off 0.574
## Crs 0.520
## PasTotAtt 0.897
## PasShoAtt 0.759
## PasMedAtt -0.493 0.761
## Pas3rd 0.767
## PasProg 0.790
## PasAtt 0.897
## Sw 0.625
## PasGround 0.832
## Touches 0.901
## TouMid3rd 0.745 0.527
## TouLive 0.851
## Carries 0.908
## CarTotDist 0.851
## CarPrgDist 0.808
## CarProg 0.599 0.641
## Rec 0.880
## PaswHead 0.504
## Tkl 0.842
## TklDef3rd 0.691
## TklMid3rd 0.721
## TklDriAtt 0.767
## TklDriPast 0.663
## Press 0.515 0.732
## PresDef3rd 0.808
## PresMid3rd 0.449 0.667
## Blocks 0.730
## BlkPass 0.689
## Int 0.632
## Tkl+Int 0.827
## TouDefPen -0.609 -0.635
## Fls 0.625
## Recov -0.430 0.426 0.550
## AerWon% 0.521
## ShoDist 0.496
## ShoPK
## CrsPA
## TB
## PasPress 0.437 0.423
## CK
## PasHigh -0.497
## PasOff
## PasInt 0.499
## PasBlocks 0.492
## GcaDrib
## GcaSh
## GcaFld
## GcaDef
## TklAtt3rd 0.484
## BlkShSv
## Err
## DriMegs 0.470
## CrdY
## CrdR
## PKwon
## PKcon
## OG
##
## RC1 RC2 RC3
## SS loadings 20.608 14.716 11.150
## Proportion Var 0.237 0.169 0.128
## Cumulative Var 0.237 0.406 0.534
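For intuition about what the rotation does, here is a base-R analogue using `stats::varimax()` on a toy `prcomp` fit. This is not the `psych::principal` call that produced the output above, just a sketch of the same idea on random data with made-up structure.

```r
set.seed(11)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)

# Toy data: three "attacking" and three "defending" variables (random values)
toy <- data.frame(
  Shots  = f1 + rnorm(n, sd = 0.5),
  SoT    = f1 + rnorm(n, sd = 0.5),
  Goals  = f1 + rnorm(n, sd = 0.5),
  Tkl    = f2 + rnorm(n, sd = 0.5),
  Int    = f2 + rnorm(n, sd = 0.5),
  Blocks = f2 + rnorm(n, sd = 0.5)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
k <- 2

# Loadings on the correlation scale: eigenvectors times component standard deviations
raw_load <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], "*")

# Orthogonal varimax rotation of the retained loadings
rot <- varimax(raw_load)

# Hide small loadings, as with the 0.4 cut-off used in the report
print(rot$loadings, cutoff = 0.4)

# The rotation redistributes variance between components but keeps each
# variable's communality (row sum of squared loadings) unchanged
h2_raw <- rowSums(raw_load^2)
h2_rot <- rowSums(unclass(rot$loadings)^2)
```

This also explains why the cumulative variance of the three rotated components (0.534) matches the cumulative proportion of the first three unrotated ones, while the individual shares differ.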
We will address each component individually.
Dimension 1
23.7% explained variance
As described in the analysis of the non-rotated loadings, this component is mainly offense-oriented. It is most positively correlated with received progressive passes, touches in the opponent’s penalty area and presses in the attacking 1/3 of the field. On the other hand, the most negative correlations come from touches in the defensive 1/3 and clearances. As said before, we can give this dimension the umbrella name “Attacking”.
Dimension 2
16.9% explained variance
This component is more playmaking-oriented. Almost all statistics connected with playing the ball are positively correlated here. Among the highest values are factors like the number of controlled balls, total pass attempts and received passes. We can name this component “Playmaking”.
Dimension 3
12.8% explained variance
We had attacking, we had playmaking, so time for some defending! The last component on the podium reflects stats more characteristic of defensive players. We can find metrics associated with protecting one’s own penalty area there. The most positively correlated variables are the number of players tackled, presses in the defensive 1/3 and blocks. The strong negative correlation with touches in the own penalty area could be quite surprising. For the answer we have to look more broadly: that statistic is the only one that is extremely characteristic of goalkeepers. They almost always touch the ball in their own penalty area. That’s why for every player other than a goalkeeper it will be negatively correlated. After all these considerations we can name this component “Defending”.
We can use the knowledge from this analysis to better interpret further results.
The next aspect to analyze is complexity. It shows how many components are needed to explain a single variable. The higher the value, the more components have loadings noticeably greater than zero for that variable. High complexity is not a good sign because it makes the interpretation more complicated.
On the graph above we can observe the complexity and the number of represented variables for each factor. From that visualization we can obtain some insights. Statistics like shots, received progressive passes or yellow cards have low complexity. That means they are highly associated with a single component and are easier to interpret. On the other side we have stats like balls recovered, shot distance or passes under pressure, which present the highest complexity values.
We could also interpret this chart in a more straightforward way. Our “desired” quarter is the bottom left, where complexity and the number of constituted variables are the lowest. If a factor is there, we’re gucci. On the other hand, when both values are high we are in the top right corner. An ideal example is Recov, which isn’t the greatest factor in our analysis.
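For reference, the complexity reported by the psych package is Hofmann's index, (Σ a²)² / Σ a⁴, computed per variable from its loadings a. Below is a hand-rolled sketch of that formula. The `Recov` row uses the rounded loadings printed in the rotated table, while the `Shots` and `Even` rows are made-up examples.

```r
# Hofmann's complexity index per variable (row) of a loadings matrix
hofmann_complexity <- function(loadings) {
  a2 <- loadings^2
  rowSums(a2)^2 / rowSums(a2^2)
}

toy_load <- rbind(
  Shots = c(0.82, 0.05, 0.03),     # hypothetical: one dominant loading
  Recov = c(-0.430, 0.426, 0.550), # rounded values from the rotated table
  Even  = c(0.50, 0.50, 0.50)      # hypothetical: spread evenly over 3 components
)
colnames(toy_load) <- c("RC1", "RC2", "RC3")

cx <- hofmann_complexity(toy_load)
print(round(cx, 2))
```

A value close to 1 means a single component is enough to describe the variable, while the maximum equals the number of components and is reached when the loadings are all equal in size.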
Next interesting aspect: uniqueness! It is the proportion of a variable’s variance that isn’t shared with other variables. We want to keep it low, which makes it easier to reduce the space to a smaller number of components. A small value is a sign that a certain variable does not carry additional information in relation to other variables.
Based on the graph, the statistics with the lowest uniqueness are mainly those connected with passes and touches of the ball. The x-axis presents the proportion of variance that is not shared with other variables. As the description suggests, the higher the value, the worse.
In this case the best and worst areas are located as for complexity. The bottom left is still the most desirable place for a variable to be. On the other hand, the top right quarter shows the worst statistics regarding uniqueness.
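For standardized variables and orthogonal components, uniqueness is simply one minus the communality, where the communality is the sum of a variable's squared loadings. A minimal sketch follows. Only the 0.917 loading of RecProg comes from the rotated table; every other number is a made-up placeholder for loadings hidden by the cut-off.

```r
toy_load <- rbind(
  RecProg = c(0.917, 0.05, 0.02),  # 0.917 is from the table, the rest is hypothetical
  OG      = c(0.03, -0.02, 0.05)   # hypothetical near-zero loadings
)
colnames(toy_load) <- c("RC1", "RC2", "RC3")

communality <- rowSums(toy_load^2)  # variance explained by the retained components
uniqueness  <- 1 - communality      # variance left unexplained

print(round(cbind(communality, uniqueness), 3))
```

This is why variables with no visible loading in the table, such as own goals or red cards, end up with uniqueness close to 1.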
We have discussed complexity and uniqueness separately. Let’s present those two metrics on one graph!
On this plot we can observe the relation between complexity and uniqueness for each variable. Based on previous knowledge, we want both metrics to be as low as possible. Once again, the bottom left quarter is the place for the best statistics regarding complexity and uniqueness. We can find variables like received progressive passes, touches in the opponent’s penalty area or number of controlled balls there.
At either the top or the right of the plot we can observe variables with an extreme value for one metric. For complexity these are passes under pressure, balls recovered or shot distance. Examples of high uniqueness are own goals or red cards.
Two red lines are visible on the plot. A sharp observer will notice that there are no variables in the artificially created top right quarter. That was my conscious decision. In order to find the worst statistics regarding complexity and uniqueness I’ve created a special data frame with the poor statistics.
I’ve set the complexity threshold to 1.8 and the uniqueness threshold to 0.78 (these are also the red lines). In my previous calculations the result of that action was a set of 4 really poor variables. Some of them had complexity near 4 and uniqueness close to 1! That’s why I decided to remove them before computing the final version.
The code used for that purpose is presented below. In the current version the output has 0 rows. It shows that no remaining variable exceeds both the complexity and the uniqueness threshold at the same time.
set <- data.frame(complexity = stats.pca2$complexity, uniqueness = stats.pca2$uniqueness)
set.worst <- set[set$complexity > 1.8 & set$uniqueness > 0.78,]
set.worst
## [1] complexity uniqueness
## <0 rows> (or 0-length row.names)
In this report we’ve performed a PCA on a dataset with 87 variables. As described above, the TOP 3 components describe about 53% of the variance. If we want to have above 80% of the variance, we have to choose 18 dimensions.
A large number of visualizations was presented. Varimax rotation and loadings analysis were also performed. With those methods the best three dimensions were explored in detail. They were given easy-to-interpret umbrella names: “Attacking”, “Playmaking” and “Defending”. The results of this analysis could once again be intuitive for someone interested in football. The reason is the character of the dataset: it gathers over 2000 players from the 5 best competitions. Randomness there is much smaller than in, for example, our Polish Ekstraklasa :)
Because of this my results may seem “too obvious”, but I interpret them another way. They confirm that the football intuition I’ve gained over the years agrees with plenty of independent data.
I’ve got something extra: I’ll use the results from PCA to cluster the data once again. I want to highlight at the beginning that I won’t provide very detailed descriptions of each step. I’ve done that in the clustering project and now we are doing it only “for the plot” ;)
First, we have to load the necessary clustering packages.
library(psych)
library(NbClust)
library(ClusterR)
library(factoextra)
library(cluster)
library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
##
## Predict
In our clustering analysis we will use varimax-rotated data. We will conduct this analysis the “easy way”, because only the TOP 18 components will be used. They account for over 80% of the variance, so a pretty nice number.
## Warning in log(det(m.inv.r)): NaNs produced
## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.
## Warning in principal(stats, nfactors = 18, rotate = "varimax"): The matrix is
## not positive semi-definite, scores found from Structure loadings
Now, we will try to find the optimal number of clusters. The NbClust function will be used first.
opt_clusters <- NbClust(stats_18D, distance = 'euclidean', min.nc = 2, max.nc = 10,
method = 'complete', index = 'silhouette')
opt_clusters$All.index
## 2 3 4 5 6 7 8 9 10
## 0.2612 0.1580 0.2398 0.2303 0.2735 0.2521 0.2527 0.2456 0.2283
## Number_clusters Value_Index
## 6.0000 0.2735
NbClust proposes 6 clusters for us. In my clustering project that function proposed 2 clusters, but I went with three because of the results of other methods. Let’s try them!
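To show what the silhouette index selection does under the hood, here is a base-R sketch that mirrors the same settings (Euclidean distance, complete linkage) on toy data with three obvious groups. The helper `mean_silhouette()` is my own hand-written function, not part of NbClust, and the data are random.

```r
set.seed(123)

# Toy data: three well-separated blobs of 40 points each
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
X <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(40, centers[i, 1], 0.5), rnorm(40, centers[i, 2], 0.5))
}))

# Average silhouette width for a given partition (hand-written helper)
mean_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  clusters <- unique(labels)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- labels == labels[i]
    if (sum(own) == 1) { s[i] <- 0; next }   # singleton cluster: width 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)     # mean distance within own cluster
    b <- min(sapply(setdiff(clusters, labels[i]),
                    function(cl) mean(d[i, labels == cl])))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

d    <- dist(X, method = "euclidean")
tree <- hclust(d, method = "complete")

ks  <- 2:6
sil <- sapply(ks, function(k) mean_silhouette(d, cutree(tree, k = k)))
names(sil) <- ks
best_k <- ks[which.max(sil)]

print(round(sil, 3))
print(best_k)
```

On real data the curve is usually much flatter than in this toy case, which is also visible in the index values printed above, where several candidates lie within a few hundredths of each other.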
The decision about the number of clusters is not easy. NbClust suggested 6. All three Optimal_Clusters_KMeans plots suggest a different number than that. The silhouette shows the highest value for 3 clusters. On the variance explained plot I’m looking for the well-known “elbow point” ;) The drops in values are significant up to 3, then they’re visibly smaller. When it comes to AIC, the lower the value, the better the fit. Taking the chart and the other information into account, 3 is also the optimal choice here.
So I will conduct further analysis with 3 clusters.
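The clustering call itself is not shown here, so the following is only a sketch of how a 3-cluster k-means on component scores could be run with base R's `kmeans()`. The two-column score matrix is random toy data, and the column names only imitate the rotated components.

```r
set.seed(2022)

# Toy "component scores": two large groups and one small, distant group
scores <- rbind(
  matrix(rnorm(60 * 2, mean = 0,  sd = 0.6), ncol = 2),
  matrix(rnorm(60 * 2, mean = 5,  sd = 0.6), ncol = 2),
  matrix(rnorm(20 * 2, mean = -6, sd = 0.6), ncol = 2)
)
colnames(scores) <- c("RC1", "RC2")

# Several random starts make the result less dependent on the initial centers
km <- kmeans(scores, centers = 3, nstart = 25)

print(km$size)               # number of observations per cluster
print(round(km$centers, 2))  # cluster centers in component space
```

The centers are expressed in the same units as the input scores, so they can be read directly in terms of the named components.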
As we can see on the plot, the areas of the clusters overlap. That is not a perfect sign, but let’s calculate the silhouette width.
## cluster size ave.sil.width
## 1 1 861 0.38
## 2 2 1280 0.36
## 3 3 209 0.68
From the silhouette we see that the results are very similar to the clustering project. Once again we have 3 clusters. One of them, the smallest one, has the greatest average width, 0.68. The result here is a little bit worse: the smallest cluster in the Clustering Project had avg. sil. = 0.87.
The biggest cluster here has the same avg. sil. as the corresponding one in the Clustering Project (0.36). The biggest difference between clustering on the PCA results and the “original” one is the third cluster. In my previous project it had avg. sil. = 0.18. Here, it’s equal to 0.38.
Thanks to this the average silhouette width increased to 0.40. In the Clustering Project it was 0.34. It is only a slight increase in the quality of clustering.
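The overall value of 0.40 can be recovered from the table above as a size-weighted mean of the per-cluster widths. Since the printed widths are rounded to two decimals, the result is only approximate.

```r
# Cluster sizes and average silhouette widths from the table above
sizes  <- c(861, 1280, 209)
widths <- c(0.38, 0.36, 0.68)

# The overall average weights each cluster by its number of players
overall <- weighted.mean(widths, w = sizes)

print(sum(sizes))          # players included in the clustering
print(round(overall, 2))
```

Because the goalkeeper-like cluster is so small, its high width moves the overall average far less than the two big clusters do.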
Finally, let’s go quickly through the cluster centers. I will interpret only the first 3 factors because their detailed interpretation was conducted before. Quick review: PCA1 (RC1) - Attacking, PCA2 (RC2) - Playmaking, PCA3 (RC3) - Defending.
## [1] "Cluster Centers:"
## RC1 RC2 RC3 RC4 RC5 RC6 RC8
## 1 18.005544 -6.348205 -2.943917 5.637515 -6.215380 6.268122 4.702641
## 2 -7.984966 7.991702 5.735153 -1.384932 4.944140 -2.802302 -1.939366
## 3 -25.272808 -22.792221 -22.996572 -14.742522 -4.674912 -8.659840 -7.495623
## RC7 RC9 RC12 RC15 RC14 RC11 RC10
## 1 3.813888 0.5261882 2.609494 2.1810058 0.9490802 -1.3135302 -0.5468197
## 2 -1.895366 0.2050389 -1.334746 -1.4194490 -0.1169470 0.7211619 0.3848417
## 3 -4.103774 -3.4234347 -2.575592 -0.2916329 -3.1936168 0.9945565 -0.1042373
## RC16 RC18 RC17 RC13
## 1 -0.2123549 -1.1728312 0.8183964 -1.0288779
## 2 0.4061842 0.7442271 -0.4704431 0.4191509
## 3 -1.6128145 0.2736698 -0.4902970 1.6715344
Cluster 1 has an extremely high result for the Attacking factor, so we can assume that these are offensive players. Cluster 2 has a negative value for PCA1 but pretty high ones for PCA2 and PCA3. Players from that cluster show both midfielders’ and defenders’ characteristics, so we can name them midfielders & defenders. The last cluster, the smallest one, has extremely low values for each of the 3 factors. With knowledge from the previous project I can say that these are goalkeepers. The only “valid” stat for them is touches in their own penalty area. The rest of their stats are at a very low level.
To sum up, clustering performed on the PCA results doesn’t differ significantly from the original version. Once again, this could be caused by the “legitness” of my dataset :)