Dimension Reduction of TOP 5 European Leagues Player Statistics

Cezary Kuźmowicz

“Hello there” ~ Obi-Wan Kenobi

We are meeting (I hope) once again! This project can be treated as the next episode of my clustering analysis. Here, I’ll again use statistics of football players from the TOP 5 European leagues.

I’ll reuse some of the text from the clustering part. I hope you will still enjoy this! :)

Football is the most popular sport on Earth. Millions of people around the globe play it on a daily basis. Most countries have their own national leagues, but the best of the best play in Europe. In football nomenclature, a concept often used when sharing graphs and statistics is the “TOP 5 European Leagues”. It simply means the five best national competitions in Europe.

This group consists of the English Premier League, Spanish La Liga, Italian Serie A, German Bundesliga and French Ligue 1. In this report I will conduct a dimension reduction analysis using player statistics from the 2021/22 season.

To achieve satisfactory results, many tests and visualizations will be presented. As the dimension reduction method I’ve chosen PCA, mainly because it is easier to interpret (at least for me).

Loading necessary packages

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(ggfortify)
library(gridExtra)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(car) # instead of maptools

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

ANALYSIS PREPARATION

Data Preparation

Loading dataset into R

My dataset comes from Kaggle. It was prepared from fbref data, a website gathering a huge amount of information on plenty of sports.

raw_stats <- read.csv("/Users/czarek/Downloads/2021-2022 Football Player Stats.csv", sep = ";", dec = ".",
                      check.names = FALSE)

The raw dataset includes nearly 3000 observations (individual players), each described by 143 variables! It’s important to add that every statistic is calculated “per 90 minutes”. This means the author has already unified the data, so that’s one step less for us!

The first step of our data preparation is excluding footballers who played fewer than 180 minutes over the whole season. That ensures we don’t analyze a player who scored 4 goals in the only match he played.

stats_180 <- raw_stats[raw_stats$Min >= 180,]

This operation removed nearly 600 observations. That should improve both the overview of our dataset and the quality of the analysis.
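A minimal sketch of the same filter on a made-up data frame. The `toy` object and all its values are hypothetical; only the `Min` column name and the 180-minute threshold come from the report. It also counts how many rows the threshold removes.

```r
# Toy stand-in for raw_stats; the values are invented for illustration.
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  Min    = c(90, 180, 2500, 45),
  Goals  = c(4.0, 0.5, 0.3, 0.0)
)

# Keep players with at least 180 minutes, as in the report.
toy_180 <- toy[toy$Min >= 180, ]

# Number of players dropped by the threshold.
n_removed <- nrow(toy) - nrow(toy_180)

print(toy_180)
print(n_removed)
```

Player "A", with 4 goals per 90 in a single match, is exactly the kind of observation the threshold is meant to remove.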

In this case, I’ve also dropped some columns based on my domain knowledge. The dataset contained variables describing a single statistic in almost the same way. For example, for total passes we had: PasTotAtt (total attempts), PasTotCmp (completed attempts), PasCmp% (% of completed attempts). That’s why I decided to keep only the total-attempts versions (I applied this to every similar case).

to_stay <- c("Goals", "Shots", "SoT", "ShoDist", "ShoPK", "PasTotAtt", "PasTotPrgDist", "PasShoAtt",
  "PasMedAtt","PasLonAtt", "Assists", "Pas3rd", "PPA", "CrsPA", "PasProg", "PasAtt", "PasFK", "TB", 
  "PasPress","Sw", "PasCrs", "CK", "PasGround", "PasHigh", "PaswHead", "PasOff", "PasInt", "PasBlocks",
  "SCA", "ScaDrib", "ScaSh", "ScaFld", "GCA", "GcaPassLive", "GcaDrib", "GcaSh", "GcaFld", "GcaDef", 
  "Tkl", "TklDef3rd","TklMid3rd", "TklAtt3rd", "TklDriAtt", "TklDriPast","Press", "PresDef3rd","PresMid3rd",
  "PresAtt3rd", "Blocks", "BlkSh", "BlkShSv", "BlkPass", "Int", "Tkl+Int", "Clr", "Err", "Touches", 
  "TouDefPen", "TouDef3rd", "TouMid3rd", "TouAtt3rd", "TouAttPen", "TouLive","DriAtt",  "DriPast","DriMegs",
  "Carries", "CarTotDist", "CarPrgDist", "CarProg", "Car3rd", "CPA", "CarMis", "CarDis", "Rec", "RecProg", 
  "CrdY", "CrdR", "Fls", "Fld", "Off", "Crs", "PKwon", "PKcon", "OG", "Recov", "AerWon%")

only_stats <- stats_180[,to_stay]

That’s how I got rid of over 50 variables. Thanks to this, the results of the dimension reduction should be more relevant.
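A manual selection can still leave exact duplicates behind. In the standardized summary shown later in this report, PasTotAtt and PasAtt have identical quantiles, and so do PasCrs and Crs, which suggests these pairs may carry the same values. Below is a small sketch of how such columns could be detected. The `toy` data frame is hypothetical and this check is not part of the original analysis.

```r
# Toy data: column "b" is an exact copy of "a" (hypothetical values).
toy <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4), c = c(4, 1, 3, 2))

# duplicated() on a list flags elements identical to an earlier one.
dup_cols <- names(toy)[duplicated(as.list(toy))]
print(dup_cols)

# Pairs with an absolute correlation of (numerically) 1 are redundant too.
cors <- abs(cor(toy))
cors[upper.tri(cors, diag = TRUE)] <- 0 # keep each pair only once
redundant <- which(cors > 0.999, arr.ind = TRUE)
print(redundant)
```

On the real data the same two checks could be run on `only_stats`; any column reported there adds no information for PCA.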

DIMENSION REDUCTION

To begin, I’ll standardize my data. The preProcess function finds the mean and SD of every variable, and then predict applies the transformation.

preproc1 <- preProcess(only_stats, method = c("center", "scale"))
only_stats_predicted <- predict(preproc1, only_stats)
summary(only_stats_predicted)

##      Goals             Shots              SoT             ShoDist        
##  Min.   :-0.7031   Min.   :-1.2090   Min.   :-0.9602   Min.   :-2.16358  
##  1st Qu.:-0.7031   1st Qu.:-0.7782   1st Qu.:-0.7842   1st Qu.:-0.47746  
##  Median :-0.4136   Median :-0.2961   Median :-0.3569   Median : 0.09863  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.3392   3rd Qu.: 0.6680   3rd Qu.: 0.5417   3rd Qu.: 0.66066  
##  Max.   : 7.5779   Max.   : 4.0219   Max.   : 4.9467   Max.   : 7.84071  
##      ShoPK           PasTotAtt        PasTotPrgDist        PasShoAtt       
##  Min.   :-0.2521   Min.   :-2.07878   Min.   :-1.56473   Min.   :-2.25758  
##  1st Qu.:-0.2521   1st Qu.:-0.72395   1st Qu.:-0.76748   1st Qu.:-0.66343  
##  Median :-0.2521   Median :-0.08236   Median :-0.08466   Median :-0.07459  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.2521   3rd Qu.: 0.62437   3rd Qu.: 0.61919   3rd Qu.: 0.55732  
##  Max.   :12.9450   Max.   : 3.73787   Max.   : 6.07584   Max.   : 4.72222  
##    PasMedAtt          PasLonAtt          Assists            Pas3rd       
##  Min.   :-1.71214   Min.   :-1.5624   Min.   :-0.7275   Min.   :-1.4693  
##  1st Qu.:-0.78991   1st Qu.:-0.7292   1st Qu.:-0.7275   1st Qu.:-0.7475  
##  Median :-0.07672   Median :-0.1458   Median :-0.3096   Median :-0.1512  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.60880   3rd Qu.: 0.5300   3rd Qu.: 0.4426   3rd Qu.: 0.5110  
##  Max.   : 3.93189   Max.   : 6.4677   Max.   :10.1374   Max.   : 5.6872  
##       PPA              CrsPA            PasProg             PasAtt        
##  Min.   :-1.1817   Min.   :-0.7702   Min.   :-1.70669   Min.   :-2.07878  
##  1st Qu.:-0.8194   1st Qu.:-0.7702   1st Qu.:-0.68055   1st Qu.:-0.72395  
##  Median :-0.1884   Median :-0.3812   Median :-0.07586   Median :-0.08236  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.5932   3rd Qu.: 0.4358   3rd Qu.: 0.61435   3rd Qu.: 0.62437  
##  Max.   : 4.9570   Max.   : 7.0108   Max.   : 4.58454   Max.   : 3.73787  
##      PasFK               TB             PasPress             Sw         
##  Min.   :-1.0972   Min.   :-0.6094   Min.   :-2.8370   Min.   :-1.3242  
##  1st Qu.:-0.8078   1st Qu.:-0.6094   1st Qu.:-0.6227   1st Qu.:-0.7081  
##  Median :-0.2090   Median :-0.6094   Median :-0.0394   Median :-0.2433  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5395   3rd Qu.: 0.3041   3rd Qu.: 0.6011   3rd Qu.: 0.4485  
##  Max.   : 6.8866   Max.   : 8.5258   Max.   : 3.8983   Max.   : 6.1126  
##      PasCrs              CK            PasGround          PasHigh       
##  Min.   :-0.9739   Min.   :-0.4626   Min.   :-1.9080   Min.   :-1.9102  
##  1st Qu.:-0.8063   1st Qu.:-0.4626   1st Qu.:-0.7271   1st Qu.:-0.7319  
##  Median :-0.3035   Median :-0.4626   Median :-0.1668   Median :-0.1303  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6052   3rd Qu.:-0.1895   3rd Qu.: 0.5452   3rd Qu.: 0.5098  
##  Max.   : 7.0807   Max.   : 5.7770   Max.   : 4.6603   Max.   : 6.6615  
##     PaswHead            PasOff            PasInt           PasBlocks       
##  Min.   :-1.70141   Min.   :-1.0673   Min.   :-1.97775   Min.   :-1.65784  
##  1st Qu.:-0.72788   1st Qu.:-0.7701   1st Qu.:-0.69450   1st Qu.:-0.75013  
##  Median :-0.07223   Median :-0.1758   Median :-0.06024   Median :-0.06935  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.65296   3rd Qu.: 0.4928   3rd Qu.: 0.61826   3rd Qu.: 0.67195  
##  Max.   : 4.40803   Max.   : 6.3616   Max.   : 5.39728   Max.   : 5.14999  
##       SCA              ScaDrib            ScaSh             ScaFld       
##  Min.   :-1.44614   Min.   :-0.6253   Min.   :-0.7931   Min.   :-0.7001  
##  1st Qu.:-0.82346   1st Qu.:-0.6253   1st Qu.:-0.7931   1st Qu.:-0.7001  
##  Median :-0.08756   Median :-0.6253   Median :-0.3050   Median :-0.4192  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.60588   3rd Qu.: 0.2796   3rd Qu.: 0.4273   3rd Qu.: 0.3534  
##  Max.   : 4.46528   Max.   : 8.3598   Max.   : 6.6106   Max.   : 6.3235  
##       GCA           GcaPassLive         GcaDrib            GcaSh        
##  Min.   :-0.9659   Min.   :-0.8854   Min.   :-0.3096   Min.   :-0.3962  
##  1st Qu.:-0.9659   1st Qu.:-0.8854   1st Qu.:-0.3096   1st Qu.:-0.3962  
##  Median :-0.2375   Median :-0.2332   Median :-0.3096   Median :-0.3962  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4909   3rd Qu.: 0.4841   3rd Qu.:-0.3096   3rd Qu.:-0.3962  
##  Max.   : 6.5127   Max.   : 7.5925   Max.   :12.5305   Max.   :10.4093  
##      GcaFld           GcaDef            Tkl             TklDef3rd       
##  Min.   :-0.358   Min.   :-0.214   Min.   :-1.76688   Min.   :-1.34760  
##  1st Qu.:-0.358   1st Qu.:-0.214   1st Qu.:-0.67948   1st Qu.:-0.78725  
##  Median :-0.358   Median :-0.214   Median :-0.02493   Median :-0.09106  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.358   3rd Qu.:-0.214   3rd Qu.: 0.67185   3rd Qu.: 0.65607  
##  Max.   :11.489   Max.   :15.255   Max.   : 4.44081   Max.   : 5.64828  
##    TklMid3rd          TklAtt3rd         TklDriAtt          TklDriPast     
##  Min.   :-1.47357   Min.   :-1.1448   Min.   :-1.67002   Min.   :-1.5186  
##  1st Qu.:-0.70559   1st Qu.:-0.7929   1st Qu.:-0.69776   1st Qu.:-0.7044  
##  Median :-0.09573   Median :-0.1394   Median :-0.08728   Median :-0.1228  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.60448   3rd Qu.: 0.5644   3rd Qu.: 0.61365   3rd Qu.: 0.5838  
##  Max.   : 4.60245   Max.   : 7.7031   Max.   : 4.25396   Max.   : 5.8743  
##      Press            PresDef3rd         PresMid3rd         PresAtt3rd     
##  Min.   :-2.19371   Min.   :-1.77376   Min.   :-1.82796   Min.   :-1.2066  
##  1st Qu.:-0.62706   1st Qu.:-0.71508   1st Qu.:-0.70791   1st Qu.:-0.8578  
##  Median : 0.05739   Median : 0.04503   Median :-0.05353   Median :-0.1899  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.66580   3rd Qu.: 0.66560   3rd Qu.: 0.68344   3rd Qu.: 0.6752  
##  Max.   : 2.94732   Max.   : 3.65138   Max.   : 3.64059   Max.   : 4.3393  
##      Blocks             BlkSh            BlkShSv          BlkPass        
##  Min.   :-1.94270   Min.   :-0.9242   Min.   :-0.216   Min.   :-1.80805  
##  1st Qu.:-0.65432   1st Qu.:-0.7487   1st Qu.:-0.216   1st Qu.:-0.66381  
##  Median : 0.03304   Median :-0.3098   Median :-0.216   Median :-0.01706  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000   Mean   : 0.00000  
##  3rd Qu.: 0.66860   3rd Qu.: 0.4510   3rd Qu.:-0.216   3rd Qu.: 0.62968  
##  Max.   : 5.26946   Max.   : 6.6249   Max.   :16.994   Max.   : 6.84840  
##       Int              Tkl+Int              Clr               Err         
##  Min.   :-1.58550   Min.   :-1.87289   Min.   :-1.0713   Min.   :-0.4005  
##  1st Qu.:-0.78056   1st Qu.:-0.69486   1st Qu.:-0.7531   1st Qu.:-0.4005  
##  Median : 0.01288   Median : 0.08436   Median :-0.3266   Median :-0.4005  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67983   3rd Qu.: 0.69179   3rd Qu.: 0.4929   3rd Qu.: 0.1773  
##  Max.   : 5.72796   Max.   : 4.56948   Max.   : 4.3453   Max.   :18.8571  
##     Touches           TouDefPen          TouDef3rd         TouMid3rd       
##  Min.   :-2.20332   Min.   :-0.68082   Min.   :-1.4029   Min.   :-2.15054  
##  1st Qu.:-0.75442   1st Qu.:-0.54146   1st Qu.:-0.8793   1st Qu.:-0.58386  
##  Median :-0.07258   Median :-0.35855   Median :-0.1817   Median :-0.04498  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.64859   3rd Qu.: 0.05672   3rd Qu.: 0.7135   3rd Qu.: 0.59584  
##  Max.   : 3.63163   Max.   : 4.89352   Max.   : 3.5001   Max.   : 3.75002  
##    TouAtt3rd          TouAttPen          TouLive             DriAtt       
##  Min.   :-1.52449   Min.   :-1.1154   Min.   :-2.31058   Min.   :-1.1240  
##  1st Qu.:-0.86183   1st Qu.:-0.7605   1st Qu.:-0.67057   1st Qu.:-0.7755  
##  Median : 0.07789   Median :-0.3706   Median :-0.05971   Median :-0.2267  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.68347   3rd Qu.: 0.6330   3rd Qu.: 0.61090   3rd Qu.: 0.5296  
##  Max.   : 3.51309   Max.   : 4.4835   Max.   : 3.70502   Max.   : 5.3643  
##     DriPast           DriMegs           Carries           CarTotDist      
##  Min.   :-1.1105   Min.   :-0.6217   Min.   :-2.06355   Min.   :-1.97513  
##  1st Qu.:-0.7604   1st Qu.:-0.6217   1st Qu.:-0.71087   1st Qu.:-0.71822  
##  Median :-0.2227   Median :-0.6217   Median :-0.09269   Median :-0.06664  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.4901   3rd Qu.: 0.3171   3rd Qu.: 0.61497   3rd Qu.: 0.59400  
##  Max.   : 6.1046   Max.   :12.4441   Max.   : 4.36473   Max.   : 4.56649  
##    CarPrgDist         CarProg            Car3rd             CPA         
##  Min.   :-1.9518   Min.   :-1.5466   Min.   :-1.3179   Min.   :-0.7464  
##  1st Qu.:-0.7410   1st Qu.:-0.6949   1st Qu.:-0.7920   1st Qu.:-0.7464  
##  Median :-0.1013   Median :-0.1237   Median :-0.1260   Median :-0.3790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5752   3rd Qu.: 0.5530   3rd Qu.: 0.5956   3rd Qu.: 0.3172  
##  Max.   : 4.8386   Max.   : 4.9038   Max.   : 4.5247   Max.   : 8.4583  
##      CarMis            CarDis             Rec             RecProg       
##  Min.   :-1.1537   Min.   :-1.2325   Min.   :-2.2469   Min.   :-1.0569  
##  1st Qu.:-0.8235   1st Qu.:-0.8262   1st Qu.:-0.6802   1st Qu.:-0.9102  
##  Median :-0.2614   Median :-0.1674   Median :-0.1384   Median :-0.3047  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6576   3rd Qu.: 0.6013   3rd Qu.: 0.5664   3rd Qu.: 0.7681  
##  Max.   : 7.0282   Max.   : 5.5974   Max.   : 4.3849   Max.   : 3.4459  
##       CrdY              CrdR              Fls                Fld         
##  Min.   :-1.1845   Min.   :-0.2937   Min.   :-1.76809   Min.   :-1.4838  
##  1st Qu.:-0.7231   1st Qu.:-0.2937   1st Qu.:-0.65810   1st Qu.:-0.7370  
##  Median :-0.1464   Median :-0.2937   Median :-0.05685   Median :-0.1324  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.4880   3rd Qu.:-0.2937   3rd Qu.: 0.58404   3rd Qu.: 0.5670  
##  Max.   : 8.0429   Max.   :14.9060   Max.   : 4.54831   Max.   : 4.4434  
##       Off               Crs              PKwon             PKcon        
##  Min.   :-0.6203   Min.   :-0.9739   Min.   :-0.2877   Min.   :-0.3678  
##  1st Qu.:-0.6203   1st Qu.:-0.8063   1st Qu.:-0.2877   1st Qu.:-0.3678  
##  Median :-0.4490   Median :-0.3035   Median :-0.2877   Median :-0.3678  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2021   3rd Qu.: 0.6052   3rd Qu.:-0.2877   3rd Qu.:-0.3678  
##  Max.   : 6.4729   Max.   : 7.0807   Max.   :12.8239   Max.   :12.4946  
##        OG              Recov              AerWon%       
##  Min.   :-0.1901   Min.   :-2.429940   Min.   :-2.1958  
##  1st Qu.:-0.1901   1st Qu.:-0.792977   1st Qu.:-0.5139  
##  Median :-0.1901   Median :-0.006436   Median : 0.1048  
##  Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.:-0.1901   3rd Qu.: 0.696261   3rd Qu.: 0.7285  
##  Max.   :21.9727   Max.   : 3.267491   Max.   : 2.8548

The results are clear - every variable has a mean equal to 0. The dataset is well prepared for further analysis.
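For intuition, centering and scaling is just a z-score per column. The sketch below uses a hypothetical `toy` data frame and base R only, and checks that the manual formula matches `scale()`.

```r
# Toy data (hypothetical values) standing in for only_stats.
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 15, 30, 25))

# Manual z-score: subtract the column mean, divide by the column SD.
z_manual <- as.data.frame(lapply(toy, function(v) (v - mean(v)) / sd(v)))

# Base R equivalent of centering and scaling.
z_scale <- scale(toy, center = TRUE, scale = TRUE)

print(round(colMeans(z_manual), 10))             # column means are 0
print(sapply(z_manual, sd))                      # column SDs are 1
print(max(abs(as.matrix(z_manual) - z_scale)))   # difference is ~0
```

Besides the zero means visible in the summary, every standardized column should also have a standard deviation of 1, which `summary()` does not show directly.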

Covariance and eigenvalues

In the next step I will compute the covariance matrix. Because the data are standardized, it is the same as the correlation matrix. With it, I can perform an eigen-decomposition: the eigenvalues show how much variance each component explains, and therefore how important each component is.

only_stats_predicted.cov <- cov(only_stats_predicted)
only_stats_predicted.eigen <- eigen(only_stats_predicted.cov)
only_stats_predicted.eigen$values

##  [1] 2.085804e+01 1.835735e+01 7.257873e+00 4.019115e+00 2.694859e+00
##  [6] 2.417349e+00 1.992602e+00 1.644692e+00 1.520041e+00 1.326265e+00
## [11] 1.187489e+00 1.113714e+00 1.031181e+00 1.007011e+00 9.727987e-01
## [16] 9.481628e-01 9.178450e-01 9.036441e-01 8.492774e-01 8.415536e-01
## [21] 7.920640e-01 7.692663e-01 7.486301e-01 7.263357e-01 6.713695e-01
## [26] 6.196703e-01 5.940074e-01 5.659104e-01 5.394744e-01 5.002340e-01
## [31] 4.843110e-01 4.360038e-01 4.316396e-01 4.217109e-01 4.108663e-01
## [36] 3.782525e-01 3.607542e-01 3.558669e-01 3.479115e-01 3.336735e-01
## [41] 3.219335e-01 2.993852e-01 2.849657e-01 2.687422e-01 2.591075e-01
## [46] 2.557401e-01 2.365938e-01 2.270151e-01 2.031520e-01 1.956902e-01
## [51] 1.912948e-01 1.822992e-01 1.655536e-01 1.593010e-01 1.583929e-01
## [56] 1.376131e-01 1.284020e-01 1.157858e-01 9.947336e-02 9.000246e-02
## [61] 8.909214e-02 8.000108e-02 6.987183e-02 6.882059e-02 5.423179e-02
## [66] 4.694702e-02 4.140641e-02 3.669594e-02 3.490325e-02 3.306607e-02
## [71] 3.100752e-02 2.226825e-02 1.949984e-02 1.440471e-02 1.316622e-02
## [76] 7.669267e-03 4.668464e-03 1.712842e-03 6.101561e-04 4.247715e-04
## [81] 2.120239e-04 2.184301e-05 1.912316e-05 1.091479e-05 5.230782e-06
## [86] 1.646285e-15 3.565051e-16

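From the eigenvalues we can see that only the first 14 components have a value above 1. With 87 variables, that means the variance is spread across many components, which is a sign that our dataset does not compress very well. The last two eigenvalues are numerically zero, which means some columns are exact linear combinations of others.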
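A small sketch of how the eigenvalues translate into shares of variance and into the "eigenvalue above 1" rule (the Kaiser criterion). It uses `USArrests`, a dataset shipped with R, purely as a stand-in for the player data.

```r
# USArrests (shipped with R) is used as a stand-in for the player data.
z <- scale(USArrests)

# Eigen-decomposition of the covariance matrix of standardized data.
ev <- eigen(cov(z))$values

# Share of total variance carried by each component.
share <- ev / sum(ev)

# Kaiser criterion: count components with an eigenvalue above 1.
n_kaiser <- sum(ev > 1)

print(round(ev, 3))
print(round(share, 3))
print(n_kaiser)
```

For standardized data the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more than a single original variable would.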

PCA Analysis

We can finally perform the PCA. We don’t have to center or scale the data because that has been done already.

stats <- only_stats_predicted # for easier references :)
stats.pca1 <- prcomp(stats, center = FALSE, scale. = FALSE)
summary(stats.pca1)

## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6    PC7
## Standard deviation     4.5671 4.2845 2.69404 2.0048 1.64160 1.55478 1.4116
## Proportion of Variance 0.2397 0.2110 0.08342 0.0462 0.03098 0.02779 0.0229
## Cumulative Proportion  0.2397 0.4507 0.53418 0.5804 0.61135 0.63913 0.6620
##                           PC8     PC9    PC10    PC11   PC12    PC13    PC14
## Standard deviation     1.2825 1.23290 1.15164 1.08972 1.0553 1.01547 1.00350
## Proportion of Variance 0.0189 0.01747 0.01524 0.01365 0.0128 0.01185 0.01157
## Cumulative Proportion  0.6809 0.69841 0.71366 0.72731 0.7401 0.75196 0.76354
##                           PC15   PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.98631 0.9737 0.95804 0.95060 0.92156 0.91736 0.8900
## Proportion of Variance 0.01118 0.0109 0.01055 0.01039 0.00976 0.00967 0.0091
## Cumulative Proportion  0.77472 0.7856 0.79617 0.80655 0.81631 0.82599 0.8351
##                           PC22   PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.87708 0.8652 0.85225 0.81937 0.78719 0.77072 0.7523
## Proportion of Variance 0.00884 0.0086 0.00835 0.00772 0.00712 0.00683 0.0065
## Cumulative Proportion  0.84393 0.8525 0.86089 0.86860 0.87573 0.88255 0.8891
##                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.7345 0.70727 0.69592 0.66031 0.65699 0.64939 0.64099
## Proportion of Variance 0.0062 0.00575 0.00557 0.00501 0.00496 0.00485 0.00472
## Cumulative Proportion  0.8953 0.90101 0.90658 0.91159 0.91655 0.92140 0.92612
##                           PC36    PC37    PC38   PC39    PC40   PC41    PC42
## Standard deviation     0.61502 0.60063 0.59655 0.5898 0.57764 0.5674 0.54716
## Proportion of Variance 0.00435 0.00415 0.00409 0.0040 0.00384 0.0037 0.00344
## Cumulative Proportion  0.93047 0.93461 0.93870 0.9427 0.94654 0.9502 0.95368
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.53382 0.51840 0.50903 0.50571 0.48641 0.47646 0.45072
## Proportion of Variance 0.00328 0.00309 0.00298 0.00294 0.00272 0.00261 0.00234
## Cumulative Proportion  0.95696 0.96004 0.96302 0.96596 0.96868 0.97129 0.97363
##                           PC50   PC51   PC52   PC53    PC54    PC55    PC56
## Standard deviation     0.44237 0.4374 0.4270 0.4069 0.39913 0.39799 0.37096
## Proportion of Variance 0.00225 0.0022 0.0021 0.0019 0.00183 0.00182 0.00158
## Cumulative Proportion  0.97588 0.9781 0.9802 0.9821 0.98390 0.98572 0.98731
##                           PC57    PC58    PC59    PC60    PC61    PC62   PC63
## Standard deviation     0.35833 0.34027 0.31539 0.30000 0.29848 0.28284 0.2643
## Proportion of Variance 0.00148 0.00133 0.00114 0.00103 0.00102 0.00092 0.0008
## Cumulative Proportion  0.98878 0.99011 0.99126 0.99229 0.99331 0.99423 0.9950
##                           PC64    PC65    PC66    PC67    PC68   PC69    PC70
## Standard deviation     0.26234 0.23288 0.21667 0.20349 0.19156 0.1868 0.18184
## Proportion of Variance 0.00079 0.00062 0.00054 0.00048 0.00042 0.0004 0.00038
## Cumulative Proportion  0.99583 0.99645 0.99699 0.99747 0.99789 0.9983 0.99867
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.17609 0.14923 0.13964 0.12002 0.11474 0.08757 0.06833
## Proportion of Variance 0.00036 0.00026 0.00022 0.00017 0.00015 0.00009 0.00005
## Cumulative Proportion  0.99903 0.99928 0.99951 0.99967 0.99982 0.99991 0.99997
##                           PC78    PC79    PC80    PC81     PC82     PC83
## Standard deviation     0.04139 0.02470 0.02061 0.01456 0.004674 0.004373
## Proportion of Variance 0.00002 0.00001 0.00000 0.00000 0.000000 0.000000
## Cumulative Proportion  0.99999 0.99999 1.00000 1.00000 1.000000 1.000000
##                            PC84     PC85      PC86      PC87
## Standard deviation     0.003304 0.002287 2.028e-15 3.247e-16
## Proportion of Variance 0.000000 0.000000 0.000e+00 0.000e+00
## Cumulative Proportion  1.000000 1.000000 1.000e+00 1.000e+00

Here are the results! The first insight is that the first 8 components account for 68% of the total variance. If we would like more than 80%, we need 18 components. It is also worth noticing that the first 3 components describe over 50% of the variance.
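Instead of reading the thresholds off the table, the number of components needed for a given share of variance can be computed from the `sdev` vector. A minimal sketch on the stand-in dataset `USArrests`; the helper `n_for` is a hypothetical name introduced here.

```r
# USArrests is a stand-in dataset; with the report's objects you would
# use stats.pca1 in place of pca.
pca <- prcomp(scale(USArrests), center = FALSE, scale. = FALSE)

# Proportion and cumulative proportion of variance from the sdev vector.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches a target.
n_for <- function(target) which(cum_var >= target)[1]

print(round(cum_var, 4))
print(n_for(0.80))
```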

Visualisations

Next we can visualize the relationships between all variables. The interpretation is quite intuitive:

If variables are grouped together, they are positively correlated
If variables are on opposite sides of the plot, they are negatively correlated

With that bit of theory in hand, we can interpret the plot.
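Plots of this kind are often drawn with factoextra’s `fviz_pca_var()`. Whether that function was used here is an assumption, since the plotting code is not shown. The sketch below computes the underlying variable coordinates in base R on the stand-in dataset `USArrests`.

```r
z   <- scale(USArrests) # stand-in data
pca <- prcomp(z, center = FALSE, scale. = FALSE)

# Variable coordinates on the correlation circle: loading * component SD.
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# For standardized data these equal the variable-component correlations.
print(round(var_coord[, 1:2], 3))
print(max(abs(var_coord - cor(z, pca$x))))

# Variables whose PC1/PC2 coordinates point the same way are positively
# correlated; opposite directions indicate negative correlation.
```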

At first glance we have 5 main groups of variables. The one in the top right corner is more attack-oriented: there we find goals, offsides and dribbles. Negatively correlated with them is the group in the bottom left corner, which contains more defensive attributes - clearances, blocked shots or long passes.

I won’t describe every group in such detail. I just want to mention one lone variable - TouDefPen. This statistic counts touches in a player’s own penalty area. It is the only stat that is 100% characteristic of goalkeepers.

Let’s dig into the quality of our components. For this, a plot showing the percentage of explained variance is provided. It helps to better understand the results of our PCA.

As said previously, the first 3 components describe the majority of the variance. On the plot this is even more visible. The first two components are far more important than the others. From the 4th dimension onwards, no component is responsible for more than 5% of the variance. Quite poor…

A question may arise - which variables contribute the most to each component? Or in other words - how can we interpret each dimension?

To find out, we will analyze the first component. I won’t do it for more components - something more interesting will be performed later ;)

Let’s show Dimension 1:

##     TouAtt3rd           SCA       RecProg     TouDef3rd        DriAtt 
##     0.1995550     0.1898025     0.1879124    -0.1874853     0.1765267 
##    PresAtt3rd        CarDis     TouAttPen        CarMis         Shots 
##     0.1757626     0.1747973     0.1745943     0.1703377     0.1698483 
##       DriPast     TouDefPen           CPA           PPA       CarProg 
##     0.1676889    -0.1620910     0.1598353     0.1563879     0.1551204 
##        Car3rd           GCA           SoT PasTotPrgDist     PasLonAtt 
##     0.1549210     0.1546843     0.1529966    -0.1440599    -0.1430160 
##         Press       ScaDrib           Fld   GcaPassLive     PasBlocks 
##     0.1419094     0.1372202     0.1365483     0.1325316     0.1283341

From the 25 most influential variables we can see that this dimension is more attack-oriented. Touches in the attacking 1/3 of the field is the most important factor, while touches in the defensive 1/3 point in the opposite direction. Other highly important variables are shot-creating actions, received progressive passes and shots. Overall, we could name this component “Attacking”.
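A ranking like the one above can be obtained by sorting the first column of the rotation matrix by absolute value. This is a sketch on the stand-in dataset `USArrests`, not necessarily the code that produced the output in this report.

```r
# Stand-in data; with the report's objects you would use stats.pca1.
pca <- prcomp(scale(USArrests), center = FALSE, scale. = FALSE)

# Loadings of the first component, ordered by absolute size.
pc1 <- pca$rotation[, 1]
top <- pc1[order(abs(pc1), decreasing = TRUE)]

# Show the k largest (k = 25 in the report; 3 here for the small example).
k <- 3
print(head(top, k))
```

Keeping the sign while ordering by absolute value is what lets negative loadings such as TouDef3rd appear among the most influential variables.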

Rotation

One way to better interpret certain components is to perform a rotation. This gives a simpler structure and lets us dig deeper into the meaning of each dimension. We will rotate the TOP 3 components because the remaining ones don’t explain much variance. The cut-off point is set to 0.4 to better visualize the significant loadings.

## Warning in log(det(m.inv.r)): NaNs produced

## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.

## Warning in principal(stats, nfactors = 3, rotate = "varimax"): The matrix is
## not positive semi-definite, scores found from Structure loadings

## 
## Loadings:
##               RC1    RC2    RC3   
## Goals          0.610              
## Shots          0.823              
## SoT            0.764              
## PasTotPrgDist -0.683  0.491       
## PasLonAtt     -0.645        -0.428
## Assists        0.561              
## PPA            0.650  0.509       
## PasFK         -0.511              
## PasCrs         0.520              
## SCA            0.826              
## ScaDrib        0.654              
## ScaSh          0.544              
## ScaFld         0.581              
## GCA            0.704              
## GcaPassLive    0.585              
## PresAtt3rd     0.838              
## BlkSh         -0.542              
## Clr           -0.625              
## TouDef3rd     -0.876              
## TouAtt3rd      0.862              
## TouAttPen      0.864              
## DriAtt         0.788              
## DriPast        0.735              
## Car3rd         0.616  0.543       
## CPA            0.769              
## CarMis         0.826              
## CarDis         0.800              
## RecProg        0.917              
## Fld            0.578              
## Off            0.574              
## Crs            0.520              
## PasTotAtt             0.897       
## PasShoAtt             0.759       
## PasMedAtt     -0.493  0.761       
## Pas3rd                0.767       
## PasProg               0.790       
## PasAtt                0.897       
## Sw                    0.625       
## PasGround             0.832       
## Touches               0.901       
## TouMid3rd             0.745  0.527
## TouLive               0.851       
## Carries               0.908       
## CarTotDist            0.851       
## CarPrgDist            0.808       
## CarProg        0.599  0.641       
## Rec                   0.880       
## PaswHead                     0.504
## Tkl                          0.842
## TklDef3rd                    0.691
## TklMid3rd                    0.721
## TklDriAtt                    0.767
## TklDriPast                   0.663
## Press          0.515         0.732
## PresDef3rd                   0.808
## PresMid3rd     0.449         0.667
## Blocks                       0.730
## BlkPass                      0.689
## Int                          0.632
## Tkl+Int                      0.827
## TouDefPen     -0.609        -0.635
## Fls                          0.625
## Recov         -0.430  0.426  0.550
## AerWon%                      0.521
## ShoDist                      0.496
## ShoPK                             
## CrsPA                             
## TB                                
## PasPress              0.437  0.423
## CK                                
## PasHigh       -0.497              
## PasOff                            
## PasInt                0.499       
## PasBlocks      0.492              
## GcaDrib                           
## GcaSh                             
## GcaFld                            
## GcaDef                            
## TklAtt3rd      0.484              
## BlkShSv                           
## Err                               
## DriMegs        0.470              
## CrdY                              
## CrdR                              
## PKwon                             
## PKcon                             
## OG                                
## 
##                   RC1    RC2    RC3
## SS loadings    20.608 14.716 11.150
## Proportion Var  0.237  0.169  0.128
## Cumulative Var  0.237  0.406  0.534
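The warning above shows that the rotation was run with psych’s `principal(stats, nfactors = 3, rotate = "varimax")`. As a rough illustration of what a varimax rotation does, the sketch below swaps in base R’s `varimax()` on the stand-in dataset `USArrests` with 2 components. It is not the author’s code and will not reproduce the table above.

```r
z   <- scale(USArrests) # stand-in data
pca <- prcomp(z, center = FALSE, scale. = FALSE)

# Unrotated loadings of the first k components, scaled by component SD
# so that they are variable-component correlations.
k <- 2
raw_load <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], `*`)

# Varimax rotation from base R's stats package.
rot <- varimax(raw_load)

# Print with small loadings blanked out, similar to the table above.
print(rot$loadings, cutoff = 0.4)

# Rotation redistributes variance between components but keeps each
# variable's communality (row sum of squared loadings) unchanged.
comm_before <- rowSums(raw_load^2)
comm_after  <- rowSums(unclass(rot$loadings)^2)
print(max(abs(comm_before - comm_after)))
```

This also explains why the cumulative variance of the three rotated components (0.534) matches the first three unrotated ones: rotation only reshuffles the variance among them.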

We will address each component individually.

Dimension 1
23.7% explained variance
As described in the analysis of the non-rotated loadings, this component is mainly offense-oriented. Its strongest positive correlations are with received progressive passes, touches in the opponent’s penalty area and presses in the attacking 1/3 of the field. On the other hand, the strongest negative correlation comes from touches in the defensive 1/3, followed by progressive passing distance, long passes and clearances. As said before, we can give this dimension the umbrella name “Attacking”.

Dimension 2
16.9% explained variance
This component is more playmaking-oriented. Almost all statistics connected with playing the ball are positively correlated with it. Among the highest values are the number of controlled balls, total pass attempts and received balls. We can name this component “Playmaking”.

Dimension 3
12.8% explained variance
We had attacking, we had playmaking - time for some defending! The last component on the podium reflects stats more characteristic of defensive players. Here we find metrics associated with protecting one’s own penalty area. The most positively correlated variables are the number of players tackled, presses in the defensive 1/3 and blocked balls. Quite surprising may be the strong negative correlation with touches in a player’s own penalty area. For the answer we have to look more broadly - that statistic is the only one extremely characteristic of goalkeepers. They almost always touch the ball in their own penalty area. That’s why it is negatively correlated for every player other than them. After all these considerations we can name this component “Defending”.

We can use the knowledge from this analysis to better interpret further results.

Complexity

The next aspect to analyze is complexity. It shows how many components are needed to describe a single variable. The higher the value, the more of its loadings are noticeably different from zero. High complexity is not a good sign because it makes the variable harder to interpret.
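The complexity reported by psych is Hofmann’s index, computed per variable from its loadings as (sum of squared loadings)^2 divided by the sum of fourth powers. A tiny sketch with hypothetical loadings shows how it behaves:

```r
# Hypothetical rotated loadings for three variables on three components.
L <- rbind(
  one_component = c(0.9, 0.0, 0.0),
  two_equal     = c(0.6, 0.6, 0.0),
  three_equal   = c(0.5, 0.5, 0.5)
)

# Hofmann's complexity index per variable: (sum a^2)^2 / sum a^4.
complexity <- rowSums(L^2)^2 / rowSums(L^4)
print(complexity)
```

The three rows give values of 1, 2 and 3 (up to floating-point rounding): a variable loading on a single component has complexity 1, and the index grows as the loadings spread evenly over more components.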

On the graph above we can observe complexity and number of represented variables for each factor. From that visualization we can obtain some insights. Statistics like shots, received progressive passes or yellow cards have low complexity. That means they’re highly associated with a single component and are easier to interpret. On the other side we have stats like balls recovered, shots from distance or passes under pressure - they present the highest complexity values.

We can also interpret this chart in a more straightforward way. Our “desired” quarter is the bottom left, where complexity and the number of constituted variables are the lowest. If a factor is there, we’re gucci. On the other hand, when both values are high we are in the top right corner. The ideal example is Recov, which isn’t the greatest factor in our analysis.
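To make the complexity value less abstract: as far as I know, psych reports Hofmann’s index, which for one variable is the squared sum of its squared loadings divided by the sum of their fourth powers. Below is a minimal sketch with made-up loadings. The hofmann_complexity function and the toy matrix are assumptions for illustration only, not part of my analysis.

```r
# Hofmann's complexity index for each variable (row) of a loadings matrix:
# (sum of squared loadings)^2 / (sum of fourth powers of loadings).
hofmann_complexity <- function(loadings) {
  l2 <- loadings^2
  rowSums(l2)^2 / rowSums(l2^2)
}

# Hypothetical loadings of three variables on two components.
toy <- rbind(
  one_component  = c(0.9, 0.0),  # loads on a single component
  two_components = c(0.6, 0.6),  # split evenly between two components
  in_between     = c(0.8, 0.3)   # mostly one component, a bit of the other
)

hofmann_complexity(toy)
```

A variable sitting on exactly one component gets complexity 1, an even split between two components gets 2, and everything else lands in between.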

Uniqueness

Next interesting aspect - uniqueness! It is the proportion of a variable’s variance that isn’t shared with other variables. We want to keep it low, because that makes it easy to reduce the space to a smaller number of components. A small value is a sign that a variable does not carry much additional information in relation to the other variables.

Based on the graph, the statistics with the lowest uniqueness are mainly those connected with passes and touches of the ball. The x-axis presents the proportion of variance that is not shared with other variables. As the description suggests - the higher the value, the worse.

Here the best and worst areas are located the same way as for complexity. Bottom left is still the most desirable place for a variable to be. On the other hand, the top right quarter shows the worst statistics regarding uniqueness.
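Uniqueness can also be computed by hand: it is 1 minus the communality, and the communality of a variable is the sum of its squared loadings on the kept components. A minimal sketch with made-up loadings follows. The variable names and numbers are hypothetical and only mimic the pattern seen in this report.

```r
# Hypothetical standardized loadings of three variables on two components.
toy <- rbind(
  passes    = c(0.85, 0.40),
  touches   = c(0.70, 0.55),
  red_cards = c(0.10, 0.15)
)

communality <- rowSums(toy^2)   # variance captured by the kept components
uniqueness  <- 1 - communality  # variance left unexplained by them

round(uniqueness, 2)
```

In this toy case the two ball-playing stats end up with low uniqueness, while the card stat keeps almost all of its variance to itself.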

Why not both?

We have discussed complexity and uniqueness separately. Let’s present those two metrics on one graph!

On this plot we can observe the relation between complexity and uniqueness for each variable. Based on previous knowledge, we want both metrics to be as low as possible. Once again, the bottom left quarter is the place for the best statistics regarding complexity and uniqueness. We find there variables like received progressive passes, touches in the opponent’s penalty area or number of controlled balls.

At either the top or the right of the plot we can observe variables with an extreme value for one metric. For complexity these are passes under pressure, number of balls recovered or shots from distance. Examples of high uniqueness are own goals or red cards.

Two red lines are visible on the plot. A smart observer may notice that there are no variables in the artificially created top right quarter. That was my conscious decision. In order to find the worst statistics regarding complexity and uniqueness I created a special data frame for the poor ones.

I set the complexity threshold to 1.8 and the uniqueness threshold to 0.78 (these are the red lines). In my previous calculations this filter returned a set of 4 really poor variables. Some of them had complexity near 4 and uniqueness close to 1! That’s why I decided to remove them before computing the final version.

The code used for that purpose is presented below. In the current version the output has 0 rows. It shows that no remaining variable exceeds both the complexity and the uniqueness threshold at once.

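# Collect complexity and uniqueness of every variable from the PCA object
set <- data.frame(complexity = stats.pca2$complexity, uniqueness = stats.pca2$uniqueness)
# Keep only variables above both thresholds (the red lines on the plot)
set.worst <- set[set$complexity > 1.8 & set$uniqueness > 0.78,]
set.worst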

## [1] complexity uniqueness
## <0 rows> (or 0-length row.names)

Summary of Dimension Reduction

In this report we’ve performed a PCA on a dataset with 87 variables. As described above, the TOP 3 components describe nearly 54% of variance. If we want to cover above 80% of variance, we have to choose 18 dimensions.

A large number of visualizations was presented. Varimax rotation and loadings analysis were also performed. Using those methods, the best three dimensions were explored in detail. They were given easy-to-interpret umbrella names: “Attacking”, “Playmaking” and “Defending”. The results of this analysis may once again be intuitive for someone interested in football. The reason is the character of the dataset - it gathers over 2000 players from the 5 best competitions. There is much less randomness there than in, for example, our Polish Ekstraklasa :)

Because of this my results may seem “too obvious”, but I interpret them another way. They confirm that the football intuition I’ve built over the years is in line with plenty of independent data.
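For anyone who wants to repeat the “how many dimensions for 80% of variance” check from this summary, here is a minimal sketch. It runs on the built-in mtcars data with base R’s prcomp, so the numbers will not match the football dataset. The 0.80 threshold is the one from this report and everything else is an assumption for illustration.

```r
# Toy illustration on the built-in mtcars data, not the football dataset.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of variance per component and its running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share reaches 80%.
n_components <- which(cum_share >= 0.80)[1]

n_components
round(cum_share[1:n_components], 3)
```

The same logic applied to the eigenvalues of the football PCA is what leads to the 18 dimensions mentioned above.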

Clustering using PCA results (extra part)

I’ve got something extra - I’ll use the results from PCA to cluster the data once again. I want to highlight it at the beginning - I won’t provide very detailed descriptions of each step. I’ve done that in the clustering project and now we are doing it only “for the plot” ;)

First, we have to load necessary clustering packages.

library(psych)       
library(NbClust)     
library(ClusterR)    
library(factoextra)
library(cluster)  
library(flexclust)

## Loading required package: grid

## Loading required package: modeltools

## Loading required package: stats4

## 
## Attaching package: 'modeltools'

## The following object is masked from 'package:car':
## 
##     Predict

In our clustering analysis we will use varimax-rotated data. We will conduct this analysis the “easy way” because only the TOP 18 components will be used. They account for over 80% of variance, so a pretty nice number.

stats.pca3<-principal(stats, nfactors=18, rotate="varimax")

## Warning in log(det(m.inv.r)): NaNs produced

## The determinant of the smoothed correlation was zero.
## This means the objective function is not defined for the null model either.
## The Chi square is thus based upon observed correlations.

## Warning in principal(stats, nfactors = 18, rotate = "varimax"): The matrix is
## not positive semi-definite, scores found from Structure loadings

stats_18D <- data.frame(stats.pca3$scores)

Now we will try to find the optimal number of clusters. First we’ll use the NbClust function.

opt_clusters <- NbClust(stats_18D, distance = 'euclidean', min.nc = 2, max.nc = 10, 
                        method = 'complete', index = 'silhouette')
opt_clusters$All.index

##      2      3      4      5      6      7      8      9     10 
## 0.2612 0.1580 0.2398 0.2303 0.2735 0.2521 0.2527 0.2456 0.2283

opt_clusters$Best.nc

## Number_clusters     Value_Index 
##          6.0000          0.2735

NbClust proposes 6 clusters for us. In my project about clustering that function proposed 2 clusters but I went with three due to the results of other methods. Let’s try them!

The decision about the number of clusters is not easy. NbClust suggested 6. All three Optimal_Clusters_KMeans plots suggest a different number. Silhouette shows the highest value for 3 clusters. On the variance explained plot I’m looking for the well-known “elbow point” ;) Drops in values are significant till 3, then they’re visibly smaller. When it comes to AIC - the lower the value, the better the fit. Considering the chart and the other information, 3 is also the optimal choice here.
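The “elbow point” reading can be sketched with plain kmeans from base R. This is not the Optimal_Clusters_KMeans call used for the plots, only a minimal illustration of the same idea on simulated data with three well-separated groups (the data, the seed and the range of k are all assumptions).

```r
# Simulated data with three well-separated groups (an assumption for the demo).
set.seed(42)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Share of total variance explained by the clustering for k = 2..8.
ks <- 2:8
explained <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 20)
  fit$betweenss / fit$totss
})

# Gain from adding one more cluster; the "elbow" is where the gain collapses.
gain <- c(NA, diff(explained))
data.frame(k = ks, explained = round(explained, 3), gain = round(gain, 3))
```

In the printed table I look for the last k with a big gain. After that point extra clusters add very little, which is exactly the elbow.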

So I will conduct further analysis with 3 clusters.

As we can see on the plot, the cluster areas overlap. That is not a perfect sign, but let’s calculate the silhouette width.

##   cluster size ave.sil.width
## 1       1  861          0.38
## 2       2 1280          0.36
## 3       3  209          0.68

From the silhouette we see that the results are very similar to the clustering project. Once again we have 3 clusters. The smallest one has the greatest avg. width = 0.68. That result is a little bit worse here - the smallest cluster in the Clustering Project had avg. sil. = 0.87.

The biggest cluster here has an identical avg. sil. to the corresponding one in the Clustering Project (0.36). The biggest difference between clustering on PCA results and the “original” one is the third cluster. In my previous project it had avg. sil. = 0.18. Here, it’s equal to 0.38.

Thanks to this the average silhouette width increased to 0.40. In the Clustering Project it was 0.34. That is only a slight increase in clustering quality.
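The overall 0.40 can be checked from the silhouette table itself, because the overall average is the size-weighted mean of the per-cluster averages. A quick sketch using only the numbers printed above (they are rounded to two decimals, so the result is approximate):

```r
# Cluster sizes and average silhouette widths as printed in the table above.
size <- c(861, 1280, 209)
ave_sil_width <- c(0.38, 0.36, 0.68)

# The overall average is the size-weighted mean of the per-cluster averages.
overall <- weighted.mean(ave_sil_width, w = size)

round(overall, 2)
```

This gives about 0.40, in line with the value quoted above.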

Finally, let’s quickly go through the cluster centers. I will interpret only the first 3 factors because they were interpreted in detail before. Quick review: PCA1 - Attacking, PCA2 - Playmaking, PCA3 - Defending (RC1, RC2 and RC3 in the output below).

## [1] "Cluster Centers:"

##          RC1        RC2        RC3        RC4       RC5       RC6       RC8
## 1  18.005544  -6.348205  -2.943917   5.637515 -6.215380  6.268122  4.702641
## 2  -7.984966   7.991702   5.735153  -1.384932  4.944140 -2.802302 -1.939366
## 3 -25.272808 -22.792221 -22.996572 -14.742522 -4.674912 -8.659840 -7.495623
##         RC7        RC9      RC12       RC15       RC14       RC11       RC10
## 1  3.813888  0.5261882  2.609494  2.1810058  0.9490802 -1.3135302 -0.5468197
## 2 -1.895366  0.2050389 -1.334746 -1.4194490 -0.1169470  0.7211619  0.3848417
## 3 -4.103774 -3.4234347 -2.575592 -0.2916329 -3.1936168  0.9945565 -0.1042373
##         RC16       RC18       RC17       RC13
## 1 -0.2123549 -1.1728312  0.8183964 -1.0288779
## 2  0.4061842  0.7442271 -0.4704431  0.4191509
## 3 -1.6128145  0.2736698 -0.4902970  1.6715344

Cluster 1 has an extremely high result on the Attacking factor - we can assume that these are offensive players. Cluster 2 has a negative value for PCA1 but pretty high ones for PCA2 and 3. Players from that cluster show both midfielders’ and defenders’ characteristics - we can name them midfielders & defenders. The last and smallest cluster has extremely low values for each of the 3 factors. With knowledge from the previous project I can say that these are goalkeepers. The only “valid” stat for them is touches in own penalty area. The rest of their stats are at a very low level.

To sum up - clustering performed on PCA results doesn’t differ significantly from the original version. Once again, it could be caused by the “legitness” of my dataset :)