Introduction

We often tend to judge the quality of a goalkeeper simply by how good he is at saving shots, or by spectacular saves that require amazing reflexes. In modern football, however, there are other factors that determine how good a goalkeeper is. More and more often the man between the posts is required to act as an 11th outfield player and to help defenders play the ball up the pitch under the opposing team's pressure. Joe Hart, at the time England's no. 1 goalkeeper, was axed from Manchester City's first team by Pep Guardiola because he couldn't pass the ball effectively. The Man City legend was then replaced by a goalkeeper who felt more comfortable with the ball at his feet. On the other hand, a team defending with a deep block needs a goalkeeper who is proficient at saving shots from distance, as well as dominant in the air to intercept the inevitable crosses into the box. A goalkeeper is therefore usually part of a system deployed by the club's manager, and on the transfer market a club looks for a player who will suit the team's tactics and style. Clubs using statistics to identify potential transfer targets may want to use clustering or other algorithms to find the right player. Nowadays, however, there is a large amount of statistics available and we may not always know which ones distinguish players best. I will show how to use Principal Component Analysis to reduce the dimension of a dataset and keep only the variables that best explain the variation among observations.

Preparing the data

Our dataset comes from the fbref.com site (link). Please note that all the metrics are converted to a per-90-minutes format; if you want to download the data yourself, check "Toggle per90 Stats" and then download the csv file. The data contains various statistics for all the goalkeepers from the five big European leagues.

Advanced statistics such as xG (expected goals) and xA (expected assists) are supplied by StatsBomb. More information on what xG is and how it is calculated can be found under this link.

For data manipulation I will use the data.table package.

# Install data.table if it is not available, then load it
if (!require(data.table)) {
    install.packages("data.table")
    library(data.table)
}

Load the data.

data_90 <- as.data.table(read.csv("goalies_90.csv", encoding = "UTF-8"))
head(data_90)
##    X.U.FEFF.Rk                                     Player Nation Pos     Squad
## 1:           1                              Adrián\\Adrian es ESP  GK Liverpool
## 2:           2   Rúnar Alex Rúnarsson\\Runar-Alex-Runarsson is ISL  GK   Arsenal
## 3:           3                           Alisson\\Alisson br BRA  GK Liverpool
## 4:           4         Saturnin Allagbé\\Saturnin-Allagbe bj BEN  GK     Dijon
## 5:           5            Nícolas Andrade\\Nicolas-Andrade br BRA  GK   Udinese
## 6:           6           Alphonse Areola\\Alphonse-Areola fr FRA  GK    Fulham
##                  Comp    Age Born X90s   GA  PKA   FK   CK OG PSxG PSxG.SoT
## 1: eng Premier League 34-050 1987  2.0 4.50 0.00 0.00 1.00  0 3.55     0.45
## 2: eng Premier League 26-004 1995  0.2 0.00 0.00 0.00 0.00  0 0.50     0.04
## 3: eng Premier League 28-143 1992 21.0 1.14 0.19 0.05 0.14  0 1.21     0.34
## 4:         fr Ligue 1 27-092 1993  4.0 2.00 0.00 0.25 0.00  0 1.52     0.30
## 5:         it Serie A 32-316 1988  2.0 2.50 0.00 0.00 0.00  0 2.00     0.58
## 6: eng Premier League 27-361 1993 24.0 1.21 0.21 0.00 0.08  0 1.47     0.29
##    PSxG...  X.90  Cmp   Att Cmp. Att.1   Thr Launch. AvgLen Att.2 Launch..1
## 1:   -0.95 -0.93 4.50 12.50 36.0  24.0  7.00    37.5   36.0  7.00      50.0
## 2:    0.50  0.50 0.00 15.00  0.0  50.0 10.00    30.0   31.8  5.00       0.0
## 3:    0.07  0.07 3.62  7.81 46.3  28.6  4.67    21.8   30.4  4.90      32.0
## 4:   -0.47 -0.48 5.00 13.30 37.7  25.8  5.00    40.8   35.1  6.75      40.7
## 5:   -0.50 -0.48 3.00  6.50 46.2  15.5  2.50    25.8   32.1  4.50      55.6
## 6:    0.26  0.26 4.79 12.30 39.1  19.5  4.25    40.6   35.1  7.04      61.5
##    AvgLen.1   Opp  Stp Stp. X.OPA X.OPA.90 AvgDist Matches
## 1:     38.7  9.00 1.00 11.1  0.00     0.00    6.15 Matches
## 2:     30.0 20.00 0.00  0.0  0.00     0.00   87.50 Matches
## 3:     34.4  5.52 0.43  7.8  1.33     1.33    0.88 Matches
## 4:     34.9  8.75 0.00  0.0  1.25     1.25    4.10 Matches
## 5:     47.4  8.00 0.00  0.0  0.00     0.00    7.60 Matches
## 6:     45.0 10.00 1.21 12.1  0.63     0.63    0.60 Matches

Select only players who played at least 5 full games (90s), keep only complete rows, and finally select and rename the appropriate columns. I will also delete the first column because it is not needed. After the filtering there is no missing data, so there is no need to handle missing values. I will also fix the players' names, because each row contains two variants of the name separated by a backslash. There are also two rows describing the same player, so I will keep only rows with a distinct player name.

setDT(data_90)
# Keep goalkeepers with at least 5 full games (90s) and drop the Rk column
data_90 <- data_90[X90s >= 5, -1]
sum(is.na(data_90))
## [1] 0
# Keep only the first of the two name variants (drop everything after "\")
data_90[, Player := sub("\\\\.*", "", Player)]
library(tidyverse)
# Keep a single row per distinct player name
data_90 <- distinct(data_90, Player, .keep_all = TRUE)

After inspecting all the variables I will also delete the ones that contain personal details, do not depend on the goalkeeper, or duplicate other columns. That is:
- Columns 2:6 - information about a player.
- Column 7 - number of full games played by a player - not dependent on him.
- Columns 8:13 - statistics such as goals conceded or penalty kicks allowed, which mainly depend on the whole team.
- Column 14 - post-shot expected goals - a statistic describing the total quality of the shots a goalkeeper was facing; it doesn't depend on the goalkeeper at all.
- Column 15 - post-shot expected goals per shot on target. The same statistic as the previous one, only divided by the number of shots. Not needed either.
- Column 16 - post-shot expected goals minus goals allowed. This is an important variable because it reflects the true quality of shot-stopping ability: it compares the number of goals an average goalkeeper would have conceded from the shots faced with the actual number of goals conceded. However, it is an absolute value biased by the number of minutes played, so I will keep only the next column, which is the same statistic per 90 minutes.
- Column 25 - goal kicks attempted. This also does not depend on the goalkeeper.
- Column 28 - opponent's attempted crosses into the penalty area. Same reasoning as the previous one.
- Column 32 - the same information as in column 31.

We are left with the following columns (per 90 minutes where applicable):
- Player name,
- Post shot expected goals minus goals allowed per 90 minutes (PSxG90),
- Long passes (longer than 40 yards) completed (long_cmpl),
- Long passes attempted (long_att),
- Long passes completion percentage (long_cmpl_perc),
- Passes attempted (passes_att),
- Throws attempted (throws_att),
- Percentage of passes that were launched (long passes) (launch_perc),
- Pass average length (pass_avg_length),
- Percentage of goal kicks that were launched (goal_kick_launch_perc),
- Average length of a goal kick (goal_kick_avg_length),
- Number of opponent's crosses into penalty area that were successfully stopped by goalkeeper (crosses_stp),
- Percentage of opponent's crosses into penalty area that were successfully stopped by goalkeeper (crosses_stp_perc),
- Number of defensive actions outside of the penalty area (out_penalty_area_actions90),
- Average distance from goal of all defensive actions (def_action_distance).

data_90 <- data_90[, c(1, 17:24, 26:27, 29:31, 33)]
setnames(data_90, c("Player", "PSxG90", "long_cmpl", "long_att", "long_cmpl_perc", "passes_att", "throws_att", "launch_perc", "pass_avg_length", "goal_kick_launch_perc", "goal_kick_avg_length", "crosses_stp", "crosses_stp_perc", "out_penalty_area_actions90", "def_action_distance"))
head(data_90)
##             Player PSxG90 long_cmpl long_att long_cmpl_perc passes_att
## 1:         Alisson   0.07      3.62     7.81           46.3       28.6
## 2: Alphonse Areola   0.26      4.79    12.30           39.1       19.5
## 3:   Sergio Asenjo  -0.20      5.50    11.70           47.1       26.8
## 4:     Emil Audero  -0.03      6.70    16.10           41.5       24.8
## 5:     Édgar Badía   0.06      8.18    16.30           50.1       30.3
## 6:  Oliver Baumann  -0.08      6.14    14.50           42.4       32.1
##    throws_att launch_perc pass_avg_length goal_kick_launch_perc
## 1:       4.67        21.8            30.4                  32.0
## 2:       4.25        40.6            35.1                  61.5
## 3:       5.46        31.5            34.2                  47.5
## 4:       5.26        46.6            38.8                  62.9
## 5:       4.95        36.0            34.1                  57.8
## 6:       5.24        36.8            34.3                  39.2
##    goal_kick_avg_length crosses_stp crosses_stp_perc out_penalty_area_actions90
## 1:                 34.4        0.43              7.8                       1.33
## 2:                 45.0        1.21             12.1                       0.63
## 3:                 41.6        1.25             11.5                       0.42
## 4:                 45.7        0.78              7.8                       0.43
## 5:                 39.3        0.68              6.5                       0.73
## 6:                 35.6        0.29              3.3                       1.10
##    def_action_distance
## 1:                0.88
## 2:                0.60
## 3:                0.55
## 4:                0.54
## 5:                0.60
## 6:                0.74

I will also standardize the data. Standardization is required because the variables are in different units (for example, long pass completion rate is a percentage while long passes completed is a count) and we might otherwise give greater importance to variables with larger magnitudes. To avoid this, each variable is transformed to have a mean of 0 and unit variance. At the same time I will create another data frame without player names (only numerical variables) for clustering.

# Standardize all numeric columns; data_90_cl additionally drops the Player column
data_90_cl <- as.data.frame(lapply(data_90[, 2:15], scale))
data_90[, 2:15] <- as.data.frame(lapply(data_90[, 2:15], scale))
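
A quick sanity check (not part of the original workflow) that the scaling did what was described - every column should now have a mean of (numerically) zero and a standard deviation of one:

# After standardization every column should have mean ~0 and sd ~1
round(colMeans(data_90_cl), 10)
round(sapply(data_90_cl, sd), 10)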

Summary of the data after tidying.

summary(data_90_cl)
##      PSxG90           long_cmpl          long_att        long_cmpl_perc    
##  Min.   :-2.80317   Min.   :-2.3851   Min.   :-2.42516   Min.   :-1.93200  
##  1st Qu.:-0.53338   1st Qu.:-0.6365   1st Qu.:-0.68364   1st Qu.:-0.64451  
##  Median :-0.01379   Median :-0.1342   Median :-0.05125   Median :-0.06549  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.66988   3rd Qu.: 0.5773   3rd Qu.: 0.53250   3rd Qu.: 0.65659  
##  Max.   : 2.17396   Max.   : 3.7072   Max.   : 2.84805   Max.   : 2.99995  
##    passes_att         throws_att         launch_perc      pass_avg_length  
##  Min.   :-2.08351   Min.   :-3.602374   Min.   :-2.1775   Min.   :-2.0093  
##  1st Qu.:-0.66658   1st Qu.:-0.536087   1st Qu.:-0.6554   1st Qu.:-0.7683  
##  Median :-0.08415   Median : 0.001777   Median :-0.1096   Median :-0.1079  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.43742   3rd Qu.: 0.686750   3rd Qu.: 0.4769   3rd Qu.: 0.5452  
##  Max.   : 4.34921   Max.   : 2.907164   Max.   : 2.8261   Max.   : 2.8239  
##  goal_kick_launch_perc goal_kick_avg_length  crosses_stp      
##  Min.   :-2.1091       Min.   :-1.88539     Min.   :-2.44399  
##  1st Qu.:-0.7465       1st Qu.:-0.77529     1st Qu.:-0.69222  
##  Median : 0.1039       Median : 0.09879     Median :-0.08114  
##  Mean   : 0.0000       Mean   : 0.00000     Mean   : 0.00000  
##  3rd Qu.: 0.6910       3rd Qu.: 0.64947     3rd Qu.: 0.59106  
##  Max.   : 1.9981       Max.   : 2.70358     Max.   : 2.93354  
##  crosses_stp_perc   out_penalty_area_actions90 def_action_distance
##  Min.   :-2.64529   Min.   :-1.59106           Min.   :-0.8759    
##  1st Qu.:-0.66712   1st Qu.:-0.69033           1st Qu.:-0.5912    
##  Median :-0.06994   Median :-0.09494           Median :-0.4584    
##  Mean   : 0.00000   Mean   : 0.00000           Mean   : 0.0000    
##  3rd Qu.: 0.56456   3rd Qu.: 0.31726           3rd Qu.: 0.1962    
##  Max.   : 2.87864   Max.   : 3.90492           Max.   : 3.3175

Before performing clustering I will also take a closer look at the variables that describe goalkeepers. There are still many variables in the dataset; some may be correlated with each other and some may not distinguish players well - in other words, they may explain little of the variance in the data. I will try to reduce the number of dimensions using Principal Component Analysis (PCA), transforming the data by projecting it into a lower-dimensional space while preserving most of the information. This will help us determine which original variables should be included when designing, for example, a clustering model, and may also reveal which variables explain the most variance among goalkeepers.
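
As a quick preview of this redundancy, we can look at the correlations among a few of the passing variables (my choice of variables here is purely illustrative):

# High pairwise correlations among passing-style variables suggest the
# dimensionality can be reduced without losing much information
round(cor(data_90_cl[, c("long_att", "launch_perc", "pass_avg_length")]), 2)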

Principal Component Analysis

In Principal Component Analysis, highly correlated variables are transformed into so-called principal components. Each component is a linear combination of the initial variables and is uncorrelated with the other components. PCA yields as many principal components as there are dimensions (variables); however, the most information is captured by the first principal component, with each subsequent one explaining the maximum of the remaining variance, and so on. The main idea is to reduce the dimensionality without losing much information.

From a linear algebra point of view, principal components are eigenvectors of the covariance matrix. They correspond to the directions of the axes along which there is the most variance. The eigenvalue of an eigenvector is its coefficient and describes the amount of variance explained along that direction.
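
To make this concrete, here is a minimal sketch (not part of the original analysis) of that eigendecomposition on our standardized data. This is conceptually what prcomp() does; the exact numbers will differ slightly from the analysis below, which runs on Box-Cox-transformed data.

# Eigendecomposition of the covariance matrix of the standardized data:
# eigenvalues are the variances along each principal direction, and the
# eigenvectors are the loadings of the principal components
eig <- eigen(cov(data_90_cl))
round(eig$values, 2)   # variance explained by each direction, largest first
eig$vectors[, 1]       # direction of maximum variance (first PC, up to sign)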

Before doing the principal component analysis I will check for skewness in the data, because high skewness can affect the results of PCA.

library(psych)
skew(data_90_cl)
##  [1] -0.3040563  0.8641225  0.2928582  0.4978172  1.1472852 -0.3784273
##  [7]  0.5068486  0.7355048 -0.1130899  0.1397338  0.5183499  0.5228872
## [13]  1.1177733  1.7868261

A general rule of thumb is that skewness above 1 or below -1 is considered high. We have 3 such variables in the dataset, and there are also a couple of variables whose skewness lies between 0.5 and 1 (or -0.5 and -1). Therefore I will transform the data with a Box-Cox transformation. Note that the Box-Cox transformation is defined only for strictly positive values, so caret will estimate it only for the columns that qualify and leave the rest unchanged.
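
To list the offending variables explicitly, a small sketch reusing psych's skew() values computed above:

# Variables whose |skewness| exceeds the rule-of-thumb threshold of 1
skews <- skew(data_90_cl)
names(data_90_cl)[abs(skews) > 1]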

library(caret)
# Estimate Box-Cox transformations (applied only to all-positive columns)
preProcessValues <- preProcess(data_90, method = "BoxCox")
data_90_bc <- as.data.frame(predict(preProcessValues, data_90))

For the principal component analysis I will use the prcomp() function (from base R) together with the factoextra package for visualization. The scale. parameter rescales each variable to unit variance and center shifts the variables to a mean of 0, but our data is already standardized, so both are redundant and would not change anything. I will set both parameters to FALSE.

library(factoextra)
# The data is already standardized, so no centering or scaling is needed
goalies_pca <- prcomp(data_90_bc[2:14], scale. = FALSE, center = FALSE)

We can quickly inspect the eigenvalues of the calculated principal components with the get_eigenvalue() function. Each eigenvalue reflects the variance explained by the corresponding principal component, and the next column gives the proportion of the total variance explained. In our case the first principal component explains almost 43.2% of the variance, the second 13.9%, and the third 12.7%.

eigen_values <- get_eigenvalue(goalies_pca)
eigen_values
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  5.620758130      43.23660100                    43.23660
## Dim.2  1.810924186      13.93018605                    57.16679
## Dim.3  1.654756073      12.72889287                    69.89568
## Dim.4  1.103385048       8.48757729                    78.38326
## Dim.5  0.945148163       7.27037048                    85.65363
## Dim.6  0.807252992       6.20963840                    91.86327
## Dim.7  0.570870189       4.39130914                    96.25458
## Dim.8  0.297848168       2.29113975                    98.54571
## Dim.9  0.111901711       0.86078239                    99.40650
## Dim.10 0.047019785       0.36169065                    99.76819
## Dim.11 0.016689035       0.12837720                    99.89657
## Dim.12 0.007484392       0.05757225                    99.95414
## Dim.13 0.005962128       0.04586252                   100.00000
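
The same numbers can be reproduced by hand: each eigenvalue is the squared standard deviation of the corresponding PC, and its share of the total is that eigenvalue divided by the sum of all eigenvalues.

# Eigenvalues are sdev^2; percentages are each eigenvalue's share of the total
ev <- goalies_pca$sdev^2
round(100 * ev / sum(ev), 2)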

For better understanding we can also visualize the eigenvalues with a barplot (a scree plot). As can be seen, after the 3rd PC there is a significant drop in variance explained.

fviz_eig(goalies_pca)

One question now is how many principal components to retain. There is no single answer. One approach, suggested by Kaiser (1961), is to keep all the PCs with an eigenvalue greater than 1; there are 4 such PCs in our case. Another possibility is to set a subjective cut-off point for total variance explained. Our dataset doesn't have a large number of observations, so there is no need to reduce the dimensionality drastically. I will therefore settle on 7 principal components, which together explain about 96% of the variance, leaving out the 6 PCs with the lowest eigenvalues.
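
Both criteria are easy to check directly from the eigenvalue table (a small sketch; the 95% cut-off here is my illustrative choice):

# Kaiser rule: how many PCs have an eigenvalue greater than 1?
sum(eigen_values$eigenvalue > 1)
# First PC at which cumulative explained variance reaches 95%
which(eigen_values$cumulative.variance.percent >= 95)[1]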

With the corrplot package we can also visualize the quality of representation of the variables (cos2) in each PC. A high cos2 means that the variable is well represented by a given principal component. For each variable, cos2 sums to 1 across all components, so if a variable is highly represented in PC1, it will not feature much in the other PCs. As we can see, some variables load mostly on PC1 and PC2 (which explain the most variance): long passes completed (long_cmpl), long passes attempted (long_att), percentage of passes that were launched (launch_perc), pass average length (pass_avg_length), percentage of goal kicks that were launched (goal_kick_launch_perc), and goal kick average length (goal_kick_avg_length). These variables mainly capture whether a goalkeeper prefers to play short passes to defenders or long balls higher up the pitch.

library("corrplot")
var <- get_pca_var(goalies_pca)
corrplot(var$cos2[, 1:7], is.corr=FALSE)
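
We can also verify the "cos2 sums to 1" property across the full set of 13 components (a quick sanity check):

# For each variable, cos2 summed over all 13 PCs should equal 1
round(rowSums(var$cos2), 3)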

With the same function we can also see how much each variable contributed to the variability in each principal component. The most important variables are those that contribute to the first few PCs. The variables contributing the most to PC1 (which explains around 43% of the variance) are the ones mentioned previously - those connected to passing.

corrplot(var$contrib[, 1:7], is.corr=FALSE)

# Contributions of variables to PC1
a <- fviz_contrib(goalies_pca, choice = "var", axes = 1, top = 7)
# Contributions of variables to PC2
b <- fviz_contrib(goalies_pca, choice = "var", axes = 2, top = 7)
library(gridExtra)  # grid.arrange() comes from gridExtra
grid.arrange(a, b, top = "Contribution to PC1 and PC2")

It is also possible to visualize the relationships between variables with a correlation circle. In such a plot, positively correlated variables point in a similar direction and negatively correlated variables point in opposite directions. Variables close to the center of the plot are less important for the first PCs, while those far from the center are well represented.

library(RColorBrewer)
fviz_pca_var(goalies_pca,
             col.var = "contrib",
             gradient.cols = brewer.pal(3, "RdYlBu"),
             repel = TRUE)

The plot confirms that many of the variables describing the style of passing and bringing the ball into play are correlated with each other (though surprisingly not passes attempted - passes_att). The percentage of passes that were launched and the average pass length are almost perfectly correlated, which is perhaps not a surprise: the more passes are launched, the higher the average pass length will be.
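
A quick numeric check of that last observation on the (Box-Cox transformed) data:

# Correlation between the share of launched passes and average pass length
cor(data_90_bc$launch_perc, data_90_bc$pass_avg_length)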

I will now look at how individual observations contribute to the first two principal components. To do that, I first need to move the player-name column into the row names and perform PCA once again on the new data frame.

data_90_bc <- data_90_bc %>%
  remove_rownames() %>%
  column_to_rownames(var = 'Player')

goalies_pca2 <- prcomp(data_90_bc, scale. = F, center = F)

fviz_contrib(goalies_pca2, choice = "ind", axes = 1:2, top = 20)

The goalkeeper that contributes the most to the first two components is Sergio Herrera of CA Osasuna, followed by Stefan Ortega (Arminia Bielefeld) and David Soria (Getafe CF).

It is also possible to plot the observations against the first two PCs in a coordinate system.

fviz_pca_ind(goalies_pca2)

I will also check quality measures of the principal components. First I will look at complexity. Complexity describes how many components are needed to account for a given variable. Generally, the closer the complexity is to 1, the better, because composite variables are then simpler to interpret. To compute the complexity statistics I first need to run the PCA once again with another function, this one from the psych package.

library(psych)
goalies_pca3 <-principal(data_90_bc, nfactors=7, rotate="varimax")
goalies_pca3$complexity
##                     PSxG90                  long_cmpl 
##                   1.020857                   1.779176 
##                   long_att             long_cmpl_perc 
##                   1.181026                   1.214779 
##                 passes_att                 throws_att 
##                   1.329956                   2.563240 
##                launch_perc            pass_avg_length 
##                   1.142168                   1.148086 
##      goal_kick_launch_perc       goal_kick_avg_length 
##                   1.476335                   1.477736 
##                crosses_stp           crosses_stp_perc 
##                   1.229179                   1.032169 
## out_penalty_area_actions90        def_action_distance 
##                   1.055419                   1.055587

Uniqueness can be described as the proportion of a variable's variance not shared with the other variables. For PCA we want it to be low, because when much of the variance is shared among variables it is easier to reduce the dimensionality. In our case the uniqueness of all variables is relatively low - the highest is for the throws_att variable, at about 0.23. This means that throws_att carries only about 23% additional information relative to the other variables in the model.

goalies_pca3$uniqueness
##                     PSxG90                  long_cmpl 
##                0.006123207                0.023728814 
##                   long_att             long_cmpl_perc 
##                0.017404107                0.108396564 
##                 passes_att                 throws_att 
##                0.108680336                0.226812720 
##                launch_perc            pass_avg_length 
##                0.069507142                0.059985108 
##      goal_kick_launch_perc       goal_kick_avg_length 
##                0.120423485                0.145547705 
##                crosses_stp           crosses_stp_perc 
##                0.061213188                0.062577773 
## out_penalty_area_actions90        def_action_distance 
##                0.025188566                0.009110220
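
Since uniqueness is just one minus the communality (the share of a variable's variance reproduced by the retained components), we can cross-check the two elements of the psych object against each other - a sketch, assuming the object stores the communalities in its communality element:

# uniqueness should equal 1 - communality (variance reproduced by the 7 PCs)
all.equal(goalies_pca3$uniqueness, 1 - goalies_pca3$communality)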

It is also possible to show both complexity and uniqueness in one plot.

library(maptools)
{plot(goalies_pca3$complexity, goalies_pca3$uniqueness, xlim=c(0, 4))
pointLabel(goalies_pca3$complexity, goalies_pca3$uniqueness, labels=names(goalies_pca3$uniqueness), cex=0.6)}

The most problematic variables would be the ones in the top right corner; however, there are no variables in the dataset that are clearly problematic. Complexity is highest for throws attempted (about 2.6), and its uniqueness (0.23) is also the highest, but neither value is extreme; long passes completed has a fairly high complexity (about 1.8) but a very low uniqueness of 0.02. Therefore there is probably no need to exclude any variables from the analysis.

After performing the Principal Component Analysis, this is the data that can be used for clustering. It consists of 7 principal components instead of the 13 original variables, and together they explain about 96% of the variance.

head(goalies_pca$x[, 1:7])
##             PC1        PC2        PC3        PC4        PC5         PC6
## [1,] -3.0775995  0.1437375 -0.7731065 -0.7011042  1.5735552  1.24085222
## [2,]  0.3591651 -3.1410719 -1.7356328 -0.9467620  0.7305989 -0.33311664
## [3,] -0.8119582 -2.1308502 -2.3016521  1.2141625 -1.1622127  0.02138082
## [4,]  0.4565988 -0.7302583 -0.3510261  0.4313295 -0.3107310 -0.77770295
## [5,] -0.5634862  0.9252398 -1.1623495 -0.3908866 -0.2878309 -0.55746999
## [6,] -1.8007843  1.9620461  0.3761936  0.4348550  1.0217986  0.22534481
##              PC7
## [1,] -0.06347079
## [2,]  0.28391786
## [3,] -0.28716040
## [4,] -0.50668376
## [5,] -0.57917590
## [6,] -0.20456503
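
As a closing sanity check (a sketch under the assumption noted in the comments), the scores matrix can be reconstructed by hand, since with centering and scaling disabled prcomp() computes the scores as a plain matrix product of the data and the loadings:

# With center = FALSE and scale. = FALSE, scores = data %*% rotation.
# After moving Player to the row names, the 13 variables the PCA was
# fitted on are now columns 1:13 of data_90_bc.
scores_manual <- as.matrix(data_90_bc[, 1:13]) %*% goalies_pca$rotation
all.equal(scores_manual, goalies_pca$x, check.attributes = FALSE)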

Summary

In this paper I showed how to reduce the dimension of a dataset using Principal Component Analysis. For data with 120 observations and 13 columns the reduction may not noticeably improve clustering performance (in terms of computation time), but for a larger dataset it could make clustering feasible where it previously wasn't because of the size of the data. At the same time, it provides insight into which variables account for most of the variance among the goalkeepers.