We often judge the quality of a goalkeeper simply by how good he is at saving shots, or by spectacular saves that require amazing reflexes. In modern football, however, other factors determine how good a goalkeeper is. More and more often the man between the posts is required to act as an eleventh outfield player and to help defenders play the ball up the pitch under the opposing team's pressure. Joe Hart, who at the time was England's no. 1 goalkeeper, was axed from Manchester City's first team by Pep Guardiola because he couldn't pass the ball effectively. The Man City legend was then replaced by a goalkeeper who felt more comfortable with the ball at his feet. On the other hand, a team defending in a deep block needs a goalkeeper who is proficient at saving shots from distance, as well as dominant in the air to intercept the inevitable crosses into the box. A goalkeeper is therefore usually part of a system deployed by the club's manager, and on the transfer market a club looks for a player who will suit the team's tactics and style. Clubs using statistics to identify potential transfer targets may want to use clustering or other algorithms to find the right player. Nowadays, however, there is a vast number of statistics available, and we may not always know which ones distinguish players best. I will show how to use Principal Component Analysis (PCA) to reduce the dimension of a dataset and keep only the variables that explain the most variation among observations.
Our dataset comes from fbref.com (link). Note that all the metrics are converted to a per-90-minutes format; if you want to download the data yourself, check "Toggle per90 Stats" and then download the csv file. The data contains various statistics for all goalkeepers from the five big European leagues.
Advanced statistics such as xG (expected goals) and xA (expected assists) are supplied by StatsBomb. More information about what xG is and how it is calculated can be found under this link.
For data manipulation I will use the data.table package.
# Install data.table if it is not available yet, then load it
if (!require(data.table)) {
  install.packages("data.table")
  library(data.table)
}
Load the data.
data_90 <- as.data.table(read.csv("goalies_90.csv", encoding = "UTF-8"))
head(data_90)
## X.U.FEFF.Rk Player Nation Pos Squad
## 1: 1 Adrián\\Adrian es ESP GK Liverpool
## 2: 2 Rúnar Alex Rúnarsson\\Runar-Alex-Runarsson is ISL GK Arsenal
## 3: 3 Alisson\\Alisson br BRA GK Liverpool
## 4: 4 Saturnin Allagbé\\Saturnin-Allagbe bj BEN GK Dijon
## 5: 5 Nícolas Andrade\\Nicolas-Andrade br BRA GK Udinese
## 6: 6 Alphonse Areola\\Alphonse-Areola fr FRA GK Fulham
## Comp Age Born X90s GA PKA FK CK OG PSxG PSxG.SoT
## 1: eng Premier League 34-050 1987 2.0 4.50 0.00 0.00 1.00 0 3.55 0.45
## 2: eng Premier League 26-004 1995 0.2 0.00 0.00 0.00 0.00 0 0.50 0.04
## 3: eng Premier League 28-143 1992 21.0 1.14 0.19 0.05 0.14 0 1.21 0.34
## 4: fr Ligue 1 27-092 1993 4.0 2.00 0.00 0.25 0.00 0 1.52 0.30
## 5: it Serie A 32-316 1988 2.0 2.50 0.00 0.00 0.00 0 2.00 0.58
## 6: eng Premier League 27-361 1993 24.0 1.21 0.21 0.00 0.08 0 1.47 0.29
## PSxG... X.90 Cmp Att Cmp. Att.1 Thr Launch. AvgLen Att.2 Launch..1
## 1: -0.95 -0.93 4.50 12.50 36.0 24.0 7.00 37.5 36.0 7.00 50.0
## 2: 0.50 0.50 0.00 15.00 0.0 50.0 10.00 30.0 31.8 5.00 0.0
## 3: 0.07 0.07 3.62 7.81 46.3 28.6 4.67 21.8 30.4 4.90 32.0
## 4: -0.47 -0.48 5.00 13.30 37.7 25.8 5.00 40.8 35.1 6.75 40.7
## 5: -0.50 -0.48 3.00 6.50 46.2 15.5 2.50 25.8 32.1 4.50 55.6
## 6: 0.26 0.26 4.79 12.30 39.1 19.5 4.25 40.6 35.1 7.04 61.5
## AvgLen.1 Opp Stp Stp. X.OPA X.OPA.90 AvgDist Matches
## 1: 38.7 9.00 1.00 11.1 0.00 0.00 6.15 Matches
## 2: 30.0 20.00 0.00 0.0 0.00 0.00 87.50 Matches
## 3: 34.4 5.52 0.43 7.8 1.33 1.33 0.88 Matches
## 4: 34.9 8.75 0.00 0.0 1.25 1.25 4.10 Matches
## 5: 47.4 8.00 0.00 0.0 0.00 0.00 7.60 Matches
## 6: 45.0 10.00 1.21 12.1 0.63 0.63 0.60 Matches
Select only players who played at least five full matches (X90s >= 5), keep only complete rows, and finally select and rename the appropriate columns. I will also delete the first column because it is not needed. After this filtering there is no missing data, so there is no need to handle missing values. I will also fix the players' names, because each row contains two variants of the name. There are also two rows describing the same player, so I will keep only rows with a distinct player name.
setDT(data_90)
data_90 <- data_90[X90s >= 5, -1]  # at least five full matches; drop the Rk column
sum(is.na(data_90))                # check for missing values
## [1] 0
data_90[, Player := sub("\\\\.*", "", Player)]  # keep only the part before the backslash
library(tidyverse)  # for distinct()
data_90 <- distinct(data_90, Player, .keep_all = TRUE)
After inspecting all the variables I will also delete the ones that contain personal details, do not depend on the goalkeeper, or duplicate other columns. That is:
- Columns 2:6 - personal information about the player.
- Column 7 - number of full matches played by a player - not dependent on the goalkeeper.
- Columns 8:13 - statistics such as goals conceded or penalty kicks allowed, which depend mainly on the whole team.
- Column 14 - post-shot expected goals, a statistic depicting the total quality of the shots a goalkeeper was facing. It doesn't depend on the goalkeeper at all.
- Column 15 - post-shot expected goals per shot on target. The same statistic as the previous one, divided by the number of shots on target. Not needed either.
- Column 16 - post-shot expected goals minus goals allowed. This is an important variable because it reflects the true quality of a goalkeeper's shot-stopping: it compares the number of goals an average goalkeeper would have conceded from the shots faced with the actual number of goals conceded. However, it is an absolute value that scales with the number of minutes played, so I will keep only the next column, which is the same statistic per 90 minutes.
- Column 25 - goal kicks attempted. This also does not depend on the goalkeeper.
- Column 28 - opponent's attempted crosses into the penalty area. Same reasoning as above.
- Column 32 - the same information as in column 31.
We are left with the following columns (per 90 minutes where possible):
- Player name,
- Post shot expected goals minus goals allowed per 90 minutes (PSxG90),
- Long passes (longer than 40 yards) completed (long_cmpl),
- Long passes attempted (long_att),
- Long passes completion percentage (long_cmpl_perc),
- Passes attempted (passes_att),
- Throws attempted (throws_att),
- Percentage of passes that were launched (long passes) (launch_perc),
- Pass average length (pass_avg_length),
- Percentage of goal kicks that were launched (goal_kick_launch_perc),
- Average length of a goal kick (goal_kick_avg_length),
- Number of opponent's crosses into the penalty area successfully stopped by the goalkeeper (crosses_stp),
- Percentage of opponent's crosses into the penalty area successfully stopped by the goalkeeper (crosses_stp_perc),
- Number of defensive actions outside of the penalty area (out_penalty_area_actions90),
- Average distance from goal of all defensive actions (def_action_distance).
data_90 <- data_90[, c(1, 17:24, 26:27, 29:31, 33)]
setnames(data_90, c("Player", "PSxG90", "long_cmpl", "long_att", "long_cmpl_perc", "passes_att", "throws_att", "launch_perc", "pass_avg_length", "goal_kick_launch_perc", "goal_kick_avg_length", "crosses_stp", "crosses_stp_perc", "out_penalty_area_actions90", "def_action_distance"))
head(data_90)
## Player PSxG90 long_cmpl long_att long_cmpl_perc passes_att
## 1: Alisson 0.07 3.62 7.81 46.3 28.6
## 2: Alphonse Areola 0.26 4.79 12.30 39.1 19.5
## 3: Sergio Asenjo -0.20 5.50 11.70 47.1 26.8
## 4: Emil Audero -0.03 6.70 16.10 41.5 24.8
## 5: Édgar Badía 0.06 8.18 16.30 50.1 30.3
## 6: Oliver Baumann -0.08 6.14 14.50 42.4 32.1
## throws_att launch_perc pass_avg_length goal_kick_launch_perc
## 1: 4.67 21.8 30.4 32.0
## 2: 4.25 40.6 35.1 61.5
## 3: 5.46 31.5 34.2 47.5
## 4: 5.26 46.6 38.8 62.9
## 5: 4.95 36.0 34.1 57.8
## 6: 5.24 36.8 34.3 39.2
## goal_kick_avg_length crosses_stp crosses_stp_perc out_penalty_area_actions90
## 1: 34.4 0.43 7.8 1.33
## 2: 45.0 1.21 12.1 0.63
## 3: 41.6 1.25 11.5 0.42
## 4: 45.7 0.78 7.8 0.43
## 5: 39.3 0.68 6.5 0.73
## 6: 35.6 0.29 3.3 1.10
## def_action_distance
## 1: 0.88
## 2: 0.60
## 3: 0.55
## 4: 0.54
## 5: 0.60
## 6: 0.74
I will also standardize the data. Standardization is required because the variables are in different units (for example, long pass completion rate is a percentage while long passes completed is an absolute count), and without it we would give larger importance to variables with larger magnitude. To avoid this, each variable is transformed to have mean 0 and unit variance. At the same time I will create another data frame without player names (only numerical variables) for clustering.
data_90_cl <- as.data.frame(lapply(data_90[, 2:15], scale))  # numeric-only copy for clustering
data_90[, 2:15] <- as.data.frame(lapply(data_90[, 2:15], scale))
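The summary below will confirm the zero means; as an extra check of the unit variances (which summary() does not show), we can compute the standard deviations directly:
# After scaling, every column should have standard deviation 1
apply(data_90_cl, 2, sd)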
Summary of the data after tidying.
summary(data_90_cl)
## PSxG90 long_cmpl long_att long_cmpl_perc
## Min. :-2.80317 Min. :-2.3851 Min. :-2.42516 Min. :-1.93200
## 1st Qu.:-0.53338 1st Qu.:-0.6365 1st Qu.:-0.68364 1st Qu.:-0.64451
## Median :-0.01379 Median :-0.1342 Median :-0.05125 Median :-0.06549
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.66988 3rd Qu.: 0.5773 3rd Qu.: 0.53250 3rd Qu.: 0.65659
## Max. : 2.17396 Max. : 3.7072 Max. : 2.84805 Max. : 2.99995
## passes_att throws_att launch_perc pass_avg_length
## Min. :-2.08351 Min. :-3.602374 Min. :-2.1775 Min. :-2.0093
## 1st Qu.:-0.66658 1st Qu.:-0.536087 1st Qu.:-0.6554 1st Qu.:-0.7683
## Median :-0.08415 Median : 0.001777 Median :-0.1096 Median :-0.1079
## Mean : 0.00000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.43742 3rd Qu.: 0.686750 3rd Qu.: 0.4769 3rd Qu.: 0.5452
## Max. : 4.34921 Max. : 2.907164 Max. : 2.8261 Max. : 2.8239
## goal_kick_launch_perc goal_kick_avg_length crosses_stp
## Min. :-2.1091 Min. :-1.88539 Min. :-2.44399
## 1st Qu.:-0.7465 1st Qu.:-0.77529 1st Qu.:-0.69222
## Median : 0.1039 Median : 0.09879 Median :-0.08114
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6910 3rd Qu.: 0.64947 3rd Qu.: 0.59106
## Max. : 1.9981 Max. : 2.70358 Max. : 2.93354
## crosses_stp_perc out_penalty_area_actions90 def_action_distance
## Min. :-2.64529 Min. :-1.59106 Min. :-0.8759
## 1st Qu.:-0.66712 1st Qu.:-0.69033 1st Qu.:-0.5912
## Median :-0.06994 Median :-0.09494 Median :-0.4584
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.56456 3rd Qu.: 0.31726 3rd Qu.: 0.1962
## Max. : 2.87864 Max. : 3.90492 Max. : 3.3175
Before performing clustering I will also look more closely at the variables that describe goalkeepers. There are still many variables in the dataset; some may be correlated with each other, and some may not distinguish players well - in other words, they may explain little of the variance in the data. I will try to reduce the number of dimensions using Principal Component Analysis (PCA), transforming the data by projecting it into a smaller space while preserving most of the information. This will help us determine which original variables should be included when designing, for example, a clustering model, and may also reveal which variables explain the variance among goalkeepers the most.
In Principal Component Analysis, highly correlated variables are transformed into so-called principal components. Each component is a linear combination of the initial variables and is uncorrelated with the other components. PCA yields as many principal components as there are dimensions (variables); however, the first principal component captures the most variance, with each subsequent component explaining the maximum of the variance that remains. The main idea is to reduce the dimensionality without losing much information.
From a linear algebra point of view, principal components are the eigenvectors of the covariance matrix. They correspond to the directions in space along which there is the most variance. The eigenvalue of an eigenvector is its coefficient and describes the amount of variance explained along that direction.
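To make this concrete, here is a minimal sketch on toy data (not our goalkeeper dataset) showing that projecting onto the eigenvectors of the covariance matrix gives the same scores, up to sign, as prcomp():
# Toy demonstration: PCA via eigendecomposition of the covariance matrix
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 4))   # 50 observations, 4 standardized variables
eig <- eigen(cov(X))
eig$values                 # eigenvalues = variance along each component
head(X %*% eig$vectors)    # scores; match prcomp(X, center = FALSE)$x up to sign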
Before doing principal component analysis I will check for skewness in the data, because a high skewness can affect the results of PCA.
library(psych)
skew(data_90_cl)
## [1] -0.3040563 0.8641225 0.2928582 0.4978172 1.1472852 -0.3784273
## [7] 0.5068486 0.7355048 -0.1130899 0.1397338 0.5183499 0.5228872
## [13] 1.1177733 1.7868261
A general rule of thumb is that skewness above 1 or below -1 is considered high. We have 3 such variables in the dataset, and there are also a couple of variables whose skewness falls in the range of 0.5 to 1 or -0.5 to -1. Therefore I will transform the data with a Box-Cox transformation. (Note that the Box-Cox transformation is defined only for strictly positive values; caret estimates a separate lambda for each eligible column and leaves the other columns untouched.)
library(caret)
# Estimate Box-Cox lambdas on the numeric columns and apply the transformation
preProcessValues <- preProcess(data_90, method = "BoxCox")
data_90_bc <- as.data.frame(predict(preProcessValues, data_90))
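For intuition, the Box-Cox family is y = (x^lambda - 1) / lambda for lambda != 0 and y = log(x) for lambda = 0. A toy illustration of how it reduces right skew (box_cox() is a hypothetical helper written just for this sketch, not part of caret):
# Hypothetical helper illustrating the Box-Cox power transform
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
set.seed(2)
x <- rexp(1000)                  # strongly right-skewed, strictly positive sample
skew(x)                          # high positive skewness
skew(box_cox(x, lambda = 0.3))   # noticeably reduced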
For the principal component analysis I will use the prcomp() function from base R, together with the factoextra package for visualization. The scale. parameter rescales the variables to unit variance and center shifts them to mean 0; since our data is already standardized, both are redundant and would not change anything, so I will set both parameters to FALSE.
library(factoextra)
goalies_pca <- prcomp(data_90_bc[2:14], scale. = FALSE, center = FALSE)
We can quickly see the eigenvalues of the calculated principal components with the get_eigenvalue() function. Each eigenvalue reflects the variance explained by the corresponding principal component, and the next column gives the proportion of the total variance explained. In our case the first principal component explains almost 43.2% of the variance, the second 13.9%, and the third 12.7%.
eigen_values <- get_eigenvalue(goalies_pca)
eigen_values
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.620758130 43.23660100 43.23660
## Dim.2 1.810924186 13.93018605 57.16679
## Dim.3 1.654756073 12.72889287 69.89568
## Dim.4 1.103385048 8.48757729 78.38326
## Dim.5 0.945148163 7.27037048 85.65363
## Dim.6 0.807252992 6.20963840 91.86327
## Dim.7 0.570870189 4.39130914 96.25458
## Dim.8 0.297848168 2.29113975 98.54571
## Dim.9 0.111901711 0.86078239 99.40650
## Dim.10 0.047019785 0.36169065 99.76819
## Dim.11 0.016689035 0.12837720 99.89657
## Dim.12 0.007484392 0.05757225 99.95414
## Dim.13 0.005962128 0.04586252 100.00000
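As a sanity check, each eigenvalue is simply the variance of the corresponding column of PC scores, i.e. the squared standard deviation reported by prcomp():
# Eigenvalues equal the variances of the principal component scores
head(goalies_pca$sdev^2)
head(apply(goalies_pca$x, 2, var))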
For better understanding we can also visualize the eigenvalues with a barplot. As can be seen, after the 3rd PC there is a significant drop in the variance explained.
fviz_eig(goalies_pca)
A natural question now is how many principal components to retain, and there is no single answer. One approach, suggested by Kaiser (1960), is to keep all the PCs with an eigenvalue greater than 1; there are 4 such PCs in our case. Another possibility is to set a subjective cut-off point for the total variance explained. Our dataset doesn't have a large number of observations, so there is no need to reduce the dimensionality drastically. I will therefore keep 7 principal components, which together explain about 96% of the variance, leaving out the 6 PCs with the lowest eigenvalues.
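Both heuristics can be read straight off the eigen_values table computed above; a minimal sketch:
# Kaiser rule: count components with eigenvalue greater than 1
sum(eigen_values$eigenvalue > 1)
# Variance cut-off: first component at which cumulative variance reaches 95%
which(eigen_values$cumulative.variance.percent >= 95)[1]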
With the corrplot package we can visualize the quality of representation of variables (cos2) in each PC. A high cos2 means that the variable is well represented by a given principal component. For each variable, cos2 sums to 1 across all components, so if a variable is highly represented in PC1 it will not feature much in the other PCs. As we can see, some variables load mostly on PC1 and PC2 (which explain the most variance): long passes completed (long_cmpl), long passes attempted (long_att), percentage of passes that were launched (launch_perc), pass average length (pass_avg_length), percentage of goal kicks that were launched (goal_kick_launch_perc), and goal kick average length (goal_kick_avg_length). These variables mainly capture whether a goalkeeper prefers to play short passes to defenders or to play long balls higher up the pitch.
library("corrplot")
var <- get_pca_var(goalies_pca)
corrplot(var$cos2[, 1:7], is.corr=FALSE)
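As a quick check of the property mentioned above, each variable's cos2 values sum to 1 across all 13 components:
# cos2 per variable sums to 1 over all principal components
round(rowSums(var$cos2), 3)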
With the same function we can also see how much each variable contributes to the variability of each principal component. The most important variables are those that contribute to the first few PCs. The variables that contribute the most to PC1 (which explains around 43% of the variance) are the ones mentioned previously - those connected to passing.
corrplot(var$contrib[, 1:7], is.corr = FALSE)
# Contributions of variables to PC1
a <- fviz_contrib(goalies_pca, choice = "var", axes = 1, top = 7)
# Contributions of variables to PC2
b <- fviz_contrib(goalies_pca, choice = "var", axes = 2, top = 7)
library(gridExtra)  # provides grid.arrange()
grid.arrange(a, b, top = "Contribution to PC1 and PC2")
It is also possible to visualize the relationships between variables with a correlation circle. In such a circle, positively correlated variables point in a similar direction and negatively correlated variables point in opposite directions. Variables close to the center of the plot are less important to the first PCs, while those far from the center are well represented.
library(RColorBrewer)
fviz_pca_var(goalies_pca,
             col.var = "contrib",
             gradient.cols = brewer.pal(3, "RdYlBu"),
             repel = TRUE)
It confirms that many of the variables describing passing style and bringing the ball into play are correlated with each other (but, surprisingly, not passes attempted - passes_att). The percentage of passes that were launched and the average pass length are almost perfectly correlated, which is perhaps no surprise: the more passes are launched, the higher the average pass length will be.
I will now look at how individual observations contribute to the first two principal components. To do that, I first need to move the player names into row names and perform PCA once again on the new data frame.
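Given the plot, we would expect the plain correlation between these two variables to come out close to 1; a quick check on the transformed data:
# Correlation between launch percentage and average pass length
cor(data_90_bc$launch_perc, data_90_bc$pass_avg_length)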
data_90_bc <- data_90_bc %>%
  remove_rownames() %>%
  column_to_rownames(var = "Player")
goalies_pca2 <- prcomp(data_90_bc, scale. = FALSE, center = FALSE)
fviz_contrib(goalies_pca2, choice = "ind", axes = 1:2, top = 20)
The goalkeeper who contributes the most to the first two components is Sergio Herrera of CA Osasuna, followed by Stefan Ortega (Arminia Bielefeld) and David Soria (Getafe CF).
It is also possible to plot the quality of representation of the observations in the first two PCs in a coordinate system.
fviz_pca_ind(goalies_pca2)
I will also check quality measures of the principal components. First I will look into complexity. Complexity describes how many variables account for a composite variable; generally, the closer the complexity is to 1 the better, because simpler composite variables are easier to interpret. To compute the complexity statistics I need to run PCA once again, this time with the principal() function from the psych package.
library(psych)
goalies_pca3 <- principal(data_90_bc, nfactors = 7, rotate = "varimax")
goalies_pca3$complexity
## PSxG90 long_cmpl
## 1.020857 1.779176
## long_att long_cmpl_perc
## 1.181026 1.214779
## passes_att throws_att
## 1.329956 2.563240
## launch_perc pass_avg_length
## 1.142168 1.148086
## goal_kick_launch_perc goal_kick_avg_length
## 1.476335 1.477736
## crosses_stp crosses_stp_perc
## 1.229179 1.032169
## out_penalty_area_actions90 def_action_distance
## 1.055419 1.055587
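For reference, this measure is Hofmann's complexity index: the squared sum of a variable's squared rotated loadings divided by the sum of their fourth powers. A minimal sketch reproducing the first value (PSxG90) by hand:
# Hofmann's complexity index computed from the rotated loadings
l1 <- goalies_pca3$loadings[1, ]
sum(l1^2)^2 / sum(l1^4)   # should match the PSxG90 value above (~1.02)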
Uniqueness can be described as the proportion of a variable's variance not shared with the other variables. For PCA we want it to be low, because if much of the variance is shared among variables, it is easier to reduce the dimensionality. In our case the uniqueness of all variables is relatively low - the highest is for the throws_att variable, at about 0.227. This means that variable carries only about 22.7% of additional information relative to the other variables in the model.
goalies_pca3$uniqueness
## PSxG90 long_cmpl
## 0.006123207 0.023728814
## long_att long_cmpl_perc
## 0.017404107 0.108396564
## passes_att throws_att
## 0.108680336 0.226812720
## launch_perc pass_avg_length
## 0.069507142 0.059985108
## goal_kick_launch_perc goal_kick_avg_length
## 0.120423485 0.145547705
## crosses_stp crosses_stp_perc
## 0.061213188 0.062577773
## out_penalty_area_actions90 def_action_distance
## 0.025188566 0.009110220
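Equivalently, uniqueness is 1 minus the communality, i.e. one minus the sum of a variable's squared rotated loadings across the 7 retained components; a quick check:
# Uniqueness = 1 - communality (sum of squared rotated loadings)
head(1 - rowSums(goalies_pca3$loadings^2))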
It is also possible to show both complexity and uniqueness in one plot.
library(maptools)  # provides pointLabel()
{plot(goalies_pca3$complexity, goalies_pca3$uniqueness, xlim = c(0, 4))
 pointLabel(goalies_pca3$complexity, goalies_pca3$uniqueness,
            labels = names(goalies_pca3$uniqueness), cex = 0.6)}
The most problematic variables would be those in the top right corner, but no variable here is clearly problematic. throws_att has the highest complexity (about 2.6) and also the highest uniqueness (about 0.23), yet both values are still moderate. There is therefore probably no need to exclude any variables from the analysis.
After performing the Principal Component Analysis, this is the data that can be used for clustering: 7 principal components instead of the original 13 variables, explaining about 96% of the variance overall.
head(goalies_pca$x[, 1:7])
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -3.0775995 0.1437375 -0.7731065 -0.7011042 1.5735552 1.24085222
## [2,] 0.3591651 -3.1410719 -1.7356328 -0.9467620 0.7305989 -0.33311664
## [3,] -0.8119582 -2.1308502 -2.3016521 1.2141625 -1.1622127 0.02138082
## [4,] 0.4565988 -0.7302583 -0.3510261 0.4313295 -0.3107310 -0.77770295
## [5,] -0.5634862 0.9252398 -1.1623495 -0.3908866 -0.2878309 -0.55746999
## [6,] -1.8007843 1.9620461 0.3761936 0.4348550 1.0217986 0.22534481
## PC7
## [1,] -0.06347079
## [2,] 0.28391786
## [3,] -0.28716040
## [4,] -0.50668376
## [5,] -0.57917590
## [6,] -0.20456503
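As a usage sketch of where this leads, the retained components could feed directly into a clustering algorithm such as k-means; the choice of 4 clusters below is arbitrary and purely illustrative:
# Hypothetical next step: k-means on the 7 retained principal components
set.seed(42)
km <- kmeans(goalies_pca2$x[, 1:7], centers = 4, nstart = 25)
table(km$cluster)   # cluster sizes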
In this article I showed how to reduce the dimension of a dataset using Principal Component Analysis. For data with 120 observations and 13 columns the reduction may not improve clustering performance (in terms of computation time), but for a larger dataset it could make clustering feasible where the sheer size of the data previously made it impossible. At the same time, it provides insight into which variables account for most of the variance among the goalkeepers.