In this report, football match data are investigated with a view to possible dimension reduction. The data cover the English Premier League seasons 2014/2015 to 2020/2021 and were scraped from a website providing sports event statistics. After removing the identifier columns and one redundant variable, the data set has 2660 observations and 23 variables. Each observation holds the statistics of a single match.
The methods used are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).
setwd("C:/Users/bpop/OneDrive/R/USL/project/dimRed")
library(readr)
matches_data <- read_csv("matches-data.csv")
matches_data
## # A tibble: 2,660 x 29
## MatchID MatchDate Week HomeTeam AwayTeam HomeGoalsHT AwayGoalsHT HomeGoalsFT AwayGoalsFT HomeBallPos AwayBallPos HomeShotsOffTarget AwayShotsOffTarget HomeShotsOnTarget AwayShotsOnTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Aug 16, 2014 1 Man Utd Swansea 0 1 1 2 60% 40% 5 0 5 4 4 1 4 0 86 80 20 10 14 20 2 4 0 0
## 2 1 Aug 16, 2014 1 QPR Hull City 0 0 0 1 51% 49% 7 3 6 4 6 4 8 9 77 76 30 15 10 10 1 2 0 0
## 3 2 Aug 16, 2014 1 Stoke Aston Villa 0 0 0 1 63% 37% 4 4 2 1 6 2 2 8 84 68 30 9 14 9 0 3 0 0
## 4 3 Aug 16, 2014 1 West Brom Sunderland 1 1 2 2 58% 42% 5 2 5 2 0 3 6 3 80 75 16 15 18 9 3 1 0 0
## 5 4 Aug 16, 2014 1 Leicester City Everton 1 2 2 2 37% 63% 5 5 3 3 3 5 3 6 77 84 27 14 16 10 1 1 0 0
## 6 5 Aug 16, 2014 1 West Ham Tottenham 0 0 0 1 47% 53% 10 2 4 4 4 4 8 5 83 80 15 12 12 10 2 0 0 1
## 7 6 Aug 16, 2014 1 Arsenal Crystal Palace 1 1 2 1 76% 24% 5 0 6 2 3 2 9 3 88 57 23 17 13 19 2 3 0 0
## 8 7 Aug 17, 2014 1 Liverpool Southampton 1 0 2 1 56% 44% 5 4 5 6 2 2 2 6 86 77 23 14 8 11 1 2 0 0
## 9 8 Aug 17, 2014 1 Newcastle Man City 0 1 0 2 44% 56% 9 5 0 5 3 3 3 3 83 86 14 16 8 11 1 5 0 0
## 10 9 Aug 18, 2014 1 Burnley Chelsea 1 3 1 3 39% 61% 6 4 2 3 1 4 4 3 70 82 27 20 6 7 1 1 0 0
## # ... with 2,650 more rows
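The data set contains statistics from Premier League football matches. It covers 7 seasons, 2014/2015 to 2020/2021, which gives 2660 observations in total. The data were gathered with a self-built web scraper. Available variables are: MatchID, MatchDate, Week, HomeTeam and AwayTeam.
The remaining variables carry the prefix ‘Home’ or ‘Away’, which indicates the team. Their names without the prefix are: GoalsHT (goals at half time), GoalsFT (goals at full time), BallPos (ball possession), ShotsOffTarget, ShotsOnTarget, BlockedShots, Corners, PassSuccPerc (percentage of successful passes), AerialsWon, Fouls, YellowCards and RedCards.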
There are no missing values in the data set.
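A check of this kind can be sketched in base R. The small data frame below is a hypothetical stand-in with one missing value planted on purpose; on the real table the same two calls would be applied to `matches_data`.

```r
# Toy stand-in for the match table (hypothetical values, two columns only)
toy <- data.frame(
  HomeFouls = c(14, 10, NA, 18),
  AwayFouls = c(20, 10, 9, 9)
)

# TRUE if at least one value is missing anywhere in the table
has_missing <- anyNA(toy)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

print(has_missing)
print(missing_per_column)
```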
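Ball possession is stored as text with a percent sign, so it is converted to a numeric fraction. The pass success percentage is rescaled to a fraction as well. The identifier columns are then dropped.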
# Change ball possession from % to numbers
matches_data$HomeBallPos = sub('%', '', matches_data$HomeBallPos)
matches_data$AwayBallPos = sub('%', '', matches_data$AwayBallPos)
matches_data$HomeBallPos = as.numeric(matches_data$HomeBallPos)/100
matches_data$AwayBallPos = as.numeric(matches_data$AwayBallPos)/100
# Change successful pass percentage to numbers
matches_data$HomePassSuccPerc = matches_data$HomePassSuccPerc/100
matches_data$AwayPassSuccPerc = matches_data$AwayPassSuccPerc/100
matches_data_clear = subset(matches_data, select= -c(MatchID, MatchDate, Week, HomeTeam, AwayTeam))
library(corrplot)
testRes = cor.mtest(matches_data_clear, conf.level = 0.95) # Significance of the correlation
corrplot(cor(matches_data_clear), p.mat = testRes$p, type = 'lower', method = 'number',
insig = 'blank', tl.cex=1)
Home team ball possession and away team ball possession are perfectly negatively correlated, because the two shares always sum to 100%. One of them can be discarded from the data set in advance, since they carry the same information.
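Why the correlation is exactly -1 can be illustrated with a few made-up possession shares: when one variable is a constant minus the other, the two are linearly dependent. The values below are hypothetical and only serve the illustration.

```r
# Hypothetical possession shares of the home team (fractions of 1)
home_pos <- c(0.60, 0.51, 0.63, 0.58, 0.37)

# The away team holds the remainder of the possession
away_pos <- 1 - home_pos

# Pearson correlation between the two shares
r <- cor(home_pos, away_pos)
print(r)
```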
matches_data_clear = subset(matches_data_clear, select= -c(AwayBallPos))
corrplot(cor(matches_data_clear), type = 'lower', method = 'number')
Some variables are highly correlated, for example goals scored in the first half and goals scored over the whole match.
Ball possession is highly correlated with many other variables. The same holds for the percentage of successful passes.
A sub-triangle of higher correlations between variables is also visible in the correlation matrix.
These correlations suggest that dimension reduction methods may give reasonable results.
First, the data must be standardized so that each variable has mean 0 and standard deviation 1.
library(caret)
preproc = preProcess(matches_data_clear, method=c("center", "scale"))
matches_data_scaled = predict(preproc, matches_data_clear)
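As a rough cross-check, the same kind of standardization can be sketched with base R's `scale()`. The built-in `mtcars` data are used here only as a stand-in for the match table; that swap is an assumption made for the example.

```r
# Stand-in data: three numeric columns of the built-in mtcars data set
x <- mtcars[, c("mpg", "hp", "wt")]

# Subtract the column mean and divide by the column standard deviation
x_scaled <- scale(x, center = TRUE, scale = TRUE)

# After standardization every column should have mean 0 and sd 1
col_means <- colMeans(x_scaled)
col_sds <- apply(x_scaled, 2, sd)

print(round(col_means, 10))
print(col_sds)
```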
PCA reduces the dimensionality of the data set by exploiting the correlations between the variables. It creates new variables (Principal Components) that are linear combinations of the original variables.
The function ‘prcomp’ is used, which relies on Singular Value Decomposition (SVD). This approach is preferred over the spectral decomposition used by ‘princomp’, since SVD has slightly better numerical accuracy.
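The link between the two decompositions can be checked on a small example. The sketch below uses the built-in `USArrests` data as a stand-in (an assumption made for the illustration) and compares the SVD-based `prcomp` result with a direct eigendecomposition of the correlation matrix via `eigen()`.

```r
# Stand-in data: standardized USArrests (4 variables)
x <- scale(USArrests)

# SVD-based PCA on already standardized data
pca_svd <- prcomp(x, center = FALSE, scale. = FALSE)

# Spectral (eigen) decomposition of the correlation matrix
eig <- eigen(cor(USArrests))

# Variances of the PCs should match the eigenvalues
same_variances <- isTRUE(all.equal(pca_svd$sdev^2, eig$values))

# Loadings should match up to the sign of each column
same_loadings <- isTRUE(all.equal(abs(unname(pca_svd$rotation)),
                                  abs(eig$vectors)))

print(same_variances)
print(same_loadings)
```

On well-conditioned data such as this the two routes agree to numerical precision, and the sign of each component is arbitrary in both.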
pca_res = prcomp(matches_data_scaled, center = FALSE, scale. = FALSE) # stats::prcomp; the data are already standardized
pca_res
## Standard deviations (1, .., p=23):
## [1] 2.1245802 1.5603168 1.4482478 1.2589232 1.2285663 1.0912092 1.0462745 0.9912979 0.9708524 0.8720450 0.8582799 0.8335456 0.8043428 0.7487735 0.7255737 0.7166255 0.6893248 0.6800296 0.6358132 0.5764119 0.5053158 0.4897157 0.2010097
##
## Rotation (n x k) = (23 x 23):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23
## HomeGoalsHT 0.059544346 -0.379695972 0.237050978 0.394275716 -0.13986616 0.002417953 -0.053676350 -0.0130846396 0.058020684 0.036359043 3.447079e-03 -0.0892608655 0.014959825 -0.178126272 0.037369490 -0.21173280 -0.013564543 -0.469315829 0.217574956 -0.108065258 0.090945938 0.489326814 -0.0004447808
## AwayGoalsHT -0.099570964 -0.013787376 -0.509064429 0.266374600 -0.17368060 -0.065829024 -0.091003031 -0.0057889492 -0.082709201 0.013035594 -8.061514e-02 -0.0522777541 0.053518793 -0.163382885 0.270132798 -0.05063956 -0.410912063 0.100407095 -0.153302282 0.049463785 -0.522979760 0.131360455 0.0011183307
## HomeGoalsFT 0.142130148 -0.400820510 0.219393281 0.377320412 -0.15364070 -0.054142586 -0.038402430 -0.0028443560 0.026932135 0.008566551 -1.774004e-02 -0.0539031678 0.023806183 -0.035759888 -0.036567210 -0.05878143 0.053814899 0.018685007 -0.081118224 0.036128252 -0.166659954 -0.742031846 0.0210669358
## AwayGoalsFT -0.162672173 -0.028437578 -0.516067764 0.256754200 -0.19068626 -0.005944587 -0.048157969 0.0008015997 -0.031137579 -0.003410438 -6.461066e-02 -0.0359139159 0.012179888 -0.007449871 0.049321068 0.03696590 -0.020127164 -0.001280146 0.003523524 -0.074444817 0.745834959 -0.170310295 -0.0088881717
## HomeBallPos 0.405491298 0.052710897 -0.112608477 0.007940463 -0.01269872 0.053742254 0.043171618 0.0046337365 -0.021668849 -0.475575240 -1.332786e-01 -0.0779104841 0.042816708 0.027590389 0.025666885 0.01486725 0.058572963 -0.067956435 0.037030791 -0.017545898 -0.021258272 -0.002099574 -0.7422847726
## HomeShotsOffTarget 0.256823782 0.118329437 -0.099825099 -0.019313481 -0.05987413 -0.014522860 0.221180782 0.0103067797 -0.101417504 0.333215490 1.703674e-01 -0.6708491327 -0.473931536 -0.022071463 -0.051475719 -0.04541691 0.073984285 -0.048081060 -0.135492599 0.004403869 -0.006315406 0.004718344 -0.0214707640
## AwayShotsOffTarget -0.235061762 0.003402868 0.139171406 -0.028271245 -0.07790133 0.019683711 0.277544318 -0.0661587739 -0.051838298 0.057710746 -8.845138e-01 -0.1338169633 -0.108965514 0.020375010 -0.013242903 0.11820282 0.017824157 -0.027344758 0.032540890 -0.021050650 -0.036231835 0.001392992 0.0097509750
## HomeShotsOnTarget 0.265515676 -0.259728249 0.112860078 0.271892297 -0.09388999 -0.105506763 0.150723232 -0.0229370945 0.087404152 0.084686716 -1.464625e-03 0.1129511736 -0.014918620 0.168597540 -0.071662449 0.35230934 0.027868316 0.592694204 -0.216399987 0.058966316 0.107657463 0.364162335 -0.0118042422
## AwayShotsOnTarget -0.254114983 -0.048787688 -0.341225504 0.188275201 -0.15000745 -0.014254082 0.114601590 0.0271515806 -0.003254077 -0.074555753 1.017595e-01 0.1131497366 -0.127052401 0.327774571 -0.356286730 0.04908369 0.550842406 -0.100504924 0.208477986 0.039748506 -0.315739722 0.068296173 0.0020783556
## HomeBlockedShots 0.265994555 0.097350761 -0.136590424 -0.012087175 -0.02718337 -0.113444916 0.414213674 -0.0898102072 0.076707222 0.223615194 -9.697300e-02 0.3878174451 0.014899128 -0.143947778 0.038260220 -0.65079657 0.138051039 0.150990605 0.028392491 -0.048413469 0.028151155 -0.016658645 0.0024607307
## AwayBlockedShots -0.266317106 -0.046071691 0.096193562 0.014905334 -0.03354951 0.034261750 0.495902518 -0.0507477877 -0.113331109 -0.249326683 2.727549e-01 -0.0412523762 -0.022916901 -0.413476847 0.421816860 0.22703737 0.197204125 0.142885598 0.179813100 -0.143060414 -0.021549567 -0.044970222 0.0048168863
## HomeCorners 0.292770099 0.111485653 -0.115197545 0.003658071 -0.06585432 -0.074136112 0.340956817 -0.0802963219 0.109944547 0.266560651 5.998230e-02 0.2948680208 0.082721767 0.027234349 -0.009631626 0.51009860 -0.231661611 -0.481105117 0.091242828 0.074338037 -0.034833428 -0.115601589 0.0302929841
## AwayCorners -0.271500787 -0.032668073 0.163495914 0.044511910 -0.10601441 0.040173210 0.445875397 -0.0036571296 -0.106003074 -0.300436815 1.927095e-01 0.0103746054 -0.048610622 0.369230745 -0.157680127 -0.24113930 -0.425920697 -0.123888435 -0.300057218 0.170374026 0.087222405 0.014131863 -0.0276008809
## HomePassSuccPerc 0.318861751 -0.220552100 -0.206361467 -0.132119313 0.14202394 0.069568401 0.071827727 -0.0243624601 -0.032306965 -0.378751255 -1.164250e-01 -0.1294659576 0.025653708 -0.031331225 0.087437417 -0.03370898 0.084705253 -0.046066972 0.112890050 0.564780983 0.054595663 0.042674158 0.4773108009
## AwayPassSuccPerc -0.301110558 -0.322692176 -0.031224687 -0.148547520 0.16604501 -0.009452447 -0.001216707 -0.0459779758 0.016689812 0.386298991 6.098834e-02 0.0399743385 -0.009611503 -0.113037144 0.061469085 -0.01690345 0.020313348 0.016762153 0.060439029 0.596210394 0.053957673 0.001510258 -0.4667988159
## HomeAerialsWon 0.004932741 0.418529362 0.203891424 0.211043288 -0.29917761 -0.085999549 -0.052355519 0.0899698213 0.066355108 -0.012288198 3.702650e-02 -0.0837671775 -0.041409584 0.009853931 -0.077449633 -0.04485985 -0.228293144 0.248998748 0.623476049 0.320070841 0.027970983 -0.043491230 -0.0153196479
## AwayAerialsWon -0.041983297 0.430840063 0.181589801 0.200167460 -0.26622072 -0.054429487 -0.078267928 0.0406111309 0.029627501 0.001607745 -1.900775e-02 -0.0007657784 0.234904102 -0.125894725 0.174304494 0.01290224 0.387370189 -0.180422093 -0.490383622 0.354662683 0.036600043 0.069733752 0.0083145078
## HomeFouls -0.069764137 0.103289383 0.039199983 0.228757933 0.40146801 -0.471290635 -0.083026632 -0.0298357168 0.282563721 -0.132405385 -3.703711e-02 0.1064858587 -0.460178682 0.232581530 0.389435886 -0.01003629 0.017781898 -0.095358263 -0.009130458 0.026219786 0.023875143 -0.032385284 -0.0043934417
## AwayFouls 0.013774086 0.168723673 0.024801858 0.342818582 0.28762775 0.488003822 -0.061822463 -0.0497072042 -0.165901646 -0.059197438 -3.064346e-02 0.2982038603 -0.431776086 -0.348198416 -0.260392611 0.05989367 -0.064591273 0.005496845 -0.103513330 0.104682448 0.012275923 -0.006022789 0.0036631162
## HomeYellowCards -0.091623920 0.110567533 -0.062905434 0.212975815 0.45594867 -0.399850623 0.157346975 -0.0835345377 0.075757179 -0.068292896 -7.054138e-06 -0.2619225117 0.375918933 -0.292964360 -0.469657451 0.02164501 -0.044015994 0.022849470 -0.025353095 -0.014373586 0.001324531 0.007095374 -0.0010229615
## AwayYellowCards 0.040822966 0.122727742 0.006642855 0.334963653 0.39818772 0.406858282 0.118093585 -0.0281040422 -0.156373760 0.222730790 2.909941e-04 -0.1603495410 0.365983731 0.419666000 0.306148175 -0.05658165 0.060605416 0.034267300 0.136961097 -0.029031400 -0.023487262 -0.020789520 -0.0023718045
## HomeRedCards -0.070458468 -0.021015616 -0.080563530 -0.003441233 0.03606542 0.322839602 0.147825662 0.5276394492 0.746076986 -0.016623814 -2.053443e-02 -0.1015467080 0.021260206 -0.079879153 -0.012353206 -0.01362548 -0.035644530 0.023251817 -0.063244515 -0.020461046 -0.031603842 -0.016392168 0.0009638201
## AwayRedCards 0.062075458 -0.044938114 0.018778675 0.027792818 0.10936630 -0.227000077 0.059400091 0.8215430006 -0.473457189 0.057057485 -5.671925e-02 0.1268859700 -0.003122419 -0.024301843 0.015066856 0.02051803 0.009308522 -0.043635472 0.013897869 0.006904905 0.026915854 0.010119734 -0.0031156228
The first Principal Component combines the original variables so that statistics describing offensive aspects of the game (ball possession, shots, corners, pass success) enter with a positive sign for the home team and a negative sign for the away team. The defensive aspects of the game, such as fouls and cards, enter with a negative sign for the home team and a positive sign for the away team, although their loadings are small (below 0.1 in absolute value). This component can be read as an indicator of how much the home team dominated the game.
The second Principal Component gives goals a negative sign, with home goals weighing much more than away goals. Shots on target are also negative (again home shots matter more than away shots), as is the pass success percentage for both teams. Aerial duels won have the largest positive loadings, while fouls and yellow cards are positive with moderate loadings. Home shots off target are positive as well. This component can be described as an indicator of whether the match was more offensive or more defensive in nature.
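How a loading vector turns into a component score can be sketched as follows. The example again assumes the built-in `USArrests` data as a stand-in; the point is only that each score is the weighted sum of the standardized variables, with the loadings as weights.

```r
# Stand-in data: standardized USArrests
x <- scale(USArrests)
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Loadings (weights) of the first Principal Component
loadings_pc1 <- pca$rotation[, 1]

# Score of every observation on PC1 as a linear combination of the variables
pc1_manual <- as.vector(x %*% loadings_pc1)

# The manual scores should agree with the scores stored by prcomp
scores_match <- isTRUE(all.equal(pc1_manual, unname(pca$x[, 1])))

# Each loading vector has unit length
unit_length <- isTRUE(all.equal(sum(loadings_pc1^2), 1))

print(scores_match)
print(unit_length)
```

A variable with a positive loading pushes the score up when it is above its mean, and one with a negative loading pushes it down, which is what the sign-based reading of the components relies on.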
In 1961 Henry Kaiser proposed using the eigenvalues from PCA to decide how many Principal Components to keep. Under his rule, only components with an eigenvalue greater than 1 are retained: with standardized variables each original variable has variance 1, so such a component explains more variance than any single original variable.
The percentage of variance explained by the first 10 PCs is also presented as a graph.
library(factoextra)
# summary(pca_res)
eig.val <- get_eigenvalue(pca_res)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.51384091 19.6253953 19.62540
## Dim.2 2.43458847 10.5851673 30.21056
## Dim.3 2.09742176 9.1192250 39.32979
## Dim.4 1.58488773 6.8908162 46.22060
## Dim.5 1.50937523 6.5625010 52.78310
## Dim.6 1.19073752 5.1771197 57.96022
## Dim.7 1.09469041 4.7595235 62.71975
## Dim.8 0.98267157 4.2724851 66.99223
## Dim.9 0.94255448 4.0980629 71.09030
## Dim.10 0.76046255 3.3063589 74.39665
## Dim.11 0.73664446 3.2028020 77.59946
## Dim.12 0.69479819 3.0208617 80.62032
## Dim.13 0.64696730 2.8129013 83.43322
## Dim.14 0.56066171 2.4376596 85.87088
## Dim.15 0.52645719 2.2889443 88.15982
## Dim.16 0.51355211 2.2328352 90.39266
## Dim.17 0.47516865 2.0659506 92.45861
## Dim.18 0.46244022 2.0106097 94.46922
## Dim.19 0.40425841 1.7576453 96.22686
## Dim.20 0.33225072 1.4445683 97.67143
## Dim.21 0.25534403 1.1101914 98.78162
## Dim.22 0.23982149 1.0427021 99.82433
## Dim.23 0.04040488 0.1756734 100.00000
According to this rule, the first 7 PCs should be retained in the analysis.
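The count can be reproduced directly from the eigenvalues printed above. The sketch below copies them into a vector by hand (the variable names are chosen for this example only) and applies the threshold of 1.

```r
# Eigenvalues copied from the table above (Dim.1 to Dim.23)
eigenvalues <- c(4.51384091, 2.43458847, 2.09742176, 1.58488773, 1.50937523,
                 1.19073752, 1.09469041, 0.98267157, 0.94255448, 0.76046255,
                 0.73664446, 0.69479819, 0.64696730, 0.56066171, 0.52645719,
                 0.51355211, 0.47516865, 0.46244022, 0.40425841, 0.33225072,
                 0.25534403, 0.23982149, 0.04040488)

# With standardized variables the eigenvalues sum to the number of variables
total_variance <- sum(eigenvalues)

# Kaiser rule: keep the components whose eigenvalue exceeds 1
n_kaiser <- sum(eigenvalues > 1)

print(total_variance)
print(n_kaiser)
```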
Another approach is to keep as many PCs as are needed to account for a chosen share of the total variance, usually 70-80%.
fviz_eig(pca_res, addlabels = TRUE)
pca_summary <- summary(pca_res)
plot(pca_summary$importance[3, ], type = "l") # row 3 holds the cumulative proportion of variance
The cumulative variance plot grows relatively slowly. This criterion points to 9 PCs if 70% of the variance is the target and to 12 PCs if 80% is considered enough.
Comparing these results, 12 PCs should be chosen as the safe option. This roughly halves the number of variables compared with the original data set.
The first two PCs account for only about 30% of the total variance. The results below are nevertheless shown only for them, to keep the presentation clear.
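The two cut-offs can be recomputed from the standard deviations printed with the PCA result. The sketch below copies those values by hand and looks for the first component at which the cumulative share of variance reaches each target; the helper variable names are assumptions of this example.

```r
# Standard deviations of the 23 PCs, copied from the prcomp output above
sdev <- c(2.1245802, 1.5603168, 1.4482478, 1.2589232, 1.2285663, 1.0912092,
          1.0462745, 0.9912979, 0.9708524, 0.8720450, 0.8582799, 0.8335456,
          0.8043428, 0.7487735, 0.7255737, 0.7166255, 0.6893248, 0.6800296,
          0.6358132, 0.5764119, 0.5053158, 0.4897157, 0.2010097)

# Variance of each PC and cumulative share of the total variance
variances <- sdev^2
cumulative_share <- cumsum(variances) / sum(variances)

# Smallest number of PCs that reaches each target share
n_for_70 <- which(cumulative_share >= 0.70)[1]
n_for_80 <- which(cumulative_share >= 0.80)[1]

print(n_for_70)
print(n_for_80)
```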
fviz_pca_var(pca_res, col.var="contrib", repel = TRUE)
Statistics for the home team tend to be grouped on the right side of Dim1, while those for the away team lie on the opposite side.
The brighter the color of the arrow, the higher the contribution of the variable to the PCs.
Contributions of the variables to the PCs are shown precisely in the graphs below.
Note: The red horizontal reference line corresponds to the expected value if the contributions were uniform.
# Contributions of variables to PC1
fviz_contrib(pca_res, choice = "var", axes = 1)
# Contributions of variables to PC2
fviz_contrib(pca_res, choice = "var", axes = 2)
# Contributions of variables to PC1 and PC2
fviz_contrib(pca_res, choice = "var", axes = 1:2)
For the first PC, almost half of the variables are relatively important, meaning that their contribution lies above the reference line. For the second PC, 7 variables are relatively important. For the first two PCs combined, 13 variables are relatively important.
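For a single component, the contribution of a variable can be sketched by hand as its squared loading expressed in percent, and the uniform reference level as 100 divided by the number of variables. The example assumes the built-in `USArrests` data as a stand-in and covers one axis at a time only; for several axes combined the contributions are additionally weighted by the eigenvalues.

```r
# Stand-in data: standardized USArrests (4 variables)
pca <- prcomp(scale(USArrests), center = FALSE, scale. = FALSE)

# Contribution (in %) of each variable to each PC: squared loading times 100
contrib <- 100 * pca$rotation^2

# Level expected if all variables contributed equally to a component
uniform_level <- 100 / nrow(pca$rotation)

# Variables that contribute more than the uniform level to PC1
important_pc1 <- rownames(contrib)[contrib[, "PC1"] > uniform_level]

print(round(contrib, 2))
print(uniform_level)
print(important_pc1)
```

With 23 variables, as in the match data, the same reference level would be 100 / 23, or about 4.35%.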
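Multidimensional Scaling in its classical form is also called Principal Coordinates Analysis.
MDS visualizes the similarity or dissimilarity between objects, most commonly in two dimensions (k = 2). Here the objects are the variables rather than the matches: the scaled data are transposed before the distances are computed.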
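The connection to PCA can be illustrated on a small example. With Euclidean distances, classical MDS recovers the same configuration as the PCA scores, up to the sign of each axis. The sketch assumes the built-in `USArrests` data as a stand-in and scales observations (rows); the analysis in this report transposes the data so that variables are scaled instead.

```r
# Stand-in data: standardized USArrests
x <- scale(USArrests)

# Classical MDS on Euclidean distances between rows, two dimensions
mds_points <- cmdscale(dist(x), k = 2)

# Scores on the first two Principal Components
pca_scores <- prcomp(x)$x[, 1:2]

# The two configurations should coincide up to the sign of each axis
same_configuration <- isTRUE(all.equal(abs(unname(mds_points)),
                                       abs(unname(pca_scores))))
print(same_configuration)
```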
library(ggpubr)
library(maptools)
dist.matches = dist(t(matches_data_scaled)) # transpose so that distances are computed between variables, not matches
mds = cmdscale(dist.matches, k=2, eig = TRUE) #k - the maximum dimension of the space
colnames(mds$points) <- c("Dim.1", "Dim.2")
ggscatter(as.data.frame(mds$points), x = "Dim.1", y = "Dim.2",
label = colnames(matches_data_scaled),
size = 1,
repel = TRUE)
plot(mds$points) # points only, to show the dispersion
plot(mds$points, type='n') # empty plot; the labels are added in the next call
pointLabel(mds$points, labels=rownames(mds$points), cex=0.6, adj=0.5)
The graph with points only was printed to show the dispersion of the data more clearly.
Apart from the number of aerials won for both teams, which lies away from the other variables, there are no clear outliers.
For the away team, BlockedShots, ShotsOnTarget and Corners are grouped together. For the home team, BlockedShots, ShotsOffTarget and Corners are grouped.
It seems that there are two clusters – lower left and upper right.
kmeans_clust <- kmeans(mds$points, 2)
plot(mds$points, type='n') # empty plot; labels coloured by cluster are added in the next call
pointLabel(mds$points, labels=rownames(mds$points), col = kmeans_clust$cluster ,cex=0.6, adj=0.5)
The red cluster consists mostly of statistics for the home team and of an offensive nature, while the black cluster contains the data on the defensive part of the game and for the away team.
The dimension reduction analysis showed that many PCs must be taken into account in order to represent the original data set well. However, it still allowed for a roughly two-fold reduction of its dimensionality.
The MDS analysis conducted for two dimensions showed that two clusters of the original variables can be distinguished.