The goal of this project is to apply dimensionality reduction to data from the FIFA 2022 game in order to reduce the number of variables. To do this, I used principal component analysis (PCA).
The FIFA game involves building a team of football players and playing matches using that team. To better assess players' skills in the game, there is a complex system of statistics evaluating individual players. These statistics cover many aspects, such as attack, shooting, passing, physicality, or defensive skills. Many of them are very detailed, focusing on a single minor skill that seems difficult to measure for the huge number of players in the game. This raises doubts about their necessity: perhaps many of them could be combined into broader, more general statistics without losing information that is valuable to users.
This project aims to address this issue by reducing the number of variables describing players' skills. To do this, I used data from the "FIFA 2022" game, which was released in 2021. The data was sourced from Kaggle: https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset?select=players_22.csv , and contains information on over 19 thousand players.
source: https://wall.alphacoders.com/big.php?i=1156913
#Loading packages
library(corrplot)
library(factoextra)
library(gridExtra)
library(dplyr)
#Loading data; goalkeepers are excluded because many of the outfield-player variables are missing for them
fifa <- read.csv("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 2/players_22.csv")
fifa <- fifa[fifa$player_positions!= "GK", ]
fifa_1 <- fifa[ , c('short_name', 'overall', 'potential', 'weak_foot', 'skill_moves', 'international_reputation', names(fifa)[38:77])]
#Data preview
any(is.na(fifa_1))
## [1] FALSE
summary(fifa_1)
## short_name overall potential weak_foot
## Length:17107 Min. :47.00 Min. :49.00 Min. :1.000
## Class :character 1st Qu.:62.00 1st Qu.:67.00 1st Qu.:3.000
## Mode :character Median :66.00 Median :71.00 Median :3.000
## Mean :65.94 Mean :71.24 Mean :3.002
## 3rd Qu.:70.00 3rd Qu.:75.00 3rd Qu.:3.000
## Max. :93.00 Max. :95.00 Max. :5.000
## skill_moves international_reputation pace shooting
## Min. :2.000 Min. :1.000 Min. :28.00 Min. :18.00
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:62.00 1st Qu.:42.00
## Median :2.000 Median :1.000 Median :69.00 Median :54.00
## Mean :2.521 Mean :1.096 Mean :68.21 Mean :52.35
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:76.00 3rd Qu.:63.00
## Max. :5.000 Max. :5.000 Max. :97.00 Max. :94.00
## passing dribbling defending physic
## Min. :25.00 Min. :27.00 Min. :14.0 Min. :29.00
## 1st Qu.:51.00 1st Qu.:57.00 1st Qu.:37.0 1st Qu.:59.00
## Median :58.00 Median :64.00 Median :56.0 Median :66.00
## Mean :57.31 Mean :62.56 Mean :51.7 Mean :64.82
## 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:64.0 3rd Qu.:72.00
## Max. :93.00 Max. :95.00 Max. :91.0 Max. :90.00
## attacking_crossing attacking_finishing attacking_heading_accuracy
## Min. :15.00 Min. :10.00 Min. :17.0
## 1st Qu.:45.00 1st Qu.:37.00 1st Qu.:48.0
## Median :56.00 Median :53.00 Median :57.0
## Mean :54.03 Mean :50.25 Mean :56.5
## 3rd Qu.:64.00 3rd Qu.:63.00 3rd Qu.:65.0
## Max. :94.00 Max. :95.00 Max. :93.0
## attacking_short_passing attacking_volleys skill_dribbling skill_curve
## Min. :23.00 Min. :10.00 Min. :16.00 Min. :12.00
## 1st Qu.:58.00 1st Qu.:34.00 1st Qu.:55.00 1st Qu.:40.00
## Median :64.00 Median :46.00 Median :63.00 Median :52.00
## Mean :62.83 Mean :46.36 Mean :60.95 Mean :51.36
## 3rd Qu.:69.00 3rd Qu.:57.00 3rd Qu.:69.00 3rd Qu.:63.00
## Max. :94.00 Max. :90.00 Max. :96.00 Max. :94.00
## skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## Min. :10.00 Min. :20.00 Min. :24.00 Min. :27.00
## 1st Qu.:34.00 1st Qu.:49.00 1st Qu.:58.00 1st Qu.:62.00
## Median :43.00 Median :58.00 Median :64.00 Median :69.00
## Mean :45.79 Mean :56.43 Mean :63.37 Mean :68.18
## 3rd Qu.:57.00 3rd Qu.:65.00 3rd Qu.:70.00 3rd Qu.:76.00
## Max. :94.00 Max. :93.00 Max. :96.00 Max. :97.00
## movement_sprint_speed movement_agility movement_reactions movement_balance
## Min. :27.00 Min. :25.00 Min. :29.00 Min. :26.00
## 1st Qu.:62.00 1st Qu.:59.00 1st Qu.:56.00 1st Qu.:60.00
## Median :69.00 Median :68.00 Median :62.00 Median :68.00
## Mean :68.22 Mean :66.57 Mean :61.88 Mean :66.85
## 3rd Qu.:76.00 3rd Qu.:75.00 3rd Qu.:68.00 3rd Qu.:75.00
## Max. :97.00 Max. :96.00 Max. :94.00 Max. :96.00
## power_shot_power power_jumping power_stamina power_strength
## Min. :20.00 Min. :29.00 Min. :24.00 Min. :19.00
## 1st Qu.:51.00 1st Qu.:58.00 1st Qu.:60.00 1st Qu.:58.00
## Median :61.00 Median :66.00 Median :68.00 Median :67.00
## Mean :59.19 Mean :65.83 Mean :67.29 Mean :65.66
## 3rd Qu.:69.00 3rd Qu.:74.00 3rd Qu.:75.00 3rd Qu.:75.00
## Max. :95.00 Max. :95.00 Max. :97.00 Max. :97.00
## power_long_shots mentality_aggression mentality_interceptions
## Min. :11.00 Min. :20.00 Min. :10.0
## 1st Qu.:39.00 1st Qu.:50.00 1st Qu.:35.0
## Median :54.00 Median :60.00 Median :56.0
## Mean :51.04 Mean :59.27 Mean :50.5
## 3rd Qu.:63.00 3rd Qu.:70.00 3rd Qu.:65.0
## Max. :94.00 Max. :95.00 Max. :91.0
## mentality_positioning mentality_vision mentality_penalties mentality_composure
## Min. :12.00 Min. :13.00 Min. :13.00 Min. :30.00
## 1st Qu.:48.00 1st Qu.:48.00 1st Qu.:42.00 1st Qu.:53.00
## Median :58.00 Median :57.00 Median :51.00 Median :60.00
## Mean :55.35 Mean :55.85 Mean :51.51 Mean :60.05
## 3rd Qu.:65.00 3rd Qu.:65.00 3rd Qu.:61.00 3rd Qu.:67.00
## Max. :96.00 Max. :95.00 Max. :93.00 Max. :96.00
## defending_marking_awareness defending_standing_tackle defending_sliding_tackle
## Min. :10.00 Min. :10.00 Min. :10.00
## 1st Qu.:37.00 1st Qu.:37.00 1st Qu.:34.00
## Median :55.00 Median :59.00 Median :56.00
## Mean :50.72 Mean :52.32 Mean :49.93
## 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.:64.00
## Max. :93.00 Max. :93.00 Max. :92.00
## goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## Min. : 2.00 Min. : 2.00 Min. : 2.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :10.00 Median :10.00 Median :10.00
## Mean :10.35 Mean :10.39 Mean :10.36
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :32.00 Max. :33.00 Max. :38.00
## goalkeeping_positioning goalkeeping_reflexes
## Min. : 2.00 Min. : 2.00
## 1st Qu.: 8.00 1st Qu.: 8.00
## Median :10.00 Median :10.00
## Mean :10.37 Mean :10.33
## 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :33.00 Max. :37.00
dim(fifa_1)
## [1] 17107 46
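As we can see, we have over 17k observations and 46 columns: the player's name and 45 numeric variables. Most of these are ratings that can theoretically range from 0 to 100, although in practice they never reach such extreme values. The exceptions are weak_foot, skill_moves, and international_reputation, which are measured on a scale from 1 to 5.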
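The column types and observed ranges described above can also be checked programmatically. The following is a minimal sketch on a toy data frame; the `players` object and its values are hypothetical stand-ins for `fifa_1`, not rows from the dataset.

```r
# Toy stand-in for fifa_1 (hypothetical values, not taken from the dataset)
players <- data.frame(
  short_name = c("A. Player", "B. Player", "C. Player"),
  overall    = c(66, 70, 93),
  weak_foot  = c(3, 4, 5),
  pace       = c(62, 69, 97),
  stringsAsFactors = FALSE
)

# Which columns are numeric? The name column is character, so it is not counted.
is_num <- vapply(players, is.numeric, logical(1))
n_numeric <- sum(is_num)

# Observed minimum and maximum of every numeric column
ranges <- t(vapply(players[is_num], range, numeric(2)))
colnames(ranges) <- c("min", "max")

print(n_numeric)
print(ranges)
```

Applied to the real data, the same two steps would separate the name column from the numeric ratings and show which variables are on the 1 to 5 scale.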
fifa_1 %>%
filter(short_name == "R. Lewandowski") %>%
t()
## [,1]
## short_name "R. Lewandowski"
## overall "92"
## potential "92"
## weak_foot "4"
## skill_moves "4"
## international_reputation "5"
## pace "78"
## shooting "92"
## passing "79"
## dribbling "86"
## defending "44"
## physic "82"
## attacking_crossing "71"
## attacking_finishing "95"
## attacking_heading_accuracy "90"
## attacking_short_passing "85"
## attacking_volleys "89"
## skill_dribbling "85"
## skill_curve "79"
## skill_fk_accuracy "85"
## skill_long_passing "70"
## skill_ball_control "88"
## movement_acceleration "77"
## movement_sprint_speed "79"
## movement_agility "77"
## movement_reactions "93"
## movement_balance "82"
## power_shot_power "90"
## power_jumping "85"
## power_stamina "76"
## power_strength "86"
## power_long_shots "87"
## mentality_aggression "81"
## mentality_interceptions "49"
## mentality_positioning "95"
## mentality_vision "81"
## mentality_penalties "90"
## mentality_composure "88"
## defending_marking_awareness "35"
## defending_standing_tackle "42"
## defending_sliding_tackle "19"
## goalkeeping_diving "15"
## goalkeeping_handling "6"
## goalkeeping_kicking "12"
## goalkeeping_positioning "8"
## goalkeeping_reflexes "10"
Using Robert Lewandowski as an example, we can observe that he has excellent statistics in shooting and attacking_finishing, but very poor goalkeeping skills. The data therefore reflect reality accurately.
#checking correlations using correlation matrix
yyy.cor<-cor(fifa_1[,2:46], method="pearson")
corrplot(yyy.cor, type = "lower",order ="alphabet", tl.col = "black",tl.cex=0.6)
As we can see, some variables are highly correlated with each other, and they turn out to be related to defending. This is intuitive, so among these variables we can keep only the most general one (defending) and drop the three detailed ones.
fifa_1 <- fifa_1[, !names(fifa_1) %in% c("defending_marking_awareness", "defending_sliding_tackle","defending_standing_tackle")]
yyy.cor <- cor(fifa_1[,2:43], method="pearson")
corrplot(yyy.cor, type = "lower",order ="alphabet", tl.col = "black",tl.cex=0.6)
After excluding a few variables, the correlation matrix looks much better. However, there are still some correlated variables, such as shooting with attacking_finishing or physic with power_strength. Interestingly, most of the correlations are positive, and only a few pairs of variables are negatively correlated (e.g., power_strength with movement_balance).
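Reading strongly correlated pairs off a large plot is error-prone, so they can also be listed directly from the correlation matrix. Below is a minimal sketch on simulated data: the variables `tackle`, `marking`, and `pace` and the cutoff of 0.9 are assumptions chosen for illustration, not values from the analysis.

```r
set.seed(1)
n <- 200
tackle  <- rnorm(n)
marking <- tackle + rnorm(n, sd = 0.1)  # built to be almost identical to tackle
pace    <- rnorm(n)                     # independent of the other two
toy <- data.frame(tackle, marking, pace)

cor_mat <- cor(toy, method = "pearson")

# Look only at the upper triangle so that each pair is reported once,
# and flag pairs whose absolute correlation exceeds the chosen cutoff
cutoff <- 0.9                           # arbitrary threshold (assumption)
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

On the player data, the same approach would be applied to `yyy.cor` to list candidate variables for removal.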
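PCA reduces the dimensionality of the data by transforming the original variables into a new set of uncorrelated variables (principal components), ordered so that each one captures as much of the remaining variance as possible. Keeping only the first few components therefore minimizes the loss of variance. The variables are centered and scaled first, so that each contributes equally regardless of its range.
fifa_pca <- prcomp(fifa_1[,2:43], center = TRUE, scale. = TRUE)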
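The transformation described above can be illustrated on a small example. The sketch below uses the built-in `USArrests` dataset as a stand-in for the player data (an assumption made only to keep the example self-contained) and shows that, for standardized variables, the component variances are the eigenvalues of the correlation matrix and that the component scores are uncorrelated.

```r
# USArrests ships with base R; it stands in for the player data here
X <- USArrests

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# The variances of the components equal the eigenvalues of the correlation matrix
eig <- eigen(cor(X))
print(rbind(prcomp = pca$sdev^2, eigen = eig$values))

# The component scores are mutually uncorrelated (identity correlation matrix)
score_cor <- cor(pca$x)
print(round(score_cor, 10))

# With standardized inputs, the total variance equals the number of variables
total_var <- sum(pca$sdev^2)
print(total_var)
```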
summary(fifa_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.942 2.5704 1.74878 1.50049 1.15732 1.05387 0.97840
## Proportion of Variance 0.370 0.1573 0.07281 0.05361 0.03189 0.02644 0.02279
## Cumulative Proportion 0.370 0.5274 0.60018 0.65379 0.68568 0.71212 0.73491
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.97343 0.97173 0.96763 0.94534 0.89783 0.82830 0.75217
## Proportion of Variance 0.02256 0.02248 0.02229 0.02128 0.01919 0.01634 0.01347
## Cumulative Proportion 0.75747 0.77996 0.80225 0.82353 0.84272 0.85906 0.87253
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.7187 0.69953 0.65880 0.6546 0.60984 0.57088 0.53872
## Proportion of Variance 0.0123 0.01165 0.01033 0.0102 0.00885 0.00776 0.00691
## Cumulative Proportion 0.8848 0.89648 0.90681 0.9170 0.92587 0.93363 0.94054
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.51697 0.48634 0.47952 0.47280 0.46757 0.45025 0.43581
## Proportion of Variance 0.00636 0.00563 0.00547 0.00532 0.00521 0.00483 0.00452
## Cumulative Proportion 0.94690 0.95253 0.95801 0.96333 0.96853 0.97336 0.97788
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.40226 0.37539 0.34957 0.34656 0.32804 0.31531 0.28347
## Proportion of Variance 0.00385 0.00336 0.00291 0.00286 0.00256 0.00237 0.00191
## Cumulative Proportion 0.98173 0.98509 0.98800 0.99086 0.99342 0.99579 0.99770
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.2593 0.16350 0.02541 0.02505 0.02327 0.02135 0.01753
## Proportion of Variance 0.0016 0.00064 0.00002 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion 0.9993 0.99994 0.99995 0.99997 0.99998 0.99999 1.00000
fviz_eig(fifa_pca, barfill = "#B22222", addlabels = TRUE)
The results show that PC1 explains 37% of the variance, PC2 about 16%, and PC3 only about 7%. Moreover, from the seventh principal component onward the plot flattens noticeably, with subsequent components each explaining a very similar share of the variance. To reach a minimum threshold of 70% of variance explained, 6 components are needed, while 17 components retain 90% of the variance.
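The component counts quoted above can be recomputed from the standard deviations printed by `summary(fifa_pca)`. The sketch below copies those rounded values by hand, so small rounding differences from the exact model are possible; `n_for` is a hypothetical helper name.

```r
# Standard deviations of PC1..PC42, copied from the summary output above
sdev <- c(
  3.942,   2.5704,  1.74878, 1.50049, 1.15732, 1.05387, 0.97840,
  0.97343, 0.97173, 0.96763, 0.94534, 0.89783, 0.82830, 0.75217,
  0.7187,  0.69953, 0.65880, 0.6546,  0.60984, 0.57088, 0.53872,
  0.51697, 0.48634, 0.47952, 0.47280, 0.46757, 0.45025, 0.43581,
  0.40226, 0.37539, 0.34957, 0.34656, 0.32804, 0.31531, 0.28347,
  0.2593,  0.16350, 0.02541, 0.02505, 0.02327, 0.02135, 0.01753
)

# Proportion of variance per component and its running total
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches a threshold
n_for <- function(threshold) which(cum_var >= threshold)[1]

n_70 <- n_for(0.70)
n_90 <- n_for(0.90)
print(c(n_70 = n_70, n_90 = n_90))  # 6 and 17
```

With the fitted model available, `fifa_pca$sdev` could be used in place of the hand-copied vector.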
fviz_eig(fifa_pca, barfill = "#B22222", addlabels = TRUE, choice='eigenvalue')
The plot above illustrates the eigenvalues, which represent the variance of the data projected onto each principal axis. According to the Kaiser rule, only components with eigenvalues greater than 1 should be retained, which indicates that 6 principal components should be used. In this way we retain about 71% of the variance, which is consistent with the previous conclusion.
var <- get_pca_var(fifa_pca)
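The Kaiser rule is easy to check numerically, because each eigenvalue is the square of the corresponding standard deviation. The sketch below uses only the first eight standard deviations from the summary output; since they are sorted in decreasing order, all later components have eigenvalues below 1 as well.

```r
# Standard deviations of PC1..PC8, copied from the summary output above
sdev_first8 <- c(3.942, 2.5704, 1.74878, 1.50049,
                 1.15732, 1.05387, 0.97840, 0.97343)

# Eigenvalue of a component = its variance = squared standard deviation
eigenvalues <- sdev_first8^2
print(round(eigenvalues, 3))

# Kaiser rule: keep components whose eigenvalue exceeds 1
n_kaiser <- sum(eigenvalues > 1)
print(n_kaiser)  # 6
```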
fviz_contrib(fifa_pca, "var", axes=1, xtickslab.rt=90, fill = "#B22222")
fviz_contrib(fifa_pca, "var", axes=2, xtickslab.rt=90, fill = "#B22222")
fviz_contrib(fifa_pca, "var", axes=3, xtickslab.rt=90, fill = "#B22222")
The plots above show which variables contribute most to each of the first three principal components. Looking at them, we can observe a few similarities among the statistics within a component. PC1 is driven by variables such as dribbling, ball_control, passing, and shooting, which are especially important for forwards, while the variables in PC2 are more characteristic of defenders due to the high importance of defending, physicality, and strength.
Below is a correlation plot with a quality measure. Variables that are positively correlated are grouped together, while negatively correlated variables are placed on opposite sides of the plot's origin. The distance between a variable and the origin indicates how well that variable is represented on the factor map: variables further from the origin are better represented.
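The contributions shown in these bar charts can be sketched with base R alone. As far as I understand the factoextra definition, a variable's contribution to a component is its squared loading expressed as a percentage, so the contributions within one component sum to 100. The example uses the built-in `USArrests` dataset as a stand-in for the player data (an assumption for self-containment).

```r
# USArrests stands in for the player data here
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix has unit length, so squared loadings
# sum to 1 per component; multiplying by 100 gives percentages
contrib <- 100 * pca$rotation^2
print(round(contrib, 1))

# Contributions within each component add up to 100%
print(colSums(contrib))

# Variable with the largest contribution to PC1
top_pc1 <- names(which.max(contrib[, "PC1"]))
print(top_pc1)

# Contribution each variable would have if all contributed equally
expected <- 100 / nrow(contrib)
print(expected)
```

Variables whose contribution exceeds `expected` are the ones that stand out for a given component.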
fviz_pca_var(fifa_pca, col.var="#B22222")
In our case, there are many variables, which makes the plot less clear. Nevertheless, we can observe that only a few variables are negatively correlated. Moreover, the plot reflects some of the principal components calculated earlier. For example, the cluster of arrows at the bottom of the plot may correspond to PC2, as it includes defending, physicality, and other variables relevant for defenders. Furthermore, we can see a significant clustering of variables near the center of the plot, which indicates that they are poorly represented by the first two components.
The aim of this project was to reduce the number of variables describing football players' statistics in FIFA 2022 while preserving as much information as possible. For this purpose, PCA was used, and it was found that 6 principal components retain about 70% of the variance. To retain substantially more variance, the number of principal components would need to be increased significantly (17 components for 90%). This suggests that many of the statistics are well constructed and that it is not easy to eliminate them without losing variance in the data. An interesting observation comes from analysing which variables contribute to each principal component: variables within the same component tend to be similar and to characterize specific roles on the field. This demonstrates that both the data and the analysis itself make sense and can be related to real life. In conclusion, dimensionality reduction is a very useful tool, applicable even in less economically oriented areas of life, and can be valuable in computer game development.
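The quality of representation discussed above can be computed directly. In this minimal sketch, again on the built-in `USArrests` dataset rather than the player data, the coordinate of a variable on a component is its correlation with that component, the squared coordinate (cos2) is the quality measure, and the sum over PC1 and PC2 is the squared length of the variable's arrow on the plot.

```r
X <- USArrests
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Coordinates of the variables: loadings multiplied by component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# For standardized data these coordinates equal the correlations
# between the original variables and the component scores
cor_with_scores <- cor(X, pca$x)

# cos2: share of each variable's variance captured by each component
cos2 <- coord^2

# Quality on the PC1-PC2 map = squared distance of the arrow tip from the origin
quality_12 <- rowSums(cos2[, 1:2])
print(round(quality_12, 3))
```

A value of `quality_12` close to 1 means the arrow nearly touches the unit circle, while a value close to 0 places the variable near the center of the plot.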