Data Dimension Reduction

The key purpose of dimension reduction is to reduce the complexity of data while retaining as much of its variance as possible. Dimension reduction techniques decrease the number of variables and thereby simplify the dataset. One such method is Principal Components Analysis (PCA), which takes a high-dimensional dataset and produces a new set of fewer variables, the principal components. The transformed data can then be used as input for modeling, clustering and other data manipulations.
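As a minimal sketch of the idea, the example below runs PCA on synthetic data (not the FIFA dataset). The variable names and the two hidden factors are assumptions made for illustration only: six observed variables are generated from two underlying factors, so two components should capture almost all of the variance.

```r
# Synthetic illustration (not the FIFA data): six observed variables
# driven by two hidden factors plus a little noise.
set.seed(42)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a = f1 + rnorm(n, sd = 0.1),
  b = f1 + rnorm(n, sd = 0.1),
  c = f1 + rnorm(n, sd = 0.1),
  d = f2 + rnorm(n, sd = 0.1),
  e = f2 + rnorm(n, sd = 0.1),
  f = f2 + rnorm(n, sd = 0.1)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, accumulated
explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(cumsum(explained), 3)

# Keep only the first two components as the reduced dataset
reduced <- toy_pca$x[, 1:2]
dim(reduced)  # 200 rows, 2 columns
```

The matrix `reduced` plays the role of the "transformed data" mentioned above: it has the same number of rows as the input but only two columns.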

Project Introduction

The aim of the project is to limit the number of dimensions in the dataset while retaining a sufficient level of variance. The result of dimension reduction was later used as the data source for clustering. In the second part of the project, clustering was performed both on the raw data and on the less complex data (the result of dimension reduction) to understand to what extent the two results are alike (for the clustering part, please see: https://rpubs.com/meggie/863150). In the last part of the project, association rules were used to identify which attributes are the most desirable for game players (please see: https://rpubs.com/meggie/863148).

The dataset used was FIFA’22 computer game players (https://www.futbin.com/22/players?page=1&version=gold_rare&sort=version&order=desc, accessed 22-23.01.2022). The dataset was reduced to only Gold Rare players as these are the most valuable players. It is composed of 19 variables and 107 observations:
- Index -> index of each observation, used to better understand the differences in clustering
- Player Name -> character, the name of each player
- Position -> position in which the player plays; can be used for filtering as there are different requirements for goalkeepers, defenders and strikers
- Version -> the same for all players in the dataset (“rare”)
- Rating -> rating in the game
- Player’s price -> player’s price in the game
- Skills -> rating out of 5 stars
- Weak foot -> rating out of 5 stars
- Pace -> score out of 100
- Passing -> score out of 100
- Shooting -> score out of 100
- Dribbling -> score out of 100
- Defense -> score out of 100
- Physically -> score out of 100
- Popularity -> number of clicks (profile views of each player) on the Futbin website
- Base statistics -> statistics of the player
- In game statistics -> statistics of the player
- Games played -> number of games played
- Average goals per game -> average goals scored in one game

db <- read.csv("FIFA_RARE.csv", sep = ";", dec = ",", header = TRUE)
summary(db)
##      Index          Player            Position           Version         
##  Min.   :  1.0   Length:107         Length:107         Length:107        
##  1st Qu.: 27.5   Class :character   Class :character   Class :character  
##  Median : 54.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 54.0                                                           
##  3rd Qu.: 80.5                                                           
##  Max.   :107.0                                                           
##      Rating         PS_price          Skills        Weak_Foot    
##  Min.   :77.00   Min.   :   700   Min.   :1.000   Min.   :2.000  
##  1st Qu.:82.00   1st Qu.:  1000   1st Qu.:2.500   1st Qu.:3.000  
##  Median :84.00   Median :  2600   Median :3.000   Median :4.000  
##  Mean   :84.34   Mean   : 41224   Mean   :3.168   Mean   :3.561  
##  3rd Qu.:86.50   3rd Qu.: 16625   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :93.00   Max.   :969000   Max.   :5.000   Max.   :5.000  
##       Pace         Passing         Shooting      Dribbling        Defense     
##  Min.   :53.0   Min.   :37.00   Min.   :53.0   Min.   :59.00   Min.   :29.00  
##  1st Qu.:73.5   1st Qu.:65.50   1st Qu.:72.5   1st Qu.:78.00   1st Qu.:45.00  
##  Median :80.0   Median :76.00   Median :78.0   Median :82.00   Median :64.00  
##  Mean   :78.9   Mean   :72.49   Mean   :77.2   Mean   :81.35   Mean   :61.78  
##  3rd Qu.:86.0   3rd Qu.:81.00   3rd Qu.:81.0   3rd Qu.:86.50   3rd Qu.:80.00  
##  Max.   :97.0   Max.   :93.00   Max.   :93.0   Max.   :95.00   Max.   :91.00  
##    Physically      Popularity        Base_Stats    In_game_stats 
##  Min.   :50.00   Min.   : -440.0   Min.   :385.0   Min.   : 817  
##  1st Qu.:67.50   1st Qu.:  141.5   1st Qu.:429.0   1st Qu.:2012  
##  Median :76.00   Median :  442.0   Median :446.0   Median :2117  
##  Mean   :74.87   Mean   : 1170.0   Mean   :446.6   Mean   :2015  
##  3rd Qu.:82.00   3rd Qu.: 1405.0   3rd Qu.:462.5   3rd Qu.:2211  
##  Max.   :92.00   Max.   :11196.0   Max.   :502.0   Max.   :2337  
##   Games_played        Avg_goals     
##  Min.   :    4665   Min.   :0.0000  
##  1st Qu.:  164228   1st Qu.:0.0200  
##  Median :  924030   Median :0.1600  
##  Mean   : 3670417   Mean   :0.2604  
##  3rd Qu.: 3665154   3rd Qu.:0.3500  
##  Max.   :26525205   Max.   :1.1100
cat("Number of observations in the dataset:", nrow(db))
## Number of observations in the dataset: 107
cat("Number of variables in the dataset:", ncol(db))
## Number of variables in the dataset: 19

Loading necessary libraries

library(corrplot)
## corrplot 0.92 loaded
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(gridExtra)
library(grid)

Limiting the attributes for dimension reduction

For the purpose of dimension reduction, the dataset was limited to numerical attributes (columns 5 to 19, which also leaves out the Index column). The number of observations remained the same, whereas the number of variables decreased to 15. This simplifies the transformation, since PCA operates on numerical data.

db1 <- db[5:19]

cat("Number of observations in the dataset:", nrow(db1))
## Number of observations in the dataset: 107
cat("Number of variables in the dataset:", ncol(db1))
## Number of variables in the dataset: 15
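The selection `db[5:19]` picks columns by position, which depends on the column order of the CSV file. As an alternative sketch, the columns can be chosen by type and name. The toy data frame below is a hypothetical stand-in with only a few of the report's columns; it is not the real dataset.

```r
# Toy data frame with a hypothetical subset of the report's columns
toy <- data.frame(
  Index    = 1:3,
  Player   = c("A", "B", "C"),
  Position = c("GK", "ST", "CB"),
  Rating   = c(85, 90, 84),
  Pace     = c(60, 92, 75),
  stringsAsFactors = FALSE
)

# Keep numeric columns, then drop the row identifier by name
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols <- setdiff(numeric_cols, "Index")
toy_num <- toy[, numeric_cols, drop = FALSE]
names(toy_num)  # "Rating" "Pace"
```

Selecting by type alone would keep `Index`, because it is numeric, so it is removed explicitly here.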

Assessing correlation between variables

The chart below shows that a couple of attribute pairs are highly positively correlated (In_game_stats and Skills, Games_played and Rating, Games_played and Popularity, Dribbling and Passing). Some pairs also have a high negative correlation (Defense and Passing, Defense and Avg_goals, Skills and Physically). To see how goalkeepers affect the result, dimension reduction using PCA was performed twice: once on the whole dataset db1 and once on db1 without goalkeepers.

# Store the matrix under its own name so the cor() function is not shadowed
cor_matrix <- cor(db1, method = "pearson")
corrplot(cor_matrix)
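Reading strongly correlated pairs off a plot can be complemented by listing them from the correlation matrix. The sketch below uses synthetic data, and the cut-off of 0.7 is an arbitrary assumption rather than a value taken from this report.

```r
# Synthetic data with known relationships (not the FIFA data)
set.seed(1)
x <- rnorm(100)
toy <- data.frame(
  x = x,
  y = x + rnorm(100, sd = 0.2),   # strongly positively related to x
  z = -x + rnorm(100, sd = 0.2),  # strongly negatively related to x
  w = rnorm(100)                  # unrelated noise
)

toy_cor <- cor(toy, method = "pearson")

# Use only the upper triangle so each pair appears once;
# the 0.7 threshold is an assumption for illustration
idx <- which(upper.tri(toy_cor) & abs(toy_cor) > 0.7, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r    = toy_cor[idx]
)

# Strongest relationships first, regardless of sign
strong_pairs[order(-abs(strong_pairs$r)), ]
```

The same steps could be applied to the correlation matrix of `db1` to list the pairs named in the text together with their coefficients.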

Principal Components Analysis (PCA) on the whole db1

# scale. is the full argument name in prcomp(); scale only works through partial matching
pca <- prcomp(db1, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.2226 1.7477 1.3357 1.2719 0.99514 0.84750 0.80545
## Proportion of Variance 0.3293 0.2036 0.1190 0.1079 0.06602 0.04788 0.04325
## Cumulative Proportion  0.3293 0.5330 0.6519 0.7598 0.82577 0.87365 0.91690
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.70662 0.50474 0.38215 0.34007 0.32330 0.29688 0.19518
## Proportion of Variance 0.03329 0.01698 0.00974 0.00771 0.00697 0.00588 0.00254
## Cumulative Proportion  0.95019 0.96717 0.97691 0.98462 0.99158 0.99746 1.00000
##                             PC15
## Standard deviation     4.878e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
fviz_eig(pca,barfill = "#1CA5BE",barcolor = "#1CA5BE",linecolor = "purple")

Based on the scree plot and the cumulative proportions in the summary, 5 dimensions need to be kept to retain at least 79% of the variance: the first four components explain 76.0% and the first five explain 82.6%.
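The number of components can also be derived from the cumulative proportions instead of being read off the plot. In the sketch below the values are copied from the `summary(pca)` output above, and `n_components` is a hypothetical helper introduced only for this illustration.

```r
# Cumulative proportions copied from the summary(pca) output above
cum_var <- c(0.3293, 0.5330, 0.6519, 0.7598, 0.82577, 0.87365, 0.91690,
             0.95019, 0.96717, 0.97691, 0.98462, 0.99158, 0.99746,
             1.00000, 1.00000)

# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches the threshold
n_components <- function(cum_var, threshold) {
  which(cum_var >= threshold)[1]
}

n_components(cum_var, 0.79)  # 5
```

With the fitted object at hand, the same vector should be available as `summary(pca)$importance["Cumulative Proportion", ]`.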

PC1 <- fviz_contrib(pca, choice = "var", axes = 1,fill = "#1CA5BE",color = "#1CA5BE")
PC2 <- fviz_contrib(pca, choice = "var", axes = 2,fill = "#1CA5BE",color = "#1CA5BE")
PC3 <- fviz_contrib(pca, choice = "var", axes = 3,fill = "#1CA5BE",color = "#1CA5BE")
PC4 <- fviz_contrib(pca, choice = "var", axes = 4,fill = "#1CA5BE",color = "#1CA5BE")
PC5 <- fviz_contrib(pca, choice = "var", axes = 5,fill = "#1CA5BE",color = "#1CA5BE")
grid.arrange(PC1, PC2, PC3, PC4, PC5,  ncol=2)

fviz_contrib(pca, choice = "var", axes = 1:5, fill = "#1CA5BE", color = "#1CA5BE")
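For a single component, the contribution of a variable can also be computed directly from the loadings. The sketch below assumes that the percentages shown for one axis of a `prcomp` object equal the squared loadings times 100, and that the reference line in the plot marks the value expected if every variable contributed equally; both are stated here as assumptions about the plot rather than taken from this report. It uses the built-in `mtcars` data, not the FIFA data.

```r
# Built-in example data, four numeric variables
pca_toy <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")],
                  center = TRUE, scale. = TRUE)

# Each column of the rotation matrix has unit length, so the squared
# loadings in a column sum to 1; times 100 gives percentages
contrib <- pca_toy$rotation^2 * 100
round(contrib[, 1:2], 1)

# Every column of contributions adds up to 100%
colSums(contrib)

# Contribution expected if all variables contributed equally
expected <- 100 / nrow(pca_toy$rotation)
expected  # 25
```

Variables whose contribution exceeds `expected` are the ones that matter most for the given component.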

…… //// ….

Principal Components Analysis (PCA) on db1 without goalkeepers

Data preparation for PCA on the dataset without goalkeepers.

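# Drop goalkeepers, then keep the same 15 numerical columns as in db1
db2 <- db[db$Position != "GK", ]
db2 <- db2[5:19]
# scale. is the full argument name in prcomp()
pca2 <- prcomp(db2, center = TRUE, scale. = TRUE)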
summary(pca2)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4136 1.7354 1.4407 1.08483 0.90980 0.85077 0.73183
## Proportion of Variance 0.3884 0.2008 0.1384 0.07846 0.05518 0.04825 0.03571
## Cumulative Proportion  0.3884 0.5891 0.7275 0.80597 0.86115 0.90941 0.94511
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.51600 0.43429 0.33006 0.31494 0.28554 0.23283 0.15672
## Proportion of Variance 0.01775 0.01257 0.00726 0.00661 0.00544 0.00361 0.00164
## Cumulative Proportion  0.96286 0.97544 0.98270 0.98931 0.99475 0.99836 1.00000
##                             PC15
## Standard deviation     3.385e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
fviz_eig(pca2,barfill = "#6B6DFF",barcolor = "#6B6DFF",linecolor = "purple")

Based on the scree plot and the cumulative proportions in the summary, 4 dimensions need to be kept to retain at least 79% of the variance: the first three components explain 72.8% and the first four explain 80.6%.

PC1 <- fviz_contrib(pca2, choice = "var", axes = 1,fill = "#6B6DFF",color = "#6B6DFF")
PC2 <- fviz_contrib(pca2, choice = "var", axes = 2,fill = "#6B6DFF",color = "#6B6DFF")
PC3 <- fviz_contrib(pca2, choice = "var", axes = 3,fill = "#6B6DFF",color = "#6B6DFF")
PC4 <- fviz_contrib(pca2, choice = "var", axes = 4,fill = "#6B6DFF",color = "#6B6DFF")

grid.arrange(PC1, PC2, PC3, PC4,  ncol=2)

fviz_contrib(pca2, choice = "var", axes = 1:4, fill = "#6B6DFF", color = "#6B6DFF")
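In both summaries PC15 has a standard deviation of about 1e-16, which is zero up to floating-point error. This typically happens when one variable is an exact linear combination of others. The means printed by `summary(db)` are consistent with Base_Stats being the sum of Pace, Passing, Shooting, Dribbling, Defense and Physically (78.9 + 72.49 + 77.2 + 81.35 + 61.78 + 74.87 = 446.59, against a Base_Stats mean of 446.6), although the report does not state this, so it is an assumption. The sketch below reproduces the effect on synthetic data with hypothetical column names.

```r
# Synthetic data (not the FIFA data)
set.seed(7)
toy <- data.frame(
  pace     = rnorm(50, mean = 80, sd = 8),
  shooting = rnorm(50, mean = 77, sd = 8),
  defense  = rnorm(50, mean = 60, sd = 15)
)
# Hypothetical total column: an exact sum of the other three
toy$total <- toy$pace + toy$shooting + toy$defense

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
toy_pca$sdev

# The last component carries no variance (zero up to rounding error)
tail(toy_pca$sdev, 1) < 1e-8
```

If the assumption holds for the real data, the 15 variables effectively span 14 dimensions, which matches the cumulative proportion reaching 1 at PC14 in both analyses.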

Conclusion

Comparing the two PCAs, and taking 79% of variance as the cut-off point, the results differ by 1 dimension. Excluding goalkeepers makes the dataset more homogeneous, so the leading components capture a larger share of the variance and one component fewer is needed. The difference in the cumulative percentage of explained variance at the 4th dimension between the two analyses is less than 5 percentage points (80.6% versus 76.0%). Thus, it can be argued that the effect of removing goalkeepers on dimension reduction is negligible.
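The comparison can be laid out side by side. The values below are the cumulative proportions for the first five components, copied from the two summaries above; the column names are chosen for this illustration only.

```r
# Cumulative proportions for PC1..PC5, copied from the two summaries
with_gk    <- c(0.3293, 0.5330, 0.6519, 0.7598, 0.82577)
without_gk <- c(0.3884, 0.5891, 0.7275, 0.80597, 0.86115)

comparison <- data.frame(
  components = 1:5,
  with_gk    = with_gk,
  without_gk = without_gk,
  difference = without_gk - with_gk
)
comparison

# Gap at four components, in percentage points
round(100 * comparison$difference[4], 1)  # 4.6
```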

References

http://www.sthda.com/english/wiki/eigenvalues-quick-data-visualization-with-factoextra-r-software-and-data-mining#install-and-load-factoextra
https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html
https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
https://rpubs.com/wkonarz/pca_mcdonalds
https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/
https://www.displayr.com/working-with-principal-components-analysis-results/