The purpose of dimension reduction is to reduce the complexity of data while retaining as much of its variance as possible. Dimension reduction techniques decrease the number of variables and thereby simplify the dataset. One such method is Principal Component Analysis (PCA), which takes a high-dimensional dataset and produces a smaller set of new variables (principal components). The transformed data can then be used as input for modeling, clustering and other data manipulations.
The aim of the project is to limit the number of dimensions in the dataset while retaining a sufficient level of variance. The result of the dimension reduction was later used as the data source for clustering. In the second part of the project, clustering was performed on both the raw data and the less complex data (the result of dimension reduction) to understand to what extent the two results are alike (for the clustering part, please see: https://rpubs.com/meggie/863150). In the last part of the project, association rules were used to identify which attributes are the most desirable for game players (please see: https://rpubs.com/meggie/863148).
The dataset used contains players from the FIFA’22 computer game (https://www.futbin.com/22/players?page=1&version=gold_rare&sort=version&order=desc, accessed 22-23.01.2022). The dataset was restricted to Gold Rare players only, as these are the most valuable players. It is composed of 19 variables and 107 observations:
- Index -> index of each observation, used to better understand the differences in clustering
- Player Name -> character; name of each player
- Position -> position in which the player plays; can be used for filtering, as there are different requirements for goalkeepers, defenders and strikers
- Version -> the same for all players in the dataset: “rare”
- Rating -> rating in the game
- Player’s price -> player’s price in the game
- Skills -> rating out of 5 stars
- Weak foot -> rating out of 5 stars
- Pace -> score out of 100
- Passing -> score out of 100
- Shooting -> score out of 100
- Dribbling -> score out of 100
- Defense -> score out of 100
- Physically -> score out of 100
- Popularity -> number of clicks (profile views of each player) on the Futbin website
- Base statistics -> statistics of the player
- In game statistics -> statistics of the player
- Games played -> number of games played
- Average goals per game -> average goals scored in one game
db <- read.csv("FIFA_RARE.csv", sep = ";", dec = ",", header = TRUE)
summary(db)
##      Index          Player            Position           Version
##  Min.   :  1.0   Length:107         Length:107         Length:107
##  1st Qu.: 27.5   Class :character   Class :character   Class :character
##  Median : 54.0   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 54.0
##  3rd Qu.: 80.5
##  Max.   :107.0
##      Rating         PS_price          Skills        Weak_Foot
##  Min.   :77.00   Min.   :   700   Min.   :1.000   Min.   :2.000
##  1st Qu.:82.00   1st Qu.:  1000   1st Qu.:2.500   1st Qu.:3.000
##  Median :84.00   Median :  2600   Median :3.000   Median :4.000
##  Mean   :84.34   Mean   : 41224   Mean   :3.168   Mean   :3.561
##  3rd Qu.:86.50   3rd Qu.: 16625   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :93.00   Max.   :969000   Max.   :5.000   Max.   :5.000
##       Pace         Passing         Shooting      Dribbling        Defense
##  Min.   :53.0   Min.   :37.00   Min.   :53.0   Min.   :59.00   Min.   :29.00
##  1st Qu.:73.5   1st Qu.:65.50   1st Qu.:72.5   1st Qu.:78.00   1st Qu.:45.00
##  Median :80.0   Median :76.00   Median :78.0   Median :82.00   Median :64.00
##  Mean   :78.9   Mean   :72.49   Mean   :77.2   Mean   :81.35   Mean   :61.78
##  3rd Qu.:86.0   3rd Qu.:81.00   3rd Qu.:81.0   3rd Qu.:86.50   3rd Qu.:80.00
##  Max.   :97.0   Max.   :93.00   Max.   :93.0   Max.   :95.00   Max.   :91.00
##    Physically      Popularity        Base_Stats    In_game_stats
##  Min.   :50.00   Min.   : -440.0   Min.   :385.0   Min.   : 817
##  1st Qu.:67.50   1st Qu.:  141.5   1st Qu.:429.0   1st Qu.:2012
##  Median :76.00   Median :  442.0   Median :446.0   Median :2117
##  Mean   :74.87   Mean   : 1170.0   Mean   :446.6   Mean   :2015
##  3rd Qu.:82.00   3rd Qu.: 1405.0   3rd Qu.:462.5   3rd Qu.:2211
##  Max.   :92.00   Max.   :11196.0   Max.   :502.0   Max.   :2337
##   Games_played        Avg_goals
##  Min.   :    4665   Min.   :0.0000
##  1st Qu.:  164228   1st Qu.:0.0200
##  Median :  924030   Median :0.1600
##  Mean   : 3670417   Mean   :0.2604
##  3rd Qu.: 3665154   3rd Qu.:0.3500
##  Max.   :26525205   Max.   :1.1100
cat("Number of observations in the dataset:", nrow(db))
## Number of observations in the dataset: 107
cat("Number of variables in the dataset:", ncol(db))
## Number of variables in the dataset: 19
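The file is read with a semicolon as the field separator and a comma as the decimal mark. Base R's `read.csv2()` uses exactly these two settings as defaults, so it can serve as a shorthand. The sketch below illustrates this on a small inline sample; the sample rows are hypothetical and are not taken from FIFA_RARE.csv.

```r
# Hypothetical two-row sample in the same layout (semicolon-separated,
# decimal comma); the values are illustrative, not from FIFA_RARE.csv.
sample_csv <- "Player;Rating;Avg_goals\nPlayer A;84;0,16\nPlayer B;86;0,35"

# Explicit arguments, mirroring the read.csv() call used for the real file
explicit <- read.csv(text = sample_csv, sep = ";", dec = ",", header = TRUE)

# read.csv2() uses sep = ";" and dec = "," by default
shorthand <- read.csv2(text = sample_csv)

# Both calls should parse the sample the same way
identical(explicit, shorthand)
str(shorthand)
```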
library(corrplot)
## corrplot 0.92 loaded
library("factoextra")
## Loading required package: ggplot2
library(ggplot2)
library(gridExtra)
library(grid)
For the purpose of dimension reduction, the dataset was limited to numerical attributes only; the Index column, although numeric, was also dropped because it is merely a row identifier. The number of observations remained the same, whereas the number of variables decreased to 15. This simplifies the transformation, since measures such as variance and correlation are defined for numerical data.
db1 <- db[5:19]
cat("Number of observations in the dataset:", nrow(db1))
## Number of observations in the dataset: 107
cat("Number of variables in the dataset:", ncol(db1))
## Number of variables in the dataset: 15
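Selecting columns by position (5 to 19) works as long as the column order of the file does not change. As an alternative, the numeric columns can be selected by type and the identifier dropped by name. The following is a minimal sketch on a hypothetical miniature data frame (`toy` and its values are illustrative assumptions, not the FIFA data).

```r
# Hypothetical miniature of the data layout; values are illustrative only.
toy <- data.frame(
  Index    = 1:3,
  Player   = c("Player A", "Player B", "Player C"),
  Position = c("ST", "CB", "GK"),
  Rating   = c(84, 86, 82),
  Pace     = c(90, 70, 55),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, then drop the row identifier by name
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols <- setdiff(numeric_cols, "Index")
toy_numeric  <- toy[, numeric_cols, drop = FALSE]

names(toy_numeric)
```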
The chart below shows several attribute pairs that are highly positively correlated (In_game_stats and Skills, Games_played and Rating, Games_played and Popularity, Dribbling and Passing). Some pairs are also strongly negatively correlated (Defense and Passing, Defense and Avg_goals, Skills and Physically). For better understanding, dimension reduction using PCA was performed twice: once on the whole dataset db1, and a second time on db1 without goalkeepers.
# Store the matrix under its own name so the base function cor() is not shadowed
cor_matrix <- cor(db1, method = "pearson")
corrplot(cor_matrix)
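Reading the strongest pairs off a correlation plot by eye can be error-prone with 15 variables. A possible complement is to list the pairs sorted by absolute correlation. The sketch below uses the built-in `mtcars` data as a stand-in (it is not the FIFA data), and the object names `cm` and `cor_pairs` are illustrative.

```r
# Built-in mtcars stands in for db1 here; it is not the FIFA data.
cm <- cor(mtcars, method = "pearson")

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cm), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting on the absolute value keeps strong negative correlations near the top alongside strong positive ones.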
# Standardize the variables (scale. is the full argument name in prcomp)
pca <- prcomp(db1, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.2226 1.7477 1.3357 1.2719 0.99514 0.84750 0.80545
## Proportion of Variance 0.3293 0.2036 0.1190 0.1079 0.06602 0.04788 0.04325
## Cumulative Proportion  0.3293 0.5330 0.6519 0.7598 0.82577 0.87365 0.91690
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.70662 0.50474 0.38215 0.34007 0.32330 0.29688 0.19518
## Proportion of Variance 0.03329 0.01698 0.00974 0.00771 0.00697 0.00588 0.00254
## Cumulative Proportion  0.95019 0.96717 0.97691 0.98462 0.99158 0.99746 1.00000
##                             PC15
## Standard deviation     4.878e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
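The standard deviation of PC15 is numerically zero (4.878e-16), which typically indicates that one variable is an exact linear combination of others. The summary statistics are consistent with Base_Stats being the sum of the six attribute scores (their means add up to roughly 446.6, the mean of Base_Stats), although this is an inference from the printed output rather than something the source confirms. A minimal sketch of how such a dependency shows up, using random toy data with an artificially constructed sum column:

```r
# Toy data: `total` is constructed as the exact sum of the other columns,
# mimicking the suspected relationship; values are random, not FIFA data.
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$total <- toy$x1 + toy$x2 + toy$x3

# The rank of the centered matrix is one less than the number of columns
centered <- scale(toy, center = TRUE, scale = FALSE)
qr(centered)$rank

# PCA shows the same thing: the last standard deviation is numerically zero
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
round(toy_pca$sdev, 10)

# On the real data the corresponding check would be (untested assumption):
# all.equal(db1$Base_Stats,
#           db1$Pace + db1$Shooting + db1$Passing +
#           db1$Dribbling + db1$Defense + db1$Physically)
```

If the check holds, the redundant column adds no information of its own, which would explain why 14 components already account for all of the variance.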
fviz_eig(pca,barfill = "#1CA5BE",barcolor = "#1CA5BE",linecolor = "purple")
Based on the scree plot and the summary above, 5 dimensions need to be kept in order to retain at least 79% of the variance: the first four components explain 75.98%, while the first five explain 82.58%.
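The number of components needed for a given variance threshold can also be derived programmatically rather than read off the plot. The sketch below recomputes it from the standard deviations printed in the summary; with the fitted object at hand, `pca$sdev` could be used instead of the hard-coded vector. The names `threshold` and `n_keep` are illustrative.

```r
# Standard deviations copied from the printed summary(pca) output
sdev <- c(2.2226, 1.7477, 1.3357, 1.2719, 0.99514, 0.84750, 0.80545,
          0.70662, 0.50474, 0.38215, 0.34007, 0.32330, 0.29688, 0.19518,
          4.878e-16)

# Each component's share of total variance, then the running total
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.79
n_keep <- which(cum_var >= threshold)[1]

n_keep
round(cum_var[1:5], 4)
```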
PC1 <- fviz_contrib(pca, choice = "var", axes = 1,fill = "#1CA5BE",color = "#1CA5BE")
PC2 <- fviz_contrib(pca, choice = "var", axes = 2,fill = "#1CA5BE",color = "#1CA5BE")
PC3 <- fviz_contrib(pca, choice = "var", axes = 3,fill = "#1CA5BE",color = "#1CA5BE")
PC4 <- fviz_contrib(pca, choice = "var", axes = 4,fill = "#1CA5BE",color = "#1CA5BE")
PC5 <- fviz_contrib(pca, choice = "var", axes = 5,fill = "#1CA5BE",color = "#1CA5BE")
grid.arrange(PC1, PC2, PC3, PC4, PC5, ncol=2)
fviz_contrib(pca, choice = "var", axes = 1:5, fill = "#1CA5BE", color = "#1CA5BE")
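The contribution plots can be cross-checked without factoextra. For a `prcomp` fit, each column of `rotation` is a unit-length loading vector, so the squared loadings of a component, multiplied by 100, give each variable's percentage contribution to that component. The sketch below uses the built-in `USArrests` data as a stand-in (not the FIFA data); that this matches the values plotted by `fviz_contrib()` is an assumption based on the usual definition of contribution.

```r
# Built-in USArrests stands in for db1; it is not the FIFA data.
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings of one component sum to 1, so multiplying by 100
# expresses each variable's contribution as a percentage.
contrib <- toy_pca$rotation^2 * 100

round(contrib[, 1:2], 1)

# Each component's contributions add up to 100%
colSums(contrib)

# Contribution each variable would have if all contributed equally
100 / nrow(contrib)
```

Variables whose contribution exceeds the equal-share value can be regarded as the main drivers of a component.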
Data preparation for PCA on the dataset without goalkeepers.
db2<-db[db$Position != "GK",]
db2 <- db2[5:19]
# Standardize the variables (scale. is the full argument name in prcomp)
pca2 <- prcomp(db2, center = TRUE, scale. = TRUE)
summary(pca2)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4136 1.7354 1.4407 1.08483 0.90980 0.85077 0.73183
## Proportion of Variance 0.3884 0.2008 0.1384 0.07846 0.05518 0.04825 0.03571
## Cumulative Proportion  0.3884 0.5891 0.7275 0.80597 0.86115 0.90941 0.94511
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.51600 0.43429 0.33006 0.31494 0.28554 0.23283 0.15672
## Proportion of Variance 0.01775 0.01257 0.00726 0.00661 0.00544 0.00361 0.00164
## Cumulative Proportion  0.96286 0.97544 0.98270 0.98931 0.99475 0.99836 1.00000
##                             PC15
## Standard deviation     3.385e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
fviz_eig(pca2,barfill = "#6B6DFF",barcolor = "#6B6DFF",linecolor = "purple")
Based on the scree plot and the summary above, 4 dimensions need to be kept in order to retain at least 79% of the variance: the first three components explain 72.75%, while the first four explain 80.60%.
PC1 <- fviz_contrib(pca2, choice = "var", axes = 1,fill = "#6B6DFF",color = "#6B6DFF")
PC2 <- fviz_contrib(pca2, choice = "var", axes = 2,fill = "#6B6DFF",color = "#6B6DFF")
PC3 <- fviz_contrib(pca2, choice = "var", axes = 3,fill = "#6B6DFF",color = "#6B6DFF")
PC4 <- fviz_contrib(pca2, choice = "var", axes = 4,fill = "#6B6DFF",color = "#6B6DFF")
grid.arrange(PC1, PC2, PC3, PC4, ncol=2)
fviz_contrib(pca2, choice = "var", axes = 1:4, fill = "#6B6DFF", color = "#6B6DFF")
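Since the reduced data is later used as the input for clustering, the retained component scores have to be extracted from the PCA object. A minimal sketch, assuming the built-in `iris` measurements as a stand-in for the FIFA data and k-means as an example clustering method (the source does not state which method the clustering part uses):

```r
# Built-in iris measurements stand in for db2; this is not the FIFA data.
toy_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# `x` holds the observations expressed in component coordinates (scores);
# keep the first n_keep columns as the reduced dataset
n_keep <- 2
scores <- as.data.frame(toy_pca$x[, 1:n_keep])

dim(scores)
head(scores, 3)

# The reduced data can then feed a clustering routine, e.g. k-means
set.seed(123)
clusters <- kmeans(scores, centers = 3, nstart = 10)
table(clusters$cluster)
```

For the analyses here, the analogous objects would be `pca$x[, 1:5]` and `pca2$x[, 1:4]`.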
Comparing the two PCAs, the difference in results (assuming 79% of variance as the cut-off point) is one dimension: five components are needed for the full dataset and four when goalkeepers are excluded. Excluding goalkeepers reduces the variability in the dataset, so fewer components are needed to reach the same share of explained variance. The difference in cumulative explained variance at the 4th dimension between the two analyses is less than 5 percentage points (80.60% versus 75.98%). Thus, it can be argued that the effect of removing goalkeepers on dimension reduction is negligible.
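The gap between the two analyses can be made explicit by subtracting the cumulative proportions printed in the two summaries. The sketch below does this for the first five components; the vector names are illustrative.

```r
# Cumulative proportions copied from the two printed summary() outputs
cum_all   <- c(0.3293, 0.5330, 0.6519, 0.7598, 0.82577)   # all players
cum_no_gk <- c(0.3884, 0.5891, 0.7275, 0.80597, 0.86115)  # goalkeepers excluded

# Gap in cumulative explained variance, in percentage points,
# for 1 to 5 retained components
gap_pp <- (cum_no_gk - cum_all) * 100
names(gap_pp) <- paste0("PC", 1:5)

round(gap_pp, 2)
```

The gap at four components is about 4.6 percentage points, in line with the statement above, but it is larger at fewer components, so the size of the effect depends on where the cut-off is placed.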