Import necessary packages

library(imager)
library(csv)
library(dplyr)
library(PerformanceAnalytics)
library(corrplot)
library(REdaS)
library(factoextra)
library(cowplot)
library(rgl)

Introduction

This paper is about dimension reduction in the analysis of Pokémon statistics. Principal Component Analysis (PCA) is applied to a dataset with 800 observations and 13 columns.

The dataset was downloaded from Kaggle.

First, let's import the dataset and look at the type of each variable.

pokemon <- read.csv("Pokemon.csv")
head(pokemon)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False
summary(pokemon)
##        X.            Name              Type.1             Type.2         
##  Min.   :  1.0   Length:800         Length:800         Length:800        
##  1st Qu.:184.8   Class :character   Class :character   Class :character  
##  Median :364.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :362.8                                                           
##  3rd Qu.:539.2                                                           
##  Max.   :721.0                                                           
##      Total             HP             Attack       Defense      
##  Min.   :180.0   Min.   :  1.00   Min.   :  5   Min.   :  5.00  
##  1st Qu.:330.0   1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00  
##  Median :450.0   Median : 65.00   Median : 75   Median : 70.00  
##  Mean   :435.1   Mean   : 69.26   Mean   : 79   Mean   : 73.84  
##  3rd Qu.:515.0   3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00  
##  Max.   :780.0   Max.   :255.00   Max.   :190   Max.   :230.00  
##     Sp..Atk          Sp..Def          Speed          Generation   
##  Min.   : 10.00   Min.   : 20.0   Min.   :  5.00   Min.   :1.000  
##  1st Qu.: 49.75   1st Qu.: 50.0   1st Qu.: 45.00   1st Qu.:2.000  
##  Median : 65.00   Median : 70.0   Median : 65.00   Median :3.000  
##  Mean   : 72.82   Mean   : 71.9   Mean   : 68.28   Mean   :3.324  
##  3rd Qu.: 95.00   3rd Qu.: 90.0   3rd Qu.: 90.00   3rd Qu.:5.000  
##  Max.   :194.00   Max.   :230.0   Max.   :180.00   Max.   :6.000  
##   Legendary        
##  Length:800        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

As seen above, the data contain non-numeric values. The 'Legendary' variable takes True/False values; to include it in the analysis, it is converted to numeric values (1/0).

# Legendary is read as the strings "True"/"False", so compare explicitly.
pokemon$Legendary <- ifelse(pokemon$Legendary == "True", 1, 0)

The text variable 'Type.1', indicating the type of a given Pokémon, appears to be a significant variable, as it is one of the main features characterizing a Pokémon. To include it in the analysis, it is transformed from a text variable into a discrete numeric variable, with each type numbered in order of first appearance in the data.

types_unique <- unique(pokemon$Type.1)
pokemon_type_list <- pokemon$Type.1
pokemon$Type.1 <- factor(pokemon$Type.1, levels = types_unique)
pokemon$Type.1 <- as.numeric(pokemon$Type.1)

The Pokémon types are mapped to numeric values as follows:

##       [,1]       [,2]
##  [1,] "Grass"    "1" 
##  [2,] "Fire"     "2" 
##  [3,] "Water"    "3" 
##  [4,] "Bug"      "4" 
##  [5,] "Normal"   "5" 
##  [6,] "Poison"   "6" 
##  [7,] "Electric" "7" 
##  [8,] "Ground"   "8" 
##  [9,] "Fairy"    "9" 
## [10,] "Fighting" "10"
## [11,] "Psychic"  "11"
## [12,] "Rock"     "12"
## [13,] "Ghost"    "13"
## [14,] "Ice"      "14"
## [15,] "Dragon"   "15"
## [16,] "Dark"     "16"
## [17,] "Steel"    "17"
## [18,] "Flying"   "18"
## Percentage of Pokémon without a second type: 48.25 %
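
The code that produced the mapping table and the percentage above is not shown in the report. A minimal sketch of how both could be obtained is given below; the data frame `toy` and its rows are illustrative assumptions, not the real dataset, and the layout of the printed matrix may differ from the output above.

```r
# Toy stand-in for the Pokémon data; the rows are illustrative only.
toy <- data.frame(
  Type.1 = c("Grass", "Grass", "Fire", "Water", "Fire"),
  Type.2 = c("Poison", "Poison", "", "", "Flying"),
  stringsAsFactors = FALSE
)

# Levels in order of first appearance, as in the encoding step above.
types_unique <- unique(toy$Type.1)
type_map <- cbind(types_unique, seq_along(types_unique))
print(type_map)

# Share of rows with an empty second type, in percent.
pct_single <- mean(toy$Type.2 == "") * 100
cat("Percentage of Pokémon without a second type:", pct_single, "%\n")
```

On the real data, the percentage would have to be computed before 'Type.2' is overwritten with 0/1 values.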

Nearly half of all Pokémon (48.25%) have no second type, which is a substantial share. Therefore, the Type.2 variable is converted into a binary variable, where 1 indicates having a second type and 0 indicates its absence.

pokemon$Type.2 <- ifelse(pokemon$Type.2 == "", 0, 1)

To perform the PCA method, we need to restrict the variables to numeric ones only. As a result, the variable ‘Name,’ indicating the name of the Pokémon, will be removed.

pokemon2 <- pokemon[sapply(pokemon, is.numeric)]

Additionally, observations with missing values are removed.

pokemon2 <- na.omit(pokemon2)

Finally, the 'X.' variable needs to be removed, as it is a table index and not a Pokémon characteristic. The variable 'Total' also needs to be removed because it is the sum of the 6 base statistics (HP, Attack, Defense, Sp. Atk, Sp. Def and Speed) and is therefore fully determined by them.

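# Drop the index column and the redundant total by name rather than by position.
pokemon_num <- pokemon2[, !(names(pokemon2) %in% c("X.", "Total"))]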
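
The statement that 'Total' is the sum of the six base statistics can be checked directly. The sketch below does this for the first three rows of the head() output shown earlier; the names `stats`, `stat_cols` and `total_is_sum` are introduced here for illustration only.

```r
# First three rows of the dataset, copied from the head() output above.
stats <- data.frame(
  Total   = c(318, 405, 525),
  HP      = c(45, 60, 80),
  Attack  = c(49, 62, 82),
  Defense = c(49, 63, 83),
  Sp..Atk = c(65, 80, 100),
  Sp..Def = c(65, 80, 100),
  Speed   = c(45, 60, 80)
)

stat_cols <- c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")

# TRUE when Total equals the row sum of the six base statistics.
total_is_sum <- all(stats$Total == rowSums(stats[, stat_cols]))
print(total_is_sum)
```

The same comparison could be run on the full `pokemon` data frame before 'Total' is dropped.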

Let's inspect the boxplots, the correlation matrix and the distributions of the variables:

boxplot(pokemon_num, col = 'skyblue', las = 2)

corrs <- cor(pokemon_num, method = c("spearman"))
round(corrs, 2)
##            Type.1 Type.2   HP Attack Defense Sp..Atk Sp..Def Speed Generation
## Type.1       1.00   0.07 0.01   0.13    0.17    0.01    0.11  0.03       0.16
## Type.2       0.07   1.00 0.10   0.13    0.21    0.13    0.15  0.09       0.05
## HP           0.01   0.10 1.00   0.57    0.43    0.47    0.49  0.27       0.08
## Attack       0.13   0.13 0.57   1.00    0.51    0.36    0.32  0.37       0.05
## Defense      0.17   0.21 0.43   0.51    1.00    0.31    0.58  0.09       0.06
## Sp..Atk      0.01   0.13 0.47   0.36    0.31    1.00    0.57  0.46       0.04
## Sp..Def      0.11   0.15 0.49   0.32    0.58    0.57    1.00  0.32       0.02
## Speed        0.03   0.09 0.27   0.37    0.09    0.46    0.32  1.00      -0.01
## Generation   0.16   0.05 0.08   0.05    0.06    0.04    0.02 -0.01       1.00
## Legendary    0.17   0.06 0.30   0.32    0.27    0.37    0.33  0.31       0.08
##            Legendary
## Type.1          0.17
## Type.2          0.06
## HP              0.30
## Attack          0.32
## Defense         0.27
## Sp..Atk         0.37
## Sp..Def         0.33
## Speed           0.31
## Generation      0.08
## Legendary       1.00
chart.Correlation(pokemon_num, histogram=TRUE, pch=19)

pokemon_cor<-cor(pokemon_num, method="spearman") 
corrplot(pokemon_cor, order ="alphabet", tl.cex=0.6, insig= 'n', col = COL1('Blues', 200))

As we can see, the variables are not strongly correlated with each other. The most correlated pairs are Defense and Sp. Def (0.58), followed by Sp. Def and Sp. Atk and by HP and Attack (both 0.57). The first two could be expected, since special defense is an enhanced form of defense, and Sp. Def and Sp. Atk are both special abilities of the Pokémon. Correlation coefficients of at most 0.58 do not indicate the necessity of removing any of the statistics. Doing so could result in the loss of significant information.
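
Rather than reading the strongest pairs off the printed matrix, they can be ranked programmatically. The sketch below uses the five battle statistics from the Spearman matrix above, with values copied from the rounded output; `top_pairs` is a name introduced here for illustration.

```r
# Sub-matrix of the Spearman correlations printed above (rounded values).
vars <- c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def")
corrs <- matrix(c(
  1.00, 0.57, 0.43, 0.47, 0.49,
  0.57, 1.00, 0.51, 0.36, 0.32,
  0.43, 0.51, 1.00, 0.31, 0.58,
  0.47, 0.36, 0.31, 1.00, 0.57,
  0.49, 0.32, 0.58, 0.57, 1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(corrs), arr.ind = TRUE)
top_pairs <- data.frame(
  var1 = rownames(corrs)[idx[, "row"]],
  var2 = colnames(corrs)[idx[, "col"]],
  rho  = corrs[idx]
)

# Sort from strongest to weakest absolute correlation.
top_pairs <- top_pairs[order(-abs(top_pairs$rho)), ]
print(head(top_pairs, 3))
```

On the full analysis, `corrs` would simply be the matrix returned by `cor(pokemon_num, method = "spearman")`.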

Before moving forward with the PCA method, it is essential to assess whether conducting principal component analysis makes sense. One way to evaluate this is by examining the Kaiser-Meyer-Olkin (KMO) statistic. This metric compares the simple correlations with the partial correlations among variables. A low KMO, which arises when the partial correlations are large relative to the simple correlations, indicates that obtaining a meaningful solution in a lower-dimensional space may not be practical.

Interpretation of KMO coefficients:

In simpler terms, KMO helps us gauge whether the data is suitable for PCA. Higher KMO values, especially in the range of 0.70 to 1.00, indicate that the data is more appropriate for principal component analysis, while lower values may suggest limitations in using PCA for the given dataset.

KMOS(pokemon_num)
## 
## Kaiser-Meyer-Olkin Statistics
## 
## Call: KMOS(x = pokemon_num)
## 
## Measures of Sampling Adequacy (MSA):
##     Type.1     Type.2         HP     Attack    Defense    Sp..Atk    Sp..Def 
##  0.7070785  0.8046637  0.7467445  0.6528347  0.5757854  0.7890903  0.6660647 
##      Speed Generation  Legendary 
##  0.6604249  0.5755205  0.8744270 
## 
## KMO-Criterion: 0.7007622

The overall KMO criterion for the examined dataset is 0.7007622, which only just falls within the 0.70 to 1.00 range but is sufficient to proceed with Principal Component Analysis (PCA).
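
To make the comparison of correlations and partial correlations concrete, the overall KMO criterion can be computed by hand. The sketch below assumes Pearson correlations and obtains the partial correlations from the inverse correlation matrix; `kmo_overall` is a hypothetical helper, not a function from REdaS, and the data are simulated for illustration only.

```r
# Hypothetical helper: overall KMO from a numeric data frame or matrix.
kmo_overall <- function(x) {
  r <- cor(x)                      # simple (Pearson) correlations
  r_inv <- solve(r)                # inverse correlation matrix
  # Partial correlations of each pair given all other variables.
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))
  off <- row(r) != col(r)          # off-diagonal mask
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

# Simulated data with one shared factor, for illustration only.
set.seed(1)
f <- rnorm(200)
x <- sapply(1:5, function(i) f + rnorm(200))
kmo <- kmo_overall(x)
print(kmo)
```

When the variables share common variance, the partial correlations are small and the ratio approaches 1; when each pair is correlated mostly on its own, the partial correlations stay large and the ratio drops.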

Principal Component Analysis

Optimal number of components

Kaiser's Stopping Rule is one way to decide which components to keep: we retain the components with eigenvalues higher than 1. There is also the scree test, in which the eigenvalues are plotted on the vertical axis against the components on the horizontal axis, ordered from largest to smallest. Following the elbow rule, we choose the number of components at the point where the eigenvalue curve flattens out.

pca <- prcomp(pokemon_num, scale. = TRUE)

fviz_eig(pca, choice='eigenvalue', addlabels = TRUE, barfill = 'skyblue') +
  theme_bw() +
  geom_hline(yintercept = 1, linetype = "dashed")

According to Kaiser's rule, it is advisable to opt for 3 components: the first three eigenvalues exceed 1, while the eigenvalue of the 4th component is the highest of those below 1. Additionally, considering the percentage of explained variance is a worthwhile practice. A cumulative explained variance of 70-90% is considered satisfactory. Therefore, the selected number of components should lie within this range of explained variance.

fviz_eig(pca, choice = 'variance', addlabels = TRUE, ylim = c(0, 51), barfill = 'skyblue')

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.7756 1.1274 1.0238 0.97575 0.93534 0.88476 0.7918
## Proportion of Variance 0.3153 0.1271 0.1048 0.09521 0.08749 0.07828 0.0627
## Cumulative Proportion  0.3153 0.4424 0.5472 0.64240 0.72989 0.80817 0.8709
##                            PC8     PC9    PC10
## Standard deviation     0.77725 0.64897 0.51580
## Proportion of Variance 0.06041 0.04212 0.02661
## Cumulative Proportion  0.93128 0.97339 1.00000

In line with the cumulative proportion of variance, the 70% threshold is only reached with 5 principal components. Even though the fourth and fifth components have eigenvalues below 1, it is crucial to also preserve a sufficient amount of information. The cumulative proportion of variance for the first 3 components is 54.72%, which may result in an overly simplified model. On the other hand, the value of 72.99% for the first 5 components appears to be sufficient. Finally, the decision was made to choose 5 components.
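
Both selection criteria discussed above can be evaluated numerically from the component standard deviations. The sketch below copies the values from the summary(pca) output; the names `n_kaiser` and `n_70` are introduced here for illustration.

```r
# Standard deviations of the ten components, copied from summary(pca) above.
sdev <- c(1.7756, 1.1274, 1.0238, 0.97575, 0.93534,
          0.88476, 0.7918, 0.77725, 0.64897, 0.51580)

eigenvalues <- sdev^2                  # variance carried by each component
n_kaiser <- sum(eigenvalues > 1)       # Kaiser's rule: eigenvalue above 1

cum_var <- cumsum(eigenvalues) / sum(eigenvalues)
n_70 <- which(cum_var >= 0.70)[1]      # smallest count reaching 70%

cat("Kaiser's rule:", n_kaiser, "components\n")
cat("70% threshold:", n_70, "components\n")
```

With the fitted object at hand, `pca$sdev` could be used in place of the hard-coded vector.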

fviz_pca_var(pca, col.circle = "blue", alpha.var = 1, col.var = "skyblue")

pca$rotation[,1:5]
##                   PC1          PC2         PC3         PC4         PC5
## Type.1     0.16246735 -0.538587377  0.33143732 -0.01117134 -0.53845313
## Type.2     0.14493166 -0.281649908 -0.22115662 -0.83967485  0.28456553
## HP         0.33523184  0.110762202 -0.15676343  0.34634027  0.46798628
## Attack     0.39675503  0.009997472 -0.02317968  0.03200524  0.05572558
## Defense    0.33359040 -0.400578334 -0.44415626  0.14526782 -0.12455736
## Sp..Atk    0.41920315  0.260899759  0.12238913 -0.09752223  0.02044243
## Sp..Def    0.40505892 -0.054680198 -0.25143204  0.12702113 -0.05449955
## Speed      0.30440338  0.435745189  0.34322746 -0.32591920 -0.12277177
## Generation 0.06527658 -0.436335736  0.59626666  0.12477169  0.57287518
## Legendary  0.36714467  0.080277006  0.25945329  0.07284539 -0.21169002
dim1 <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
dim2 <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
dim3 <- fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)
dim4 <- fviz_contrib(pca, "var", axes=4, xtickslab.rt=90)
dim5 <- fviz_contrib(pca, "var", axes=5, xtickslab.rt=90)

plot_grid(dim1, dim2, dim3, dim4, dim5, ncol = 2)

The first principal component adequately captures the Pokémon statistics. Although the speed statistic falls slightly below the 10% threshold, it is well represented by the second principal component. Additionally, information regarding whether a Pokémon is legendary is encompassed in the first component. Variables such as Type.1, Generation, and Type.2 are distributed across components 2, 3, and 4. As illustrated in the charts, all variables are adequately represented within four components. Therefore, it would be acceptable to omit the fifth principal component and present the data using only four principal components, at the cost of retaining 64.24% rather than 72.99% of the variance.
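
The contribution percentages for a single component can be recovered from the loadings as squared loadings scaled to sum to 100. The sketch below does this for PC1 using the values printed in pca$rotation, and it assumes that the 10% threshold mentioned above is the uniform contribution of 100 divided by the number of variables.

```r
# PC1 loadings, copied from pca$rotation above.
pc1 <- c(Type.1 = 0.16246735, Type.2 = 0.14493166, HP = 0.33523184,
         Attack = 0.39675503, Defense = 0.33359040, Sp..Atk = 0.41920315,
         Sp..Def = 0.40505892, Speed = 0.30440338, Generation = 0.06527658,
         Legendary = 0.36714467)

# Contribution in percent: squared loading relative to the sum of squares.
contrib <- 100 * pc1^2 / sum(pc1^2)

# Reference level if every variable contributed equally (assumption).
threshold <- 100 / length(pc1)

print(round(sort(contrib, decreasing = TRUE), 2))
print(names(contrib)[contrib > threshold])
```

Under this calculation Speed lands just below the reference level on PC1, which matches the reading of the chart given above.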

Final visualisations

The graph below shows that Legendary Pokémon are clearly different from typical Pokémon.

fviz_pca_ind(pca, 
             geom="point", 
             habillage = pokemon_num$Legendary
)

The 3D chart displays Pokémon grouped by their generation. When viewed from the right angle, it becomes apparent that corresponding colors (representing generations) are clustered together in the same region of the chart.

# Colour by generation, taken from the same data frame the PCA was fitted on.
plot3d(pca$x[,1:3], col = pokemon_num$Generation)
rglwidget()

Conclusion

The study aimed to find the smallest number of dimensions that effectively characterizes Pokémon while discarding insignificant information. Principal Component Analysis (PCA) was employed for dimensionality reduction. The number of dimensions finally retained is 4, which together explain 64.24% of the variance. The graphs above suggest that this reduced representation describes the main individual characteristics of Pokémon.