library(imager)
library(csv)
library(dplyr)
library(PerformanceAnalytics)
library(corrplot)
library(REdaS)
library(factoextra)
library(cowplot)
library(rgl)
The data set was downloaded from Kaggle -> source
First, let’s import the dataset and look at the type of each variable.
pokemon <- read.csv("Pokemon.csv")
head(pokemon)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
summary(pokemon)
## X. Name Type.1 Type.2
## Min. : 1.0 Length:800 Length:800 Length:800
## 1st Qu.:184.8 Class :character Class :character Class :character
## Median :364.5 Mode :character Mode :character Mode :character
## Mean :362.8
## 3rd Qu.:539.2
## Max. :721.0
## Total HP Attack Defense
## Min. :180.0 Min. : 1.00 Min. : 5 Min. : 5.00
## 1st Qu.:330.0 1st Qu.: 50.00 1st Qu.: 55 1st Qu.: 50.00
## Median :450.0 Median : 65.00 Median : 75 Median : 70.00
## Mean :435.1 Mean : 69.26 Mean : 79 Mean : 73.84
## 3rd Qu.:515.0 3rd Qu.: 80.00 3rd Qu.:100 3rd Qu.: 90.00
## Max. :780.0 Max. :255.00 Max. :190 Max. :230.00
## Sp..Atk Sp..Def Speed Generation
## Min. : 10.00 Min. : 20.0 Min. : 5.00 Min. :1.000
## 1st Qu.: 49.75 1st Qu.: 50.0 1st Qu.: 45.00 1st Qu.:2.000
## Median : 65.00 Median : 70.0 Median : 65.00 Median :3.000
## Mean : 72.82 Mean : 71.9 Mean : 68.28 Mean :3.324
## 3rd Qu.: 95.00 3rd Qu.: 90.0 3rd Qu.: 90.00 3rd Qu.:5.000
## Max. :194.00 Max. :230.0 Max. :180.00 Max. :6.000
## Legendary
## Length:800
## Class :character
## Mode :character
##
##
##
As seen above, the data contains non-numeric variables. The ‘Legendary’ variable takes True/False values stored as text; to include it in the analysis, it is converted to numeric values (1/0).
# Legendary is read as the strings "True"/"False", so compare explicitly
pokemon$Legendary <- ifelse(pokemon$Legendary == "True", 1, 0)
The text variable ‘Type.1’, indicating the primary type of a given Pokémon, appears to be a significant variable, as it is one of the main features that define a Pokémon. To include it in the analysis, it is transformed from a text variable into a discrete numeric variable, with codes assigned in order of first appearance.
types_unique <- unique(pokemon$Type.1)
pokemon_type_list <- pokemon$Type.1
pokemon$Type.1 <- factor(pokemon$Type.1, levels = types_unique)
pokemon$Type.1 <- as.numeric(pokemon$Type.1)
The mapping of each Pokémon type to its numeric code is as follows:
## [,1] [,2]
## [1,] "Grass" "1"
## [2,] "Fire" "2"
## [3,] "Water" "3"
## [4,] "Bug" "4"
## [5,] "Normal" "5"
## [6,] "Poison" "6"
## [7,] "Electric" "7"
## [8,] "Ground" "8"
## [9,] "Fairy" "9"
## [10,] "Fighting" "10"
## [11,] "Psychic" "11"
## [12,] "Rock" "12"
## [13,] "Ghost" "13"
## [14,] "Ice" "14"
## [15,] "Dragon" "15"
## [16,] "Dark" "16"
## [17,] "Steel" "17"
## [18,] "Flying" "18"
## Percentage of Pokémon without a second type: 48.25 %
Almost half of the Pokémon (48.25%) have no second type. Therefore, the Type.2 variable is converted into a binary variable, where 1 indicates that a second type is present and 0 that it is absent.
pokemon$Type.2 <- ifelse(pokemon$Type.2 == "", 0, 1)
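The code that printed the mapping table and the percentage above is not shown in the text. A minimal sketch of how such output might be produced, using a small hypothetical data frame `toy` in place of the real `Type.1` and `Type.2` columns:

```r
# Toy data standing in for the Type.1 / Type.2 columns (hypothetical values)
toy <- data.frame(
  Type.1 = c("Grass", "Grass", "Fire", "Water", "Fire"),
  Type.2 = c("Poison", "", "", "Flying", "Dragon"),
  stringsAsFactors = FALSE
)

# Codes follow the order of first appearance, as with unique() in the text
types_unique <- unique(toy$Type.1)
type_map <- cbind(types_unique, seq_along(types_unique), deparse.level = 0)
print(type_map)

# Share of rows with an empty second type, in percent
pct_no_second <- 100 * mean(toy$Type.2 == "")
cat("Percentage without a second type:", pct_no_second, "%\n")

# Integer codes and binary indicator, mirroring the conversions in the text
toy$Type.1 <- as.numeric(factor(toy$Type.1, levels = types_unique))
toy$Type.2 <- ifelse(toy$Type.2 == "", 0, 1)
```

The resulting codes are arbitrary labels: their numeric order reflects only the row order of the data, not any ranking of the types.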
To run PCA, we need to restrict the variables to numeric ones only. As a result, the ‘Name’ variable, which holds the name of each Pokémon, is dropped.
pokemon2 <- pokemon[sapply(pokemon, is.numeric)]
Additionally, observations with missing values have been removed.
pokemon2 <- na.omit(pokemon2)
Finally, the ‘X.’ variable needs to be removed, as it is an index of the table and not a characteristic of the Pokémon. The ‘Total’ variable also needs to be removed because it is the sum of the six base statistics (HP, Attack, Defense, Sp..Atk, Sp..Def and Speed) and is therefore linearly dependent on them.
pokemon_num <- pokemon2[, -c(1, 4)]
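The claim that ‘Total’ is the sum of the six statistics can be checked directly, and the two columns can be dropped by name rather than by position. A minimal sketch, using the first three rows shown by `head(pokemon)` as a stand-in data frame (the names `toy`, `stats` and `toy_num` are illustrative only):

```r
# First three rows from head(pokemon): Bulbasaur, Ivysaur, Venusaur
toy <- data.frame(
  X.      = 1:3,
  Total   = c(318, 405, 525),
  HP      = c(45, 60, 80),
  Attack  = c(49, 62, 82),
  Defense = c(49, 63, 83),
  Sp..Atk = c(65, 80, 100),
  Sp..Def = c(65, 80, 100),
  Speed   = c(45, 60, 80)
)

# Check that Total equals the row-wise sum of the six base statistics
stats <- c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")
total_is_sum <- all(rowSums(toy[stats]) == toy$Total)
print(total_is_sum)

# Drop the index and the derived column by name instead of by position
toy_num <- toy[, !(names(toy) %in% c("X.", "Total"))]
print(names(toy_num))
```

Selecting by name keeps working if the column order changes, which positional indexing such as `-c(1, 4)` does not guarantee.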
Let’s inspect the boxplots, the correlation matrix and the distributions of the variables:
boxplot(pokemon_num, col = 'skyblue', las = 2)
corrs <- cor(pokemon_num, method = c("spearman"))
round(corrs, 2)
## Type.1 Type.2 HP Attack Defense Sp..Atk Sp..Def Speed Generation
## Type.1 1.00 0.07 0.01 0.13 0.17 0.01 0.11 0.03 0.16
## Type.2 0.07 1.00 0.10 0.13 0.21 0.13 0.15 0.09 0.05
## HP 0.01 0.10 1.00 0.57 0.43 0.47 0.49 0.27 0.08
## Attack 0.13 0.13 0.57 1.00 0.51 0.36 0.32 0.37 0.05
## Defense 0.17 0.21 0.43 0.51 1.00 0.31 0.58 0.09 0.06
## Sp..Atk 0.01 0.13 0.47 0.36 0.31 1.00 0.57 0.46 0.04
## Sp..Def 0.11 0.15 0.49 0.32 0.58 0.57 1.00 0.32 0.02
## Speed 0.03 0.09 0.27 0.37 0.09 0.46 0.32 1.00 -0.01
## Generation 0.16 0.05 0.08 0.05 0.06 0.04 0.02 -0.01 1.00
## Legendary 0.17 0.06 0.30 0.32 0.27 0.37 0.33 0.31 0.08
## Legendary
## Type.1 0.17
## Type.2 0.06
## HP 0.30
## Attack 0.32
## Defense 0.27
## Sp..Atk 0.37
## Sp..Def 0.33
## Speed 0.31
## Generation 0.08
## Legendary 1.00
chart.Correlation(pokemon_num, histogram=TRUE, pch=19)
pokemon_cor<-cor(pokemon_num, method="spearman")
corrplot(pokemon_cor, order ="alphabet", tl.cex=0.6, insig= 'n', col = COL1('Blues', 200))
As we can see, the variables are not strongly correlated with each
other. The most correlated pairs of variables are Defense - Sp. Def
(0.58), Sp. Def - Sp. Atk (0.57) and HP - Attack (0.57). The first two
could be expected, since special defense is closely related to defense,
and Sp. Def and Sp. Atk are both special abilities of the Pokémon. A
correlation coefficient of at most 0.58 does not justify removing one of
the statistics. Doing so could result in the loss of significant
information.
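The coefficients above are Spearman correlations, which amount to Pearson correlations computed on ranks. A minimal sketch of that equivalence, using the HP and Attack values of the six rows printed by `head(pokemon)` (the vector names are illustrative):

```r
# HP and Attack for the six rows shown by head(pokemon)
hp     <- c(45, 60, 80, 80, 39, 58)
attack <- c(49, 62, 82, 100, 52, 64)

# Spearman correlation as computed by cor()
rho_spearman <- cor(hp, attack, method = "spearman")

# The same quantity obtained as a Pearson correlation of (average) ranks
rho_ranks <- cor(rank(hp), rank(attack), method = "pearson")

print(c(spearman = rho_spearman, pearson_on_ranks = rho_ranks))
```

Because only ranks enter the calculation, the coefficient is insensitive to the extreme values visible in the boxplots.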
Before moving forward with PCA, it is essential to assess whether principal component analysis makes sense for this data. One way to evaluate this is the Kaiser-Meyer-Olkin (KMO) statistic, which compares the correlations among variables with their partial correlations. A low KMO arises when the partial correlations are large relative to the correlations, and it indicates that a meaningful solution in a lower-dimensional space may not be attainable.
Interpretation of KMO coefficients: in simpler terms, KMO helps us gauge whether the data is suitable for PCA. Higher KMO values, especially in the range of 0.70 to 1.00, indicate that the data is more appropriate for principal component analysis, while lower values may suggest limitations in using PCA for the given dataset.
KMOS(pokemon_num)
##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = pokemon_num)
##
## Measures of Sampling Adequacy (MSA):
## Type.1 Type.2 HP Attack Defense Sp..Atk Sp..Def
## 0.7070785 0.8046637 0.7467445 0.6528347 0.5757854 0.7890903 0.6660647
## Speed Generation Legendary
## 0.6604249 0.5755205 0.8744270
##
## KMO-Criterion: 0.7007622
The KMO statistic for the examined dataset is approximately 0.70, which just reaches the threshold and is sufficient to proceed with Principal Component Analysis (PCA).
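To make the comparison of correlations and partial correlations concrete, the overall KMO criterion can be computed by hand from the correlation matrix and its inverse. The sketch below uses the textbook formula on four columns of the built-in `mtcars` data as a stand-in. The helper `kmo_overall` is hypothetical and is not claimed to reproduce the internals of `KMOS()`:

```r
# Overall KMO from the textbook definition (hypothetical helper)
kmo_overall <- function(x) {
  r   <- cor(x)     # Pearson correlation matrix
  inv <- solve(r)   # inverse of the correlation matrix
  # Partial correlations: p_ij = -inv_ij / sqrt(inv_ii * inv_jj)
  p   <- -inv / sqrt(outer(diag(inv), diag(inv)))
  off <- row(r) != col(r)   # mask selecting off-diagonal entries
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

# Stand-in data: four numeric columns of the built-in mtcars data set
kmo_value <- kmo_overall(mtcars[, c("mpg", "disp", "hp", "wt")])
print(kmo_value)
```

The ratio approaches 1 when the partial correlations are small compared with the raw correlations, and falls toward 0 when they dominate.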
Kaiser’s stopping rule is one way to decide how many components to keep: we retain the components with eigenvalues greater than 1. There is also the scree test, in which the eigenvalues are plotted on the vertical axis against the components on the horizontal axis, ordered from largest to smallest. Following the elbow rule, we keep the components that precede the point where the curve flattens out.
pca <- prcomp(pokemon_num, scale = TRUE)
fviz_eig(pca, choice='eigenvalue', addlabels = TRUE, barfill = 'skyblue') +
  theme_bw() +
  # Horizontal reference line at eigenvalue = 1 (Kaiser's rule)
  geom_hline(yintercept = 1, linetype = "dashed")
According to Kaiser’s rule, it is advisable to opt for 3 components, since the eigenvalue of the 4th component, the largest of the remaining ones, is already below one. It is also worthwhile to consider the percentage of explained variance. A cumulative explained variance between 70% and 90% is generally considered satisfactory, so the selected number of components should fall within this range.
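Kaiser’s rule can also be applied without a plot. The eigenvalues are the squared standard deviations returned by `prcomp`, and with scaling enabled they coincide with the eigenvalues of the correlation matrix. A minimal sketch, assuming the built-in `USArrests` data as a stand-in for the Pokémon table:

```r
# PCA on standardized variables (stand-in data set)
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues two ways: squared sdev, and eigen() on the correlation matrix
eig_pca <- pca_toy$sdev^2
eig_cor <- eigen(cor(USArrests))$values
print(round(eig_pca, 3))

# Kaiser's rule: count the components with eigenvalue greater than 1
n_kaiser <- sum(eig_pca > 1)
print(n_kaiser)
```

With standardized variables each original variable contributes a variance of 1, so the rule keeps only components that carry more information than a single variable.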
fviz_eig(pca, choice = 'variance', addlabels = TRUE, ylim = c(0, 51), barfill = 'skyblue')
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7756 1.1274 1.0238 0.97575 0.93534 0.88476 0.7918
## Proportion of Variance 0.3153 0.1271 0.1048 0.09521 0.08749 0.07828 0.0627
## Cumulative Proportion 0.3153 0.4424 0.5472 0.64240 0.72989 0.80817 0.8709
## PC8 PC9 PC10
## Standard deviation 0.77725 0.64897 0.51580
## Proportion of Variance 0.06041 0.04212 0.02661
## Cumulative Proportion 0.93128 0.97339 1.00000
According to the cumulative proportion of variance, the 70% threshold is surpassed only once 5 principal components are selected. Even though the fourth and fifth components have standard deviations (and therefore eigenvalues) below 1, it is also important to preserve a sufficient amount of information. The cumulative proportion of variance for the first 3 components is 54.72%, which may result in an overly simplified model, whereas 72.99% for the first 5 components appears to be sufficient. On this criterion, 5 components are retained.
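Both criteria can be recomputed from the standard deviations printed by `summary(pca)`. The sketch below copies those ten values into a vector and derives the number of components suggested by Kaiser’s rule and by the 70% threshold (the variable names are illustrative):

```r
# Standard deviations of PC1..PC10 as printed by summary(pca)
sdev <- c(1.7756, 1.1274, 1.0238, 0.97575, 0.93534,
          0.88476, 0.7918, 0.77725, 0.64897, 0.51580)

eigenvalues <- sdev^2                       # variance of each component
prop        <- eigenvalues / sum(eigenvalues)
cum_prop    <- cumsum(prop)

n_kaiser <- sum(eigenvalues > 1)            # components with eigenvalue > 1
n_70     <- which(cum_prop >= 0.70)[1]      # first count reaching 70%

print(round(cum_prop, 4))
print(c(kaiser = n_kaiser, seventy_percent = n_70))
```

The two rules disagree here (3 versus 5 components), which is why the choice is discussed rather than read off a single statistic.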
fviz_pca_var(pca, col.circle = "blue", alpha.var = 1, col.var = "skyblue")
pca$rotation[,1:5]
## PC1 PC2 PC3 PC4 PC5
## Type.1 0.16246735 -0.538587377 0.33143732 -0.01117134 -0.53845313
## Type.2 0.14493166 -0.281649908 -0.22115662 -0.83967485 0.28456553
## HP 0.33523184 0.110762202 -0.15676343 0.34634027 0.46798628
## Attack 0.39675503 0.009997472 -0.02317968 0.03200524 0.05572558
## Defense 0.33359040 -0.400578334 -0.44415626 0.14526782 -0.12455736
## Sp..Atk 0.41920315 0.260899759 0.12238913 -0.09752223 0.02044243
## Sp..Def 0.40505892 -0.054680198 -0.25143204 0.12702113 -0.05449955
## Speed 0.30440338 0.435745189 0.34322746 -0.32591920 -0.12277177
## Generation 0.06527658 -0.436335736 0.59626666 0.12477169 0.57287518
## Legendary 0.36714467 0.080277006 0.25945329 0.07284539 -0.21169002
dim1 <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
dim2 <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
dim3 <- fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)
dim4 <- fviz_contrib(pca, "var", axes=4, xtickslab.rt=90)
dim5 <- fviz_contrib(pca, "var", axes=5, xtickslab.rt=90)
plot_grid(dim1, dim2, dim3, dim4, dim5, ncol = 2)
The first principal component captures all of the Pokémon statistics well. Although the Speed statistic falls slightly below the 10% contribution threshold, it is well represented by the second principal component. Whether a Pokémon is legendary is also captured by the first component. The variables Type.1, Generation, and Type.2 are spread across components 2, 3, and 4. As the charts illustrate, all variables are adequately represented within the first four components. Therefore, it is acceptable to omit the fifth principal component and present the data using only four, at the cost of lowering the cumulative explained variance to 64.24%.
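The contribution of a variable to a component is its squared loading expressed as a percentage, and the 10% reference corresponds to equal contributions from 10 variables. A minimal sketch using the PC1 column of `pca$rotation` printed above (the vector name `pc1` is illustrative):

```r
# PC1 loadings copied from pca$rotation[, 1:5]
pc1 <- c(Type.1 = 0.16246735, Type.2 = 0.14493166, HP = 0.33523184,
         Attack = 0.39675503, Defense = 0.33359040, Sp..Atk = 0.41920315,
         Sp..Def = 0.40505892, Speed = 0.30440338,
         Generation = 0.06527658, Legendary = 0.36714467)

# Contribution in percent: squared loading relative to the sum of squares
contrib <- 100 * pc1^2 / sum(pc1^2)

# Reference level: what each variable would contribute if all were equal
threshold <- 100 / length(pc1)

print(round(sort(contrib, decreasing = TRUE), 2))
print(names(contrib)[contrib < threshold])
```

Speed comes out at roughly 9.3%, just under the 10% line, while Type.1, Type.2 and Generation contribute far less to the first component.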
The graph below shows that Legendary Pokémon are clearly separated from ordinary Pokémon.
fviz_pca_ind(pca,
             geom = "point",
             habillage = pokemon_num$Legendary
)
The 3D chart displays Pokémon colored by their generation. When viewed from the right angle, it becomes apparent that points of the same color (representing the same generation) cluster together in the same region of the chart.
# Take the colors from pokemon_num so they align row by row with pca$x
plot3d(pca$x[,1:3], col = pokemon_num$Generation)
rglwidget()
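The coordinates plotted in these charts are the PCA scores, that is, the standardized data projected onto the loading vectors. A minimal sketch of that relationship, again assuming the built-in `USArrests` data as a stand-in:

```r
# PCA on standardized variables (stand-in data set)
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Scores by hand: center and scale the data, then project onto the loadings
scores_manual <- scale(USArrests) %*% pca_toy$rotation

# Largest absolute difference from the scores stored by prcomp()
max_diff <- max(abs(scores_manual - pca_toy$x))
print(max_diff)

# Keeping only the first k columns gives the reduced representation
k <- 2
reduced <- pca_toy$x[, 1:k]
print(dim(reduced))
```

Truncating the score matrix to its first columns is what the dimensionality reduction amounts to in practice.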
The study aimed to find the smallest number of dimensions that effectively characterizes Pokémon while discarding insignificant information. PCA (Principal Component Analysis) was used for dimensionality reduction. The chosen number of dimensions is 4, which together explain 64.24% of the variance. The graphs above suggest that this reduced representation preserves the main characteristics of individual Pokémon.