Since I played the Pokemon Red 10 years ago on my Gameboy Color, I wondered about the statistics that define different Pokemon. Are the fair? Are some Pokemon just better than others, or maybe there are some groups, or specializations? When I encountered the Pokemon data set on Kaggle, I knew this was a chance to finally answer this questions.
The goal of this paper is to test what distributions the Pokemon stat-lines follow. I want to also see if there are any significant groups within the population and if there is some logic behind the assignment of statistics to Pokemon.
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(labdsv)
library(smacof)
library(psych)
library(pca3d)
library(NbClust)
library(ClusterR)
library(wesanderson)
library(factoextra)
library(clustertend)
library(knitr)
options(scipen=999)
# Define nice colors
cYellow = '#FADA5E'
cBlue = '#378CC7'
The source is a fantastic Kaggle data set containing the statistic for every Pokemon released.
pokemon <- read.csv('data/Pokemon.csv')
pokemon <- rename(pokemon, 'Special.Attack' = 'Sp..Atk', 'Special.Defense' = 'Sp..Def')
a <- pokemon %>% filter(pokemon$Name == 'Pikachu')
kable(a)
| X. | Name | Type.1 | Type.2 | Total | HP | Attack | Defense | Special.Attack | Special.Defense | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | Pikachu | Electric | 320 | 35 | 55 | 40 | 50 | 50 | 90 | 1 | False |
dim(pokemon)
[1] 800 13
Pikachu is present and has the correct stat-line. Overall, this data set has 800 observations and 11 variables.
str(pokemon)
'data.frame': 800 obs. of 13 variables:
$ X. : int 1 2 3 3 4 5 6 6 6 7 ...
$ Name : Factor w/ 800 levels "Abomasnow","AbomasnowMega Abomasnow",..: 81 330 746 747 103 104 100 101 102 666 ...
$ Type.1 : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 10 7 7 7 7 7 18 ...
$ Type.2 : Factor w/ 19 levels "","Bug","Dark",..: 15 15 15 15 1 1 9 4 9 1 ...
$ Total : int 318 405 525 625 309 405 534 634 634 314 ...
$ HP : int 45 60 80 80 39 58 78 78 78 44 ...
$ Attack : int 49 62 82 100 52 64 84 130 104 48 ...
$ Defense : int 49 63 83 123 43 58 78 111 78 65 ...
$ Special.Attack : int 65 80 100 122 60 80 109 130 159 50 ...
$ Special.Defense: int 65 80 100 120 50 65 85 85 115 64 ...
$ Speed : int 45 60 80 80 65 80 100 100 100 43 ...
$ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
$ Legendary : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
a <- summary(pokemon)[,c(1:4,12,13)]
b <- summary(pokemon)[,c(5:11)]
kable(a)
| X. | Name | Type.1 | Type.2 | Generation | Legendary | |
|---|---|---|---|---|---|---|
| Min. : 1.0 | Abomasnow : 1 | Water :112 | :386 | Min. :1.000 | False:735 | |
| 1st Qu.:184.8 | AbomasnowMega Abomasnow: 1 | Normal : 98 | Flying : 97 | 1st Qu.:2.000 | True : 65 | |
| Median :364.5 | Abra : 1 | Grass : 70 | Ground : 35 | Median :3.000 | NA | |
| Mean :362.8 | Absol : 1 | Bug : 69 | Poison : 34 | Mean :3.324 | NA | |
| 3rd Qu.:539.2 | AbsolMega Absol : 1 | Psychic: 57 | Psychic : 33 | 3rd Qu.:5.000 | NA | |
| Max. :721.0 | Accelgor : 1 | Fire : 52 | Fighting: 26 | Max. :6.000 | NA | |
| NA | (Other) :794 | (Other):342 | (Other) :189 | NA | NA |
kable(b)
| Total | HP | Attack | Defense | Special.Attack | Special.Defense | Speed | |
|---|---|---|---|---|---|---|---|
| Min. :180.0 | Min. : 1.00 | Min. : 5 | Min. : 5.00 | Min. : 10.00 | Min. : 20.0 | Min. : 5.00 | |
| 1st Qu.:330.0 | 1st Qu.: 50.00 | 1st Qu.: 55 | 1st Qu.: 50.00 | 1st Qu.: 49.75 | 1st Qu.: 50.0 | 1st Qu.: 45.00 | |
| Median :450.0 | Median : 65.00 | Median : 75 | Median : 70.00 | Median : 65.00 | Median : 70.0 | Median : 65.00 | |
| Mean :435.1 | Mean : 69.26 | Mean : 79 | Mean : 73.84 | Mean : 72.82 | Mean : 71.9 | Mean : 68.28 | |
| 3rd Qu.:515.0 | 3rd Qu.: 80.00 | 3rd Qu.:100 | 3rd Qu.: 90.00 | 3rd Qu.: 95.00 | 3rd Qu.: 90.0 | 3rd Qu.: 90.00 | |
| Max. :780.0 | Max. :255.00 | Max. :190 | Max. :230.00 | Max. :194.00 | Max. :230.0 | Max. :180.00 | |
| NA | NA | NA | NA | NA | NA | NA |
The data set has a nice distribution of variables - 2 categorical, 1 binary and 8 interval. The X variable is a Pokemon Id. Note that it is not unique, because later generations added new evolutions for existing Pokemon.
Type 1 designates the primary type of the Pokemon - it influences it’s strengths and weaknesses (e.g. fire Pokemon are weak against water Pokemon) and it’s overall design. Some Pokemon also have a second type.
Generation is the number of generation this Pokemon it’s from. First generation originated in 1996, and the 6th one in 2013. This data set is slightly old, as it lack the 7 generation from 2016.
Next are the statistics for each Pokemon.The Total variable is a simple sum of all statistics. HP stands for Hit Points, the Pokemon health. Attack signifies how much damage can it do, and it’s compared to the defense of the enemy’s Pokemon. Special attack is similar to normal attack, but it’s compared to the Special Defense. Speed indicates which Pokemon attacks first in a given round. Aside from these basic stats, each Pokemon has it’s moves, abilities and other variables not included in this data set. That being said, the basic stat-line and it’s distribution has a great impact on how powerful a given Pokemon is.
# All stats
poke <- pokemon[, c(5:11)]
# All stats, legendary
poke <- pokemon %>% filter(Legendary == 'True')
poke <- poke[, c(6:11)]
poke2 <- pokemon %>% filter(Legendary == 'True')
# Stats without the Total
poke <- pokemon[, c(6:11)]
For further analysis, there are 3 choices of subsets. First is the one including all statistics of a given Pokemon - Total, HP, Attack, Defense, Special Attack, Special Defense and Speed. Generation is ignored, since it is a nominal variable.
Second choice are only the Legendary Pokemon - a small (65) subset of all Pokemon, containing the rarest and most powerful Pokemon.
Third choice it again all Pokemon, but without the Total statistic, since it is produced but all other statistics already included.
We can take a look at the overall distributions by plotting the scatter plots for all variables.
plot(poke, col = cBlue, pch = 19)
All variables seem to be nicely distributed, the doesn’t seem to be any strong correlations. The HP statistic seems to be the most independent. Some outliers can be seen, but they aren’t very strong.
Now, we can take a closer look at the distributions of some of the variables.
ggplot(pokemon) + geom_bar(aes(x = Generation), fill = cYellow)
The number of Pokemon per generation seems to follow a “tick-tock” distribution - each even generation has a significantly lower number of new Pokemon.
ggplot(pokemon) + geom_bar(aes(x = Legendary), fill = cYellow)
There is a large difference between the number of legendary Pokemon (65) and normal ones (735)
ggplot(pokemon) + geom_bar(aes(x = Type.1), fill = cBlue)
The number of Pokemon in each category in not distributed evenly. There are a lot more of the water, normal and bug types of Pokemon. These might be a consequence of a design decision, since a lot of the time in-game take place in forests and similar locations. Also of note is the flying type - almost no Pokemon has it as it’s first type.
ggplot(pokemon) + geom_bar(aes(x = Type.2), fill = cBlue)
The majority of Pokemon don’t have a second type - but when they do, it’s typically flying.
We can take a look how is the Total variable distributed. It gives a good estimation on how powerful a certain Pokemon can be.
ggplot(pokemon) + geom_density(aes(x = Total), fill = cYellow, colour = cYellow)
The distribution has two strong peaks - one at around 300, and the other at 500. These is because Pokemon has a mechanism of evolution. Many Pokemon (but not all) have at least two forms. One that is weaker and is encountered earlier in the game (e.g. a bug), and second one that is a result of a evolution that a Pokemon undergoes when it reaches a certain level of experience (e.g. a butterfly). To give a sense of progress, the evolved formes are typically much stronger.
ggplot(pokemon) + geom_density(aes(x = Total), fill = cBlue, colour = cBlue) + facet_grid(Legendary ~ .)
When comparing the Total between Legendary and non-legendary Pokemon, it is clear that most high spots are occupied by the Legendary Pokemon. They do seem to also follow this interesting two-peak distribution, even though almost none of the evolve.
ggplot(pokemon) + geom_density(aes(x = Total), fill = cYellow, colour = cYellow) + facet_grid(Generation ~ .)
We can now test, if there exist any “Power Creep” in Pokemon. “Power creep” is a idea that each new version or generation is stronger than the previous one, typically to help sell new editions. This is a significant problem in many card games. When comparing the Total between generations the “power creep” seems to be nonexistent. It has to be noted that the two-peak distribution seems to be getting stronger with each generation.
ggplot(pokemon) + geom_density(aes(x = Total), fill = cBlue, colour = cBlue) + facet_grid(Generation ~ Legendary)
Comparing Total over generations and legendary vs non-legendary Pokemon doesn’t bring new insights - the distributions seems to be quite consistent.
We can take a look how are the other variables distributed.
p1 <- ggplot(pokemon) + geom_density(aes(x = HP), fill = cYellow, colour = cYellow)
p2 <- ggplot(pokemon) + geom_density(aes(x = Attack), fill = cYellow, colour = cYellow)
p3 <- ggplot(pokemon) + geom_density(aes(x = Defense), fill = cYellow, colour = cYellow)
p4 <- ggplot(pokemon) + geom_density(aes(x = Special.Attack), fill = cYellow, colour = cYellow)
p5 <- ggplot(pokemon) + geom_density(aes(x = Special.Defense), fill = cYellow, colour = cYellow)
p6 <- ggplot(pokemon) + geom_density(aes(x = Speed), fill = cYellow, colour = cYellow)
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
All stats follow a roughly normal distribution, with means around 55. Some outliers in each category can be seen.
To see how the variables interact with each other, we can compute and display the correlation matrix.
pokemonCorrelation <- cor(pokemon[, c(6:11)], method="pearson")
print(pokemonCorrelation, digits=2)
HP Attack Defense Special.Attack Special.Defense Speed
HP 1.00 0.42 0.240 0.36 0.38 0.176
Attack 0.42 1.00 0.439 0.40 0.26 0.381
Defense 0.24 0.44 1.000 0.22 0.51 0.015
Special.Attack 0.36 0.40 0.224 1.00 0.51 0.473
Special.Defense 0.38 0.26 0.511 0.51 1.00 0.259
Speed 0.18 0.38 0.015 0.47 0.26 1.000
corrplot(pokemonCorrelation, order ="alphabet", method = 'number')
There are no strong correlations. Some mild ones can be observed between Defense and Special Defense, and Special Defense and Special Attack. Also note a absolute lack of correlation between Defense and Speed.
After exploring the dataset, we can move to multidimensional scaling. The general idea it that although we have 6 (or 7 when counting the Total) variables, there are similarities between them that would allow us to reduce the number of variables to 2, all without loosing too much information.
First, we have to compute the distance matrix for our data subset, then feed it into the cmdscale() function from the stats package. As a result we’ll be able to reduce the number of variables to 2.
poke.dist<-dist(poke)
a <- as.matrix(poke.dist)[1:10, 1:10]
kable(a)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 0.00000 | 35.56684 | 84.52810 | 129.61867 | 27.03701 | 43.87482 | 92.28218 | 138.36184 | 138.98201 | 22.09072 |
| 35.56684 | 0.00000 | 48.98979 | 95.95832 | 47.60252 | 25.65151 | 59.15235 | 106.66302 | 106.67240 | 43.60046 |
| 84.52810 | 48.98979 | 0.00000 | 52.99057 | 92.22798 | 55.29919 | 27.18455 | 67.94851 | 67.96323 | 89.11229 |
| 129.61867 | 95.95832 | 52.99057 | 0.00000 | 139.11865 | 103.89418 | 63.86705 | 52.31635 | 61.95966 | 130.58714 |
| 27.03701 | 47.60252 | 92.22798 | 139.11865 | 0.00000 | 39.74921 | 92.84934 | 139.92498 | 143.87842 | 36.12478 |
| 43.87482 | 25.65151 | 55.29919 | 103.89418 | 39.74921 | 0.00000 | 53.30103 | 104.23531 | 107.42905 | 52.64029 |
| 92.28218 | 59.15235 | 27.18455 | 63.86705 | 92.84934 | 53.30103 | 0.00000 | 60.38212 | 61.64414 | 98.95454 |
| 138.36184 | 106.66302 | 67.94851 | 52.31635 | 139.92498 | 104.23531 | 60.38212 | 0.00000 | 59.21149 | 141.72509 |
| 138.98201 | 106.67240 | 67.96323 | 61.95966 | 143.87842 | 107.42905 | 61.64414 | 59.21149 | 0.00000 | 148.96980 |
| 22.09072 | 43.60046 | 89.11229 | 130.58714 | 36.12478 | 52.64029 | 98.95454 | 141.72509 | 148.96980 | 0.00000 |
poke.mds.1 <- cmdscale(poke.dist, k=2)
b <- summary(poke.mds.1)
kable(b)
| V1 | V2 | |
|---|---|---|
| Min. :-103.496 | Min. :-189.23 | |
| 1st Qu.: -44.222 | 1st Qu.: -16.69 | |
| Median : 3.725 | Median : 3.31 | |
| Mean : 0.000 | Mean : 0.00 | |
| 3rd Qu.: 31.978 | 3rd Qu.: 19.68 | |
| Max. : 151.413 | Max. : 127.70 |
plot(poke.mds.1)
The results look promising - most of the data is located along the x axis, with some outliers on the top, bottom, and the left side. There are two major clusters in the data - maybe they have something to do with the dual-peak distribution discovered earlier? The data also follows a kind of “cone” distribution, having a lower variance for x < 0 and higher for x > 0.
We can try to gain more insights by displaying the names of the Pokemon and comparing the results to a outside source, the Pokemon Wikipedia.
plot(poke.mds.1, type = 'n')
text(poke.mds.1, labels = pokemon$Name, cex=0.5, adj = 0.5)
The plot looks crowded, but we can gain much insight from it. First, we can take a look at the outlier at the bottom, in the center - Shuckle. According to Bulbapedia “Shuckle has the most extreme stat distribution of any Pokémon, being either the best or in the bottom three of every base stat”. This is a great information, because it means the MDS algorithm was able to identify this outlier.
Other interesting Pokemon are the Pichu, that can be found on the left-hand side in the center, that is one of the weakest Pokemon, and Mewtwo Mega, that can be found on the right-hand side, also in the center, that is one of the most powerful Pokemon.
To see the effects of different variables, we can plot their surfaces on the scatter plots. We’ll be using the pco() function that is a wrapper for the cmdscale() to enable better ploting.
poke <- pokemon[, c(5:11)]
poke.mds.2<-pco(poke.dist, k=2)
par(mfrow=c(2,2))
plot(poke.mds.2)
title(main = "PCO")
plot(poke.mds.2)
title(main = "Total")
surf(poke.mds.2, poke$Total)
plot(poke.mds.2)
title(main = "HP")
surf(poke.mds.2, poke$HP)
plot(poke.mds.2)
title(main = "Attack")
surf(poke.mds.2, poke$Attack)
plot(poke.mds.2)
title(main = "Defense")
surf(poke.mds.2, poke$Defense)
plot(poke.mds.2)
title(main = "Special Defense")
surf(poke.mds.2, poke$Special.Defense)
plot(poke.mds.2)
title(main = "Special Attack")
surf(poke.mds.2, poke$Special.Attack)
plot(poke.mds.2)
title(main = "Speed")
surf(poke.mds.2, poke$Speed)
par(mfrow=c(1,1))
We can see many interesting interactions. It seems that going along the X axis to the right directly increases the Total and Attack statistics. The HP also increases, but only if we stay near the 0 on the Y axis. When analyzing the Y axis, it seems that going down increases the Defense and Special Defense, while going up and to the left increases the Special Attack and Speed.
From this, we can see that there is a kind of a trade-off system. The X axis functions as a “power lever”, generally the further right, the more powerful the Pokemon gets.
All powerful Pokemon have higher attack, and higher Total. As for the other stats, there seems to be 3 paths - the can stay at Y ~ 0, and get more HP, they can get below Y < 0, and get Defense and Special Defense, or they can go Y > 0, and get Special Attack and Speed.
It seems that for the most powerful Pokemon, they can be either Aggressive (Y > 0), Defensive (Y < 0), or well-rounded (Y ~ 0). It also looks like the differences are far smaller for the weaker Pokemon.
To investigate this further, we can do the same analysis only for the legendary Pokemon.
poke <- pokemon %>% filter(Legendary == 'True')
poke <- poke[, c(5:11)]
poke.dist<-dist(poke)
poke.mds.2<-pco(poke.dist, k=2)
par(mfrow=c(2,2))
plot(poke.mds.2)
title(main = "PCO")
plot(poke.mds.2)
title(main = "Total")
surf(poke.mds.2, poke$Total)
plot(poke.mds.2)
title(main = "HP")
surf(poke.mds.2, poke$HP)
plot(poke.mds.2)
title(main = "Attack")
surf(poke.mds.2, poke$Attack)
plot(poke.mds.2)
title(main = "Defense")
surf(poke.mds.2, poke$Defense)
plot(poke.mds.2)
title(main = "Special Defense")
surf(poke.mds.2, poke$Special.Defense)
plot(poke.mds.2)
title(main = "Special Attack")
surf(poke.mds.2, poke$Special.Attack)
plot(poke.mds.2)
title(main = "Speed")
surf(poke.mds.2, poke$Speed)
par(mfrow=c(1,1))
poke <- pokemon[, c(6:11)]
The results are similar, but there are some differences. As previously, the X axis serves as a kind of “power level”. On the Y axis, it looks like that for Y > 0 HP, Defense and Special Defense increase, and for the Y < 0, the Attack and Special Attack increase. Note that while Attack and Special Attack seem to be also correlated with the power level of the creature, this in not true for Defense and Special Defense. Speed seems to be more individual, although favoring Y < 0.
We can test this interactions between variables, by performing the MDS on a transposed subset of our dataset.
poke.dist.t<-dist(t(pokemon[, c(5:11)]))
poke.mds.3<-cmdscale(poke.dist.t, k=2)
a <- summary(poke.mds.3)
kable(a)
| V1 | V2 | |
|---|---|---|
| Min. :-1660 | Min. :-621.05 | |
| 1st Qu.:-1589 | 1st Qu.:-106.92 | |
| Median :-1490 | Median : -42.91 | |
| Mean : 0 | Mean : 0.00 | |
| 3rd Qu.:-1395 | 3rd Qu.: 182.13 | |
| Max. : 9118 | Max. : 513.54 |
plot(poke.mds.3, type = 'n')
text(poke.mds.3, rownames(poke.mds.3), cex=0.8, adj = 0.5)
The X axis was taken wholly by the Total variable. On the Y axis, we can see a partial confirmation of the conclusions from above - indeed the Speed and Special Attack occupy the Y > 0, and Defense the Y < 0. Only Special Defense seems to be closer to Attack and HP.
To see the differences better, we can perform the analysis again, this time without the Total variable.
poke.dist.t.2<-dist(t(pokemon[, c(6:11)]))
poke.mds.4<-cmdscale(poke.dist.t.2, k=2)
a <- summary(poke.mds.4)
kable(a)
| V1 | V2 | |
|---|---|---|
| Min. :-529.41 | Min. :-599.9 | |
| 1st Qu.:-258.89 | 1st Qu.: -78.4 | |
| Median : 67.26 | Median : 33.9 | |
| Mean : 0.00 | Mean : 0.0 | |
| 3rd Qu.: 125.48 | 3rd Qu.: 181.0 | |
| Max. : 617.60 | Max. : 418.0 |
plot(poke.mds.4, type = 'n')
text(poke.mds.4, rownames(poke.mds.4), cex=0.8, adj = 0.5)
Removing the total variable gives a better picture - on this graph the left-top corner represents the Aggressive group, the right-top the Defensive, and the bottom-middle the Well-rounded group.
To test how well the MDS algorithm was able to approximate the original dataset, we can compare it to a stress of a MDS algorithm run on a random dissimilarity matrix.
poke <- pokemon[, c(6:11)]
poke.dist <- dist(t(poke))
poke.mds.4 <- mds(poke.dist, ndim=2, type="ordinal")
poke.mds.4
Call:
mds(delta = poke.dist, ndim = 2, type = "ordinal")
Model: Symmetric SMACOF
Number of objects: 6
Stress-1 value: 0.065
Number of iterations: 35
stress.random.matrix <- randomstress(n=800, ndim=2, nrep = 1)
poke.mds.4$stress/ mean(stress.random.matrix)
[1] 0.1103179
The resulting coefficient is 0.11, which is a fair result. We could improve it by adding the 3rd dimension, but that makes plots much less clear.
After the MDS, we can move to a alternative, the PCA. The prcomp() function takes care of the centering and scaling the data.
poke <- pokemon[, c(6:11)]
poke.pca.1<-prcomp(poke, center=TRUE, scale.=TRUE)
a <- poke.pca.1$rotation
kable(a, digits = 2)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| HP | 0.39 | 0.08 | -0.47 | 0.72 | -0.22 | 0.23 |
| Attack | 0.44 | -0.01 | -0.59 | -0.41 | 0.19 | -0.50 |
| Defense | 0.36 | 0.63 | 0.07 | -0.42 | -0.06 | 0.54 |
| Special.Attack | 0.46 | -0.31 | 0.31 | 0.15 | 0.74 | 0.20 |
| Special.Defense | 0.45 | 0.24 | 0.57 | 0.19 | -0.30 | -0.55 |
| Speed | 0.34 | -0.67 | 0.08 | -0.30 | -0.53 | 0.26 |
The results are similar - the PC1 increases with each statistic (the power level). The PC2 is more interesting - It is mostly influenced by the Defense (+), and Speed (-). The Special Attack (-) and Special Defense (+) are also significant, but their effect is about half as strong.
summary(poke.pca.1)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
Cumulative Proportion 0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
plot(poke.pca.1, type = "l")
Using 2 variables we can explain 63% of the variance. It is quite good, but could be improved. Adding a third dimension increases the ratio to 77%.
fviz_pca_ind(poke.pca.1, col.ind="cos2", geom = "point", gradient.cols = c(cYellow, cBlue))
Plotting the observations show us that they are quite similar to MDS.
fviz_pca_var(poke.pca.1, col.var="black")
We can plot the influence of each variable in a nice graph. As previously discussed, the Defense and Speed have the most significant effects.
Using the princomp() function we can calculate the PCA using the eigen on the correlation matrix.
poke.pca.2<-princomp(poke)
loadings(poke.pca.2)
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
HP 0.301 0.802 0.387 0.334
Attack 0.493 0.730 -0.193 -0.424
Defense 0.381 0.695 -0.366 0.485
Special.Attack 0.509 -0.383 -0.385 0.101 -0.641 0.158
Special.Defense 0.394 0.174 -0.541 0.375 -0.616
Speed 0.327 -0.576 0.144 -0.459 0.510 0.263
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.167 0.167 0.167 0.167 0.167 0.167
Cumulative Var 0.167 0.333 0.500 0.667 0.833 1.000
As expected, the results are very similar.
plot(poke.pca.2)
fviz_pca_var(poke.pca.2, col.var="black")
To have more significant and easier to interpret results, we can use the rotated PCA approach.
poke.pca.3 <- principal(poke, nfactors=3, rotate="varimax")
poke.pca.3
Principal Components Analysis
Call: principal(r = poke, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
HP 0.21 0.15 0.73 0.59 0.41 1.3
Attack 0.14 0.24 0.85 0.80 0.20 1.2
Defense 0.79 -0.16 0.38 0.79 0.21 1.5
Special.Attack 0.38 0.74 0.21 0.74 0.26 1.7
Special.Defense 0.85 0.37 0.08 0.86 0.14 1.4
Speed -0.08 0.87 0.20 0.80 0.20 1.1
RC1 RC2 RC3
SS loadings 1.55 1.55 1.49
Proportion Var 0.26 0.26 0.25
Cumulative Var 0.26 0.52 0.76
Proportion Explained 0.34 0.34 0.32
Cumulative Proportion 0.34 0.68 1.00
Mean item complexity = 1.4
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.12
with the empirical chi square 318.75 with prob < NA
Fit based upon off diagonal values = 0.9
print(loadings(poke.pca.3), digits=2, cutoff=0.4, sort=TRUE)
Loadings:
RC1 RC2 RC3
Defense 0.79
Special.Defense 0.85
Special.Attack 0.74
Speed 0.87
HP 0.73
Attack 0.85
RC1 RC2 RC3
SS loadings 1.55 1.55 1.49
Proportion Var 0.26 0.26 0.25
Cumulative Var 0.26 0.52 0.76
These results are much more interesting - without the cut-off we can explain 68% of the variance. Cuting the influence at 0.4 lowers the ratio to 52%, but let’s us see the results better. It seems that we have a variable for each of our previously discovered groups - RC1 for Defensive Pokemon, RC2 for Aggressive, and RC3 for Well-rounded.
biplot(poke.pca.3, hist.col = cYellow, smoother = TRUE)
The biplot let’s us see the combinations of these 3 variables. However, it isn’t very clear, so no new insights can be gained.
After performing the dimensionality reduction in MDS and PCA, we can cluster the results in K-means and PAM.
First step is to find a suitable number of clusters. For this we can run the fviz_nbclust() function from the factoextra package.
# Prepare data
poke.dist<-dist(poke)
poke.mds.1 <- cmdscale(poke.dist, k=2)
poke.mds.1.center <- center_scale(poke.mds.1)
poke.mds.1.center <- poke.mds.1
fviz_nbclust(as.data.frame(poke.mds.1), FUNcluster=pam)
The optimal number of clusters is 2. But, because we want to see the groups identified earlier, we can settle for 5 clusters, which have a bit lower silhouette.
We can now proceed to clustering using K-means. We’ll use the KMenas_rcpp() function from ClusteR library. The algorithm will be run on the data from MDS and PCA.
poke.km <- KMeans_rcpp(poke.mds.1.center, clusters=5, num_init=30, max_iters = 10000)
poke.km.pca <- KMeans_rcpp(poke.pca.2$scores[, 1:2], clusters=5, num_init=30, max_iters = 10000)
x1 <- ggplot(as.data.frame(poke.mds.1.center)) + geom_point(aes(x = V1, y = V2, colour = poke.km$clusters)) + scale_colour_gradientn(colours=wes_palette(n=3, name="BottleRocket2"), guide = FALSE)
x2 <- ggplot(as.data.frame(poke.pca.2$scores[, 1:2])) + geom_point(aes(x = Comp.1, y = Comp.2, colour = poke.km.pca$clusters)) + scale_colour_gradientn(colours=wes_palette(n=3, name="BottleRocket2"), guide = FALSE)
grid.arrange(x1, x2, nrow=2)
The data is nicely clustered into 5 groups: ‘weak’, ‘average’, ‘aggressive’, ‘defensive’ and ‘legendary’ Pokemon. The results from MDS and PCA are virtually identical, only the Y axis is reversed.
To display the silhouette plot, we can do the clustering using eclust().
poke.km.2 <- eclust(as.data.frame(poke.mds.1), "kmeans", k = 5)
poke.km.2.pca <- eclust(as.data.frame(poke.pca.2$scores[, 1:2]), "kmeans", k = 5)
fviz_silhouette(poke.km.2)
fviz_silhouette(poke.km.2.pca)
Almost all points have a positive silhouette, but the overall average silhouette is only decent. This is because the differences between points are quite small.
Additionally, we can run the PAM algorithm on the MDS and PCA points. Since this is a small dataset, the results are very similar to K-means.
poke.pam <- eclust(as.data.frame(poke.mds.1), "pam", k = 5)
poke.pam.2 <- eclust(as.data.frame(poke.pca.2$scores[, 1:2]), "pam", k = 5)
fviz_silhouette(poke.pam)
fviz_silhouette(poke.pam.2)
After this analysis I can conclude that there are rules governing the assignment of statistics to different Pokemon.
Visual analysis of the population showed that there is a strong two-peak distribution in the Total statistic, corresponding to pre and post evolution Pokemon. It was shown that the highest values are achieved by the small subset of all Pokemon, the legendary Pokemon. The analysis shown no signs of “power creep” between generations and showed that all statistics other than Total follow a normal distribution around 55.
Multidimensional scaling analysis was able to correctly identify significant outliers, that were then confirmed with a outside source (Wikipedia).
MDS analysis, later confirmed by PCA, showed that the more powerful Pokemon fall into 3 different categories - Aggressive (high Speed and Special Attack), Defensive (high Defense and Special Defense) and Well-rounded (high HP and Attack).
The K-means and PAM analysis showed that the dataset can be divided into 5 groups - ‘weak’, ‘medium - average’, ‘medium - aggressive’, ‘medium - defensive’ and ‘legendary’ Pokemon.