Introduction

Goal of the paper

Since I played the Pokemon Red 10 years ago on my Gameboy Color, I wondered about the statistics that define different Pokemon. Are the fair? Are some Pokemon just better than others, or maybe there are some groups, or specializations? When I encountered the Pokemon data set on Kaggle, I knew this was a chance to finally answer this questions.
The goal of this paper is to test what distributions the Pokemon stat-lines follow. I want to also see if there are any significant groups within the population and if there is some logic behind the assignment of statistics to Pokemon.

Data preparation

Libraries

library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(corrplot) 
library(labdsv)
library(smacof)
library(psych)
library(pca3d)
library(NbClust)
library(ClusterR)
library(wesanderson)
library(factoextra)
library(clustertend)
library(knitr)
options(scipen=999)

# Define nice colors
cYellow = '#FADA5E'
cBlue = '#378CC7'

Loading the data from a csv

The source is a fantastic Kaggle data set containing the statistic for every Pokemon released.

pokemon <- read.csv('data/Pokemon.csv')
pokemon <- rename(pokemon, 'Special.Attack' = 'Sp..Atk', 'Special.Defense' = 'Sp..Def')

Data exploration

First look at the data

a <- pokemon %>% filter(pokemon$Name == 'Pikachu')
kable(a)

X.	Name	Type.1	Type.2	Total	HP	Attack	Defense	Special.Attack	Special.Defense	Speed	Generation	Legendary
25	Pikachu	Electric		320	35	55	40	50	50	90	1	False

dim(pokemon)

[1] 800  13

Pikachu is present and has the correct stat-line. Overall, this data set has 800 observations and 11 variables.

Pikachu, the most iconic Pokemon. Introduced in the 1st generation. Source: bulbapedia.bulbagarden.net

str(pokemon)

'data.frame':   800 obs. of  13 variables:
 $ X.             : int  1 2 3 3 4 5 6 6 6 7 ...
 $ Name           : Factor w/ 800 levels "Abomasnow","AbomasnowMega Abomasnow",..: 81 330 746 747 103 104 100 101 102 666 ...
 $ Type.1         : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 10 7 7 7 7 7 18 ...
 $ Type.2         : Factor w/ 19 levels "","Bug","Dark",..: 15 15 15 15 1 1 9 4 9 1 ...
 $ Total          : int  318 405 525 625 309 405 534 634 634 314 ...
 $ HP             : int  45 60 80 80 39 58 78 78 78 44 ...
 $ Attack         : int  49 62 82 100 52 64 84 130 104 48 ...
 $ Defense        : int  49 63 83 123 43 58 78 111 78 65 ...
 $ Special.Attack : int  65 80 100 122 60 80 109 130 159 50 ...
 $ Special.Defense: int  65 80 100 120 50 65 85 85 115 64 ...
 $ Speed          : int  45 60 80 80 65 80 100 100 100 43 ...
 $ Generation     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Legendary      : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...

a <- summary(pokemon)[,c(1:4,12,13)]
b <- summary(pokemon)[,c(5:11)]
kable(a)

X.	Name	Type.1	Type.2	Generation	Legendary
Min. : 1.0	Abomasnow : 1	Water :112	:386	Min. :1.000	False:735
1st Qu.:184.8	AbomasnowMega Abomasnow: 1	Normal : 98	Flying : 97	1st Qu.:2.000	True : 65
Median :364.5	Abra : 1	Grass : 70	Ground : 35	Median :3.000	NA
Mean :362.8	Absol : 1	Bug : 69	Poison : 34	Mean :3.324	NA
3rd Qu.:539.2	AbsolMega Absol : 1	Psychic: 57	Psychic : 33	3rd Qu.:5.000	NA
Max. :721.0	Accelgor : 1	Fire : 52	Fighting: 26	Max. :6.000	NA
NA	(Other) :794	(Other):342	(Other) :189	NA	NA

kable(b)

Total	HP	Attack	Defense	Special.Attack	Special.Defense	Speed
Min. :180.0	Min. : 1.00	Min. : 5	Min. : 5.00	Min. : 10.00	Min. : 20.0	Min. : 5.00
1st Qu.:330.0	1st Qu.: 50.00	1st Qu.: 55	1st Qu.: 50.00	1st Qu.: 49.75	1st Qu.: 50.0	1st Qu.: 45.00
Median :450.0	Median : 65.00	Median : 75	Median : 70.00	Median : 65.00	Median : 70.0	Median : 65.00
Mean :435.1	Mean : 69.26	Mean : 79	Mean : 73.84	Mean : 72.82	Mean : 71.9	Mean : 68.28
3rd Qu.:515.0	3rd Qu.: 80.00	3rd Qu.:100	3rd Qu.: 90.00	3rd Qu.: 95.00	3rd Qu.: 90.0	3rd Qu.: 90.00
Max. :780.0	Max. :255.00	Max. :190	Max. :230.00	Max. :194.00	Max. :230.0	Max. :180.00
NA	NA	NA	NA	NA	NA	NA

The data set has a nice distribution of variables - 2 categorical, 1 binary and 8 interval. The X variable is a Pokemon Id. Note that it is not unique, because later generations added new evolutions for existing Pokemon.
Type 1 designates the primary type of the Pokemon - it influences it’s strengths and weaknesses (e.g. fire Pokemon are weak against water Pokemon) and it’s overall design. Some Pokemon also have a second type.
Generation is the number of generation this Pokemon it’s from. First generation originated in 1996, and the 6th one in 2013. This data set is slightly old, as it lack the 7 generation from 2016.
Next are the statistics for each Pokemon.The Total variable is a simple sum of all statistics. HP stands for Hit Points, the Pokemon health. Attack signifies how much damage can it do, and it’s compared to the defense of the enemy’s Pokemon. Special attack is similar to normal attack, but it’s compared to the Special Defense. Speed indicates which Pokemon attacks first in a given round. Aside from these basic stats, each Pokemon has it’s moves, abilities and other variables not included in this data set. That being said, the basic stat-line and it’s distribution has a great impact on how powerful a given Pokemon is.

Subsetting the data set

# All stats
poke <- pokemon[, c(5:11)]

# All stats, legendary
poke <- pokemon %>% filter(Legendary == 'True')
poke <- poke[, c(6:11)]
poke2 <- pokemon %>% filter(Legendary == 'True')

# Stats without the Total
poke <- pokemon[, c(6:11)]

For further analysis, there are 3 choices of subsets. First is the one including all statistics of a given Pokemon - Total, HP, Attack, Defense, Special Attack, Special Defense and Speed. Generation is ignored, since it is a nominal variable.

Second choice are only the Legendary Pokemon - a small (65) subset of all Pokemon, containing the rarest and most powerful Pokemon.

Third choice it again all Pokemon, but without the Total statistic, since it is produced but all other statistics already included.

Visual analysis

We can take a look at the overall distributions by plotting the scatter plots for all variables.

plot(poke, col = cBlue, pch = 19)

All variables seem to be nicely distributed, the doesn’t seem to be any strong correlations. The HP statistic seems to be the most independent. Some outliers can be seen, but they aren’t very strong.

Now, we can take a closer look at the distributions of some of the variables.

ggplot(pokemon) + geom_bar(aes(x = Generation), fill = cYellow)

The number of Pokemon per generation seems to follow a “tick-tock” distribution - each even generation has a significantly lower number of new Pokemon.

ggplot(pokemon) + geom_bar(aes(x = Legendary), fill = cYellow)

There is a large difference between the number of legendary Pokemon (65) and normal ones (735)

ggplot(pokemon) + geom_bar(aes(x = Type.1), fill = cBlue)

The number of Pokemon in each category in not distributed evenly. There are a lot more of the water, normal and bug types of Pokemon. These might be a consequence of a design decision, since a lot of the time in-game take place in forests and similar locations. Also of note is the flying type - almost no Pokemon has it as it’s first type.

ggplot(pokemon) + geom_bar(aes(x = Type.2), fill = cBlue)

The majority of Pokemon don’t have a second type - but when they do, it’s typically flying.

We can take a look how is the Total variable distributed. It gives a good estimation on how powerful a certain Pokemon can be.

ggplot(pokemon) + geom_density(aes(x = Total), fill = cYellow, colour = cYellow)

The distribution has two strong peaks - one at around 300, and the other at 500. These is because Pokemon has a mechanism of evolution. Many Pokemon (but not all) have at least two forms. One that is weaker and is encountered earlier in the game (e.g. a bug), and second one that is a result of a evolution that a Pokemon undergoes when it reaches a certain level of experience (e.g. a butterfly). To give a sense of progress, the evolved formes are typically much stronger.

ggplot(pokemon) + geom_density(aes(x = Total), fill = cBlue, colour = cBlue) + facet_grid(Legendary ~ .)

When comparing the Total between Legendary and non-legendary Pokemon, it is clear that most high spots are occupied by the Legendary Pokemon. They do seem to also follow this interesting two-peak distribution, even though almost none of the evolve.

ggplot(pokemon) + geom_density(aes(x = Total), fill = cYellow, colour = cYellow) + facet_grid(Generation ~ .)

We can now test, if there exist any “Power Creep” in Pokemon. “Power creep” is a idea that each new version or generation is stronger than the previous one, typically to help sell new editions. This is a significant problem in many card games. When comparing the Total between generations the “power creep” seems to be nonexistent. It has to be noted that the two-peak distribution seems to be getting stronger with each generation.

ggplot(pokemon) + geom_density(aes(x = Total), fill = cBlue, colour = cBlue) + facet_grid(Generation ~ Legendary)

Comparing Total over generations and legendary vs non-legendary Pokemon doesn’t bring new insights - the distributions seems to be quite consistent.

We can take a look how are the other variables distributed.

p1 <- ggplot(pokemon) + geom_density(aes(x = HP), fill = cYellow, colour = cYellow)
p2 <- ggplot(pokemon) + geom_density(aes(x = Attack), fill = cYellow, colour = cYellow)
p3 <- ggplot(pokemon) + geom_density(aes(x = Defense), fill = cYellow, colour = cYellow)
p4 <- ggplot(pokemon) + geom_density(aes(x = Special.Attack), fill = cYellow, colour = cYellow)
p5 <- ggplot(pokemon) + geom_density(aes(x = Special.Defense), fill = cYellow, colour = cYellow)
p6 <- ggplot(pokemon) + geom_density(aes(x = Speed), fill = cYellow, colour = cYellow)
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)

All stats follow a roughly normal distribution, with means around 55. Some outliers in each category can be seen.

Correlation plot

To see how the variables interact with each other, we can compute and display the correlation matrix.

pokemonCorrelation <- cor(pokemon[, c(6:11)], method="pearson") 
print(pokemonCorrelation, digits=2)

                  HP Attack Defense Special.Attack Special.Defense Speed
HP              1.00   0.42   0.240           0.36            0.38 0.176
Attack          0.42   1.00   0.439           0.40            0.26 0.381
Defense         0.24   0.44   1.000           0.22            0.51 0.015
Special.Attack  0.36   0.40   0.224           1.00            0.51 0.473
Special.Defense 0.38   0.26   0.511           0.51            1.00 0.259
Speed           0.18   0.38   0.015           0.47            0.26 1.000

corrplot(pokemonCorrelation, order ="alphabet", method = 'number')

There are no strong correlations. Some mild ones can be observed between Defense and Special Defense, and Special Defense and Special Attack. Also note a absolute lack of correlation between Defense and Speed.

Multidimensional Scaling

Classical multidimensional scaling

After exploring the dataset, we can move to multidimensional scaling. The general idea it that although we have 6 (or 7 when counting the Total) variables, there are similarities between them that would allow us to reduce the number of variables to 2, all without loosing too much information.

First, we have to compute the distance matrix for our data subset, then feed it into the cmdscale() function from the stats package. As a result we’ll be able to reduce the number of variables to 2.

poke.dist<-dist(poke) 
a <- as.matrix(poke.dist)[1:10, 1:10]
kable(a)

1	2	3	4	5	6	7	8	9	10
0.00000	35.56684	84.52810	129.61867	27.03701	43.87482	92.28218	138.36184	138.98201	22.09072
35.56684	0.00000	48.98979	95.95832	47.60252	25.65151	59.15235	106.66302	106.67240	43.60046
84.52810	48.98979	0.00000	52.99057	92.22798	55.29919	27.18455	67.94851	67.96323	89.11229
129.61867	95.95832	52.99057	0.00000	139.11865	103.89418	63.86705	52.31635	61.95966	130.58714
27.03701	47.60252	92.22798	139.11865	0.00000	39.74921	92.84934	139.92498	143.87842	36.12478
43.87482	25.65151	55.29919	103.89418	39.74921	0.00000	53.30103	104.23531	107.42905	52.64029
92.28218	59.15235	27.18455	63.86705	92.84934	53.30103	0.00000	60.38212	61.64414	98.95454
138.36184	106.66302	67.94851	52.31635	139.92498	104.23531	60.38212	0.00000	59.21149	141.72509
138.98201	106.67240	67.96323	61.95966	143.87842	107.42905	61.64414	59.21149	0.00000	148.96980
22.09072	43.60046	89.11229	130.58714	36.12478	52.64029	98.95454	141.72509	148.96980	0.00000

poke.mds.1 <- cmdscale(poke.dist, k=2) 
b <- summary(poke.mds.1)
kable(b)

	V1	V2
	Min. :-103.496	Min. :-189.23
	1st Qu.: -44.222	1st Qu.: -16.69
	Median : 3.725	Median : 3.31
	Mean : 0.000	Mean : 0.00
	3rd Qu.: 31.978	3rd Qu.: 19.68
	Max. : 151.413	Max. : 127.70

plot(poke.mds.1)

The results look promising - most of the data is located along the x axis, with some outliers on the top, bottom, and the left side. There are two major clusters in the data - maybe they have something to do with the dual-peak distribution discovered earlier? The data also follows a kind of “cone” distribution, having a lower variance for x < 0 and higher for x > 0.

Closer look at the clusters and outliers

We can try to gain more insights by displaying the names of the Pokemon and comparing the results to a outside source, the Pokemon Wikipedia.

plot(poke.mds.1, type = 'n')
text(poke.mds.1, labels = pokemon$Name, cex=0.5, adj = 0.5)

The plot looks crowded, but we can gain much insight from it. First, we can take a look at the outlier at the bottom, in the center - Shuckle. According to Bulbapedia “Shuckle has the most extreme stat distribution of any Pokémon, being either the best or in the bottom three of every base stat”. This is a great information, because it means the MDS algorithm was able to identify this outlier.

Shuckle, loosely based on the real-life endoliths. Introduced in the 2nd generation. Source: bulbapedia.bulbagarden.net

Other interesting Pokemon are the Pichu, that can be found on the left-hand side in the center, that is one of the weakest Pokemon, and Mewtwo Mega, that can be found on the right-hand side, also in the center, that is one of the most powerful Pokemon.

Identifying the influence of different variables on the MDS

To see the effects of different variables, we can plot their surfaces on the scatter plots. We’ll be using the pco() function that is a wrapper for the cmdscale() to enable better ploting.

poke <- pokemon[, c(5:11)]
poke.mds.2<-pco(poke.dist, k=2) 
par(mfrow=c(2,2))
plot(poke.mds.2)
title(main = "PCO")

plot(poke.mds.2)
title(main = "Total")
surf(poke.mds.2, poke$Total)

plot(poke.mds.2)
title(main = "HP")
surf(poke.mds.2, poke$HP)

plot(poke.mds.2)
title(main = "Attack")
surf(poke.mds.2, poke$Attack)

plot(poke.mds.2)
title(main = "Defense")
surf(poke.mds.2, poke$Defense)

plot(poke.mds.2)
title(main = "Special Defense")
surf(poke.mds.2, poke$Special.Defense)

plot(poke.mds.2)
title(main = "Special Attack")
surf(poke.mds.2, poke$Special.Attack)

plot(poke.mds.2)
title(main = "Speed")
surf(poke.mds.2, poke$Speed)

par(mfrow=c(1,1))

We can see many interesting interactions. It seems that going along the X axis to the right directly increases the Total and Attack statistics. The HP also increases, but only if we stay near the 0 on the Y axis. When analyzing the Y axis, it seems that going down increases the Defense and Special Defense, while going up and to the left increases the Special Attack and Speed.

From this, we can see that there is a kind of a trade-off system. The X axis functions as a “power lever”, generally the further right, the more powerful the Pokemon gets.
All powerful Pokemon have higher attack, and higher Total. As for the other stats, there seems to be 3 paths - the can stay at Y ~ 0, and get more HP, they can get below Y < 0, and get Defense and Special Defense, or they can go Y > 0, and get Special Attack and Speed.

It seems that for the most powerful Pokemon, they can be either Aggressive (Y > 0), Defensive (Y < 0), or well-rounded (Y ~ 0). It also looks like the differences are far smaller for the weaker Pokemon.

To investigate this further, we can do the same analysis only for the legendary Pokemon.

poke <- pokemon %>% filter(Legendary == 'True')
poke <- poke[, c(5:11)]
poke.dist<-dist(poke) 
poke.mds.2<-pco(poke.dist, k=2) 
par(mfrow=c(2,2))
plot(poke.mds.2)
title(main = "PCO")

plot(poke.mds.2)
title(main = "Total")
surf(poke.mds.2, poke$Total)

plot(poke.mds.2)
title(main = "HP")
surf(poke.mds.2, poke$HP)

plot(poke.mds.2)
title(main = "Attack")
surf(poke.mds.2, poke$Attack)

plot(poke.mds.2)
title(main = "Defense")
surf(poke.mds.2, poke$Defense)

plot(poke.mds.2)
title(main = "Special Defense")
surf(poke.mds.2, poke$Special.Defense)

plot(poke.mds.2)
title(main = "Special Attack")
surf(poke.mds.2, poke$Special.Attack)

plot(poke.mds.2)
title(main = "Speed")
surf(poke.mds.2, poke$Speed)

par(mfrow=c(1,1))

poke <- pokemon[, c(6:11)]

The results are similar, but there are some differences. As previously, the X axis serves as a kind of “power level”. On the Y axis, it looks like that for Y > 0 HP, Defense and Special Defense increase, and for the Y < 0, the Attack and Special Attack increase. Note that while Attack and Special Attack seem to be also correlated with the power level of the creature, this in not true for Defense and Special Defense. Speed seems to be more individual, although favoring Y < 0.

MDS on variables

We can test this interactions between variables, by performing the MDS on a transposed subset of our dataset.

poke.dist.t<-dist(t(pokemon[, c(5:11)]))
poke.mds.3<-cmdscale(poke.dist.t, k=2) 
a <- summary(poke.mds.3)
kable(a)

	V1	V2
	Min. :-1660	Min. :-621.05
	1st Qu.:-1589	1st Qu.:-106.92
	Median :-1490	Median : -42.91
	Mean : 0	Mean : 0.00
	3rd Qu.:-1395	3rd Qu.: 182.13
	Max. : 9118	Max. : 513.54

plot(poke.mds.3, type = 'n') 
text(poke.mds.3, rownames(poke.mds.3), cex=0.8, adj = 0.5)

The X axis was taken wholly by the Total variable. On the Y axis, we can see a partial confirmation of the conclusions from above - indeed the Speed and Special Attack occupy the Y > 0, and Defense the Y < 0. Only Special Defense seems to be closer to Attack and HP.

MDS on variables - excluding the Total

To see the differences better, we can perform the analysis again, this time without the Total variable.

poke.dist.t.2<-dist(t(pokemon[, c(6:11)])) 
poke.mds.4<-cmdscale(poke.dist.t.2, k=2) 
a <- summary(poke.mds.4)
kable(a)

	V1	V2
	Min. :-529.41	Min. :-599.9
	1st Qu.:-258.89	1st Qu.: -78.4
	Median : 67.26	Median : 33.9
	Mean : 0.00	Mean : 0.0
	3rd Qu.: 125.48	3rd Qu.: 181.0
	Max. : 617.60	Max. : 418.0

plot(poke.mds.4, type = 'n') 
text(poke.mds.4, rownames(poke.mds.4), cex=0.8, adj = 0.5)

Removing the total variable gives a better picture - on this graph the left-top corner represents the Aggressive group, the right-top the Defensive, and the bottom-middle the Well-rounded group.

Testing the Goodness of Fit

To test how well the MDS algorithm was able to approximate the original dataset, we can compare it to a stress of a MDS algorithm run on a random dissimilarity matrix.

poke <- pokemon[, c(6:11)]
poke.dist <- dist(t(poke))  
poke.mds.4 <- mds(poke.dist, ndim=2,  type="ordinal") 
poke.mds.4


Call:
mds(delta = poke.dist, ndim = 2, type = "ordinal")

Model: Symmetric SMACOF 
Number of objects: 6 
Stress-1 value: 0.065 
Number of iterations: 35

stress.random.matrix <- randomstress(n=800, ndim=2, nrep = 1) 
poke.mds.4$stress/ mean(stress.random.matrix)

[1] 0.1103179

The resulting coefficient is 0.11, which is a fair result. We could improve it by adding the 3rd dimension, but that makes plots much less clear.

Principal Value Decomposition

PCA using the singular value decomposition

After the MDS, we can move to a alternative, the PCA. The prcomp() function takes care of the centering and scaling the data.

poke <- pokemon[, c(6:11)]

poke.pca.1<-prcomp(poke, center=TRUE, scale.=TRUE) 
a <- poke.pca.1$rotation
kable(a, digits = 2)

	PC1	PC2	PC3	PC4	PC5	PC6
HP	0.39	0.08	-0.47	0.72	-0.22	0.23
Attack	0.44	-0.01	-0.59	-0.41	0.19	-0.50
Defense	0.36	0.63	0.07	-0.42	-0.06	0.54
Special.Attack	0.46	-0.31	0.31	0.15	0.74	0.20
Special.Defense	0.45	0.24	0.57	0.19	-0.30	-0.55
Speed	0.34	-0.67	0.08	-0.30	-0.53	0.26

The results are similar - the PC1 increases with each statistic (the power level). The PC2 is more interesting - It is mostly influenced by the Defense (+), and Speed (-). The Special Attack (-) and Special Defense (+) are also significant, but their effect is about half as strong.

summary(poke.pca.1)

Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6
Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000

plot(poke.pca.1, type = "l")

Using 2 variables we can explain 63% of the variance. It is quite good, but could be improved. Adding a third dimension increases the ratio to 77%.

fviz_pca_ind(poke.pca.1, col.ind="cos2", geom = "point", gradient.cols = c(cYellow, cBlue))

Plotting the observations show us that they are quite similar to MDS.

fviz_pca_var(poke.pca.1, col.var="black")

We can plot the influence of each variable in a nice graph. As previously discussed, the Defense and Speed have the most significant effects.

PCA using the eigen

Using the princomp() function we can calculate the PCA using the eigen on the correlation matrix.

poke.pca.2<-princomp(poke)
loadings(poke.pca.2)


Loadings:
                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
HP               0.301                0.802  0.387  0.334
Attack           0.493         0.730        -0.193 -0.424
Defense          0.381  0.695        -0.366         0.485
Special.Attack   0.509 -0.383 -0.385  0.101 -0.641  0.158
Special.Defense  0.394  0.174 -0.541         0.375 -0.616
Speed            0.327 -0.576  0.144 -0.459  0.510  0.263

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000

As expected, the results are very similar.

plot(poke.pca.2)

fviz_pca_var(poke.pca.2, col.var="black")

Rotated PCA

To have more significant and easier to interpret results, we can use the rotated PCA approach.

poke.pca.3 <- principal(poke, nfactors=3, rotate="varimax")
poke.pca.3

Principal Components Analysis
Call: principal(r = poke, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2  RC3   h2   u2 com
HP               0.21  0.15 0.73 0.59 0.41 1.3
Attack           0.14  0.24 0.85 0.80 0.20 1.2
Defense          0.79 -0.16 0.38 0.79 0.21 1.5
Special.Attack   0.38  0.74 0.21 0.74 0.26 1.7
Special.Defense  0.85  0.37 0.08 0.86 0.14 1.4
Speed           -0.08  0.87 0.20 0.80 0.20 1.1

                       RC1  RC2  RC3
SS loadings           1.55 1.55 1.49
Proportion Var        0.26 0.26 0.25
Cumulative Var        0.26 0.52 0.76
Proportion Explained  0.34 0.34 0.32
Cumulative Proportion 0.34 0.68 1.00

Mean item complexity =  1.4
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.12 
 with the empirical chi square  318.75  with prob <  NA 

Fit based upon off diagonal values = 0.9

print(loadings(poke.pca.3), digits=2, cutoff=0.4, sort=TRUE)


Loadings:
                RC1   RC2   RC3  
Defense          0.79            
Special.Defense  0.85            
Special.Attack         0.74      
Speed                  0.87      
HP                           0.73
Attack                       0.85

                RC1  RC2  RC3
SS loadings    1.55 1.55 1.49
Proportion Var 0.26 0.26 0.25
Cumulative Var 0.26 0.52 0.76

These results are much more interesting - without the cut-off we can explain 68% of the variance. Cuting the influence at 0.4 lowers the ratio to 52%, but let’s us see the results better. It seems that we have a variable for each of our previously discovered groups - RC1 for Defensive Pokemon, RC2 for Aggressive, and RC3 for Well-rounded.

biplot(poke.pca.3, hist.col = cYellow, smoother = TRUE)

The biplot let’s us see the combinations of these 3 variables. However, it isn’t very clear, so no new insights can be gained.

Clustering the results

Optimal number of clusters

After performing the dimensionality reduction in MDS and PCA, we can cluster the results in K-means and PAM.
First step is to find a suitable number of clusters. For this we can run the fviz_nbclust() function from the factoextra package.

# Prepare data
poke.dist<-dist(poke) 
poke.mds.1 <- cmdscale(poke.dist, k=2) 
poke.mds.1.center <- center_scale(poke.mds.1)
poke.mds.1.center <- poke.mds.1

fviz_nbclust(as.data.frame(poke.mds.1), FUNcluster=pam)

The optimal number of clusters is 2. But, because we want to see the groups identified earlier, we can settle for 5 clusters, which have a bit lower silhouette.

K-means in ClusteR

We can now proceed to clustering using K-means. We’ll use the KMenas_rcpp() function from ClusteR library. The algorithm will be run on the data from MDS and PCA.

poke.km <- KMeans_rcpp(poke.mds.1.center, clusters=5, num_init=30, max_iters = 10000) 
poke.km.pca <- KMeans_rcpp(poke.pca.2$scores[, 1:2], clusters=5, num_init=30, max_iters = 10000) 

x1 <- ggplot(as.data.frame(poke.mds.1.center)) + geom_point(aes(x = V1, y = V2, colour = poke.km$clusters)) + scale_colour_gradientn(colours=wes_palette(n=3, name="BottleRocket2"), guide = FALSE)
x2 <- ggplot(as.data.frame(poke.pca.2$scores[, 1:2])) + geom_point(aes(x = Comp.1, y = Comp.2, colour = poke.km.pca$clusters)) + scale_colour_gradientn(colours=wes_palette(n=3, name="BottleRocket2"), guide = FALSE)

grid.arrange(x1, x2, nrow=2)

The data is nicely clustered into 5 groups: ‘weak’, ‘average’, ‘aggressive’, ‘defensive’ and ‘legendary’ Pokemon. The results from MDS and PCA are virtually identical, only the Y axis is reversed.

K-means in Factoextra

To display the silhouette plot, we can do the clustering using eclust().

poke.km.2 <- eclust(as.data.frame(poke.mds.1), "kmeans", k = 5)
poke.km.2.pca <- eclust(as.data.frame(poke.pca.2$scores[, 1:2]), "kmeans", k = 5)

fviz_silhouette(poke.km.2)
fviz_silhouette(poke.km.2.pca)

Almost all points have a positive silhouette, but the overall average silhouette is only decent. This is because the differences between points are quite small.

PAM

Additionally, we can run the PAM algorithm on the MDS and PCA points. Since this is a small dataset, the results are very similar to K-means.

poke.pam <- eclust(as.data.frame(poke.mds.1), "pam", k = 5)
poke.pam.2 <- eclust(as.data.frame(poke.pca.2$scores[, 1:2]), "pam", k = 5)
fviz_silhouette(poke.pam)
fviz_silhouette(poke.pam.2)

Conclusions

After this analysis I can conclude that there are rules governing the assignment of statistics to different Pokemon.

Visual analysis of the population showed that there is a strong two-peak distribution in the Total statistic, corresponding to pre and post evolution Pokemon. It was shown that the highest values are achieved by the small subset of all Pokemon, the legendary Pokemon. The analysis shown no signs of “power creep” between generations and showed that all statistics other than Total follow a normal distribution around 55.

Multidimensional scaling analysis was able to correctly identify significant outliers, that were then confirmed with a outside source (Wikipedia).

MDS analysis, later confirmed by PCA, showed that the more powerful Pokemon fall into 3 different categories - Aggressive (high Speed and Special Attack), Defensive (high Defense and Special Defense) and Well-rounded (high HP and Attack).

The K-means and PAM analysis showed that the dataset can be divided into 5 groups - ‘weak’, ‘medium - average’, ‘medium - aggressive’, ‘medium - defensive’ and ‘legendary’ Pokemon.

Are Pokemon Born Equal? Analysis Using Multidimensional Scaling and Clustering

Michał Szałański

19 marca 2019