K-Means Clustering

1 Intro
2 Import Data
3 Data Preprocessing
4 Exploratory Data Analysis
- 4.1 Possibility for Clustering
- 4.2 Possibility for Principle Component Analysis (PCA)
5 Clustering
6 PCA
7 Conclusion

1 Intro

1.1 What We’ll Do

We will try to do a clustering analysis using K-means method. We will also see if we can do a dimensionality reduction using the Principle Components Analysis (PCA).

1.2 Datasets

The datasets is acquired through Kaggle .

1.3 Library and Setup

Load the required library.

library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(factoextra)
library(plotly)
options(scipen = 100, max.print = 101)

2 Import Data

The datasets is list of all pokemons from Pokemon franchise game from generation 1 up to generation 7

data <- read.csv("pokemon.csv")

glimpse(data)

Observations: 801
Variables: 41
$ abilities         <fct> "['Overgrow', 'Chlorophyll']", "['Overgrow',...
$ against_bug       <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 0.25, 1.00, 1....
$ against_dark      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_dragon    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_electric  <dbl> 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0,...
$ against_fairy     <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0,...
$ against_fight     <dbl> 0.50, 0.50, 0.50, 1.00, 1.00, 0.50, 1.00, 1....
$ against_fire      <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
$ against_flying    <dbl> 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
$ against_ghost     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_grass     <dbl> 0.25, 0.25, 0.25, 0.50, 0.50, 0.25, 2.00, 2....
$ against_ground    <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0,...
$ against_ice       <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 1.0, 0.5, 0.5, 0.5,...
$ against_normal    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_poison    <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
$ against_psychic   <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,...
$ against_rock      <dbl> 1, 1, 1, 2, 2, 4, 1, 1, 1, 2, 2, 4, 2, 2, 2,...
$ against_steel     <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
$ against_water     <dbl> 0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5,...
$ attack            <int> 49, 62, 100, 52, 64, 104, 48, 63, 103, 30, 2...
$ base_egg_steps    <int> 5120, 5120, 5120, 5120, 5120, 5120, 5120, 51...
$ base_happiness    <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, ...
$ base_total        <int> 318, 405, 625, 309, 405, 634, 314, 405, 630,...
$ capture_rate      <fct> 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120...
$ classfication     <fct> Seed PokÃ©mon, Seed PokÃ©mon, Seed PokÃ©mon,...
$ defense           <int> 49, 63, 123, 43, 58, 78, 65, 80, 120, 35, 55...
$ experience_growth <int> 1059860, 1059860, 1059860, 1059860, 1059860,...
$ height_m          <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6,...
$ hp                <int> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, ...
$ japanese_name     <fct> Fushigidaneãƒ•ã‚·ã‚®ãƒ€ãƒ, Fushigisouãƒ•ã‚·...
$ name              <fct> Bulbasaur, Ivysaur, Venusaur, Charmander, Ch...
$ percentage_male   <dbl> 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88...
$ pokedex_number    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ sp_attack         <int> 65, 80, 122, 60, 80, 159, 50, 65, 135, 20, 2...
$ sp_defense        <int> 65, 80, 120, 50, 65, 115, 64, 80, 115, 20, 2...
$ speed             <int> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30,...
$ type1             <fct> grass, grass, grass, fire, fire, fire, water...
$ type2             <fct> poison, poison, poison, , , flying, , , , , ...
$ weight_kg         <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5...
$ generation        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ is_legendary      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Variable explanation:

name : The English name of the Pokemon
japanese_name : The Original Japanese name of the Pokemon
pokedex_number : The entry number of the Pokemon in the National Pokedex
percentage_male : The percentage of the species that are male. Blank if the Pokemon is genderless.
type1 : The Primary Type of the Pokemon
type2 : The Secondary Type of the Pokemon
classification : The Classification of the Pokemon as described by the Sun and Moon Pokedex
height_m : Height of the Pokemon in metres
weight_kg : The Weight of the Pokemon in kilograms
capture_rate : Capture Rate of the Pokemon
base_egg_steps : The number of steps required to hatch an egg of the Pokemon
abilities : A stringified list of abilities that the Pokemon is capable of having
experience_growth : The Experience Growth of the Pokemon
base_happiness : Base Happiness of the Pokemon
against_? : Eighteen features that denote the amount of damage taken against an attack of a particular type
hp : The Base HP of the Pokemon
attack : The Base Attack of the Pokemon
defense : The Base Defense of the Pokemon
sp_attack : The Base Special Attack of the Pokemon
sp_defense : The Base Special Defense of the Pokemon
base_total : attack + defense + hp + sp_attack + sp_defense
speed : The Base Speed of the Pokemon
generation : The numbered generation which the Pokemon was first introduced
is_legendary : Denotes if the Pokemon is legendary.

3 Data Preprocessing

Since is_legendary is logical, we will convert the data type.

data <- data %>% mutate(is_legendary = as.factor(if_else(is_legendary == 1, 
    "yes", "no")))

Capture rate should be a numeric data, we might check the levels

levels(data$capture_rate)

 [1] "100"                      "120"                     
 [3] "125"                      "127"                     
 [5] "130"                      "140"                     
 [7] "145"                      "15"                      
 [9] "150"                      "155"                     
[11] "160"                      "170"                     
[13] "180"                      "190"                     
[15] "200"                      "205"                     
[17] "220"                      "225"                     
[19] "235"                      "25"                      
[21] "255"                      "3"                       
[23] "30"                       "30 (Meteorite)255 (Core)"
[25] "35"                       "45"                      
[27] "50"                       "55"                      
[29] "60"                       "65"                      
[31] "70"                       "75"                      
[33] "80"                       "90"

Turns out there is a data with specific characters capture_rate with 2 numeric value based on the Pokemon condition. We may force them into NA by coercing the data type into numeric, or we can choose either one of the value. I prefer to keep them and choose the lower capture rate. generation is not a genuine numeric value, but rather a factor with the number indicating on which generation the Pokemon is introduced by the game.

data <- data %>% mutate(capture_rate = if_else(capture_rate == "30 (Meteorite)255 (Core)", 
    30, as.numeric(capture_rate)), generation = as.factor(generation))

We will check if there is any missing values in the data.

colSums(is.na(data))

        abilities       against_bug      against_dark    against_dragon 
                0                 0                 0                 0 
 against_electric     against_fairy     against_fight      against_fire 
                0                 0                 0                 0 
   against_flying     against_ghost     against_grass    against_ground 
                0                 0                 0                 0 
      against_ice    against_normal    against_poison   against_psychic 
                0                 0                 0                 0 
     against_rock     against_steel     against_water            attack 
                0                 0                 0                 0 
   base_egg_steps    base_happiness        base_total      capture_rate 
                0                 0                 0                 0 
    classfication           defense experience_growth          height_m 
                0                 0                 0                20 
               hp     japanese_name              name   percentage_male 
                0                 0                 0                98 
   pokedex_number         sp_attack        sp_defense             speed 
                0                 0                 0                 0 
            type1             type2         weight_kg        generation 
                0                 0                20                 0 
     is_legendary 
                0

The percentage_male variable has a lot of missing value. We need to remove them in order to preserve our data. We will not use the pokedex_number since it is a unique ID for each pokemon.

df2 <- data %>% select_if(~is.numeric(.)) %>% select(-c(pokedex_number, percentage_male)) %>% 
    mutate(legendary = data$is_legendary, name = data$name) %>% na.omit()
df <- df2 %>% select(-c(name, legendary))

4 Exploratory Data Analysis

4.1 Possibility for Clustering

One of the easiest way to segment the Pokemon is to look at the legendary status. Here, we try to see if a legendary pokemon require more base_egg_steps to hatch and has higher base_total stats.

ggplot(data, aes(base_egg_steps, base_total, color = is_legendary, size = capture_rate)) + 
    geom_point(alpha = 0.5) + theme_minimal()

There is a clear distinction between a legendary Pokemon and a common Pokemon by looking at the base_egg_steps and the base_total stats.

Let’s look at another merics. We want to see if there is a clear difference on combination of type1 and type2 toward the mean of the base total of a non-legendary Pokemon.

data %>% filter(is_legendary == "no") %>% group_by(type1, type2) %>% summarise(base = mean(base_total)) %>% 
    ggplot(aes(type1, type2, fill = base)) + scale_fill_viridis_c(option = "B") + 
    geom_tile(color = "black") + theme_minimal()

Pokemon which only has 1 type (no type2) is generally weaker compared to Pokemon which has more than 1 type. Pokemon with normal type is also weaker, regardless of the combination.

Lastly, we want to check some stats between legendary and non-legendary Pokemon.

p1 <- ggplot(data, aes(is_legendary, height_m, fill = is_legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Height")

p2 <- ggplot(data, aes(is_legendary, weight_kg, fill = is_legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Weight")

p3 <- ggplot(data, aes(is_legendary, speed, fill = is_legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Speed")

p4 <- ggplot(data, aes(is_legendary, hp, fill = is_legendary)) + geom_boxplot() + 
    theme_minimal() + theme(legend.position = "bottom") + labs(title = "Health Point (HP)")

plot_grid(p1, p2, p3, p4)

Legendary Pokemon is slightly better in all aspects.

Based on our exploratory process, we can segment or cluster some Pokemon based on their specific traits, such as the legendary status. To find more interesting and undiscovered pattern in the data, we will use clustering method using the K-means.

4.2 Possibility for Principle Component Analysis (PCA)

We want to see if there is a high correlation between numeric variables. Strong correlation in some variables imply that we can reduce the dimensionality or number of features using the Principle Component Analysis (PCA).

ggcorr(df, low = "navy", high = "darkred")

There are some features that has high correlation such as the base_total with sp_attack, against_ice with against_ground, etc. Based on this result, we will try to reduce the dimension using PCA.

5 Clustering

5.1 Finding Optimal Number of Clusters

Before we do cluster analysis, first we need to determine the optimal number of cluster. In clustering method, we seek to minimize the total within-cluster sum of squares (meaning that the distance is minimum between observation in the same cluster). To find the optimum number of cluster, we can use 3 methods: elbow method, silhouette method, and gap statistic. We will decide the number of cluster based on majority voting.

5.1.1 Elbow Method

Choosing the number of clusters using elbow method is arbitrary. The rule of thumb is we choose the number of cluster in the area of “bend of an elbow”, where the graph is total within sum of squares start to stagnate with the increase of the number of clusters.

fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + scale_y_continuous(labels = number_format(scale = 10^(-9), 
    big.mark = ",", suffix = " bil.")) + labs(subtitle = "Elbow method")

Using the elbow method, we know that 4 cluster is good enough since there is no significant decline in total within-cluster sum of squares on higher number of clusters. This method may be not enough since the optimal number of clusters is vague.

5.1.2 Silhouette Method

The silhouette method measures the silhouette coefficient, by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observations. We get the optimal number of clusters by choosing the number of cluster with the highest silhouette score (the peak).

fviz_nbclust(df, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

Based on the silhouette method, number of clusters with maximum score is considered as the optimum k-clusters. The graph shows that the optimum number of cluster is 4.

5.1.3 Gap Statistic

The gap statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic.

fviz_nbclust(df, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")

Based on the gap statistic method, the optimal k is one.

Two out of three methods suggest that k = 4 is the optimum number of clusters. If we use k=1, we will not be able to analyze the difference between clusters or segments because that mean all of the data belong to one cluster and has the same characteristics. This can also be an alarm to us that a clustering method may not be very useful, so take it with a grain of salt.

5.2 K-Means Clustering

Here is the algorithm behind K-Means Clustering:

1. Randomly assign number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
1. Iteratre until the cluster assignments stop changing. For each of the K clusters, compute the cluster centroid. The Kth cluster centroid is the vector of the p features means for the observations in the kth cluster. Assign each observation to the cluster whose centroid is closest (using euclidean distance or any other distance measurement)

set.seed(123)
km <- kmeans(df, centers = 4)
km

K-means clustering with 4 clusters of sizes 78, 184, 320, 199

Cluster means:
  against_bug against_dark against_dragon against_electric against_fairy
1   0.8910256     1.096154      0.7820513         1.243590      0.974359
2   0.9904891     1.057065      1.0679348         1.061141      1.245924
3   0.9601563     1.067969      0.9406250         1.092188      1.035937
  against_fight against_fire against_flying against_ghost against_grass
1      1.048077     1.016026       1.038462     0.9487179     1.1474359
2      1.160326     1.019022       1.074728     1.0353261     1.0353261
3      1.035937     1.216406       1.197656     0.9875000     1.0445312
  against_ground against_ice against_normal against_poison against_psychic
1       1.044872    1.038462      0.8076923      1.0256410       0.9487179
2       1.154891    1.258152      0.8926630      0.9076087       0.9116848
3       1.110156    1.132031      0.8757812      0.9523438       1.0203125
  against_rock against_steel against_water   attack base_egg_steps
1     1.256410     1.2115385      1.025641 63.43590       5120.000
2     1.197011     1.0108696      1.058424 94.66848      13753.043
3     1.281250     0.9812500      1.044531 73.19375       5196.000
  base_happiness base_total capture_rate  defense experience_growth
1       72.62821   393.9487     23.12821 67.87179          743589.7
2       50.40761   506.0489     22.02717 86.37500         1279673.9
3       69.53125   404.4094     21.82187 70.97500         1000000.0
   height_m       hp sp_attack sp_defense    speed weight_kg
1 0.9269231 69.60256  60.84615   74.33333 57.85897  29.45769
2 1.7581522 82.15761  84.51630   82.15217 76.17935 127.14239
3 1.0165625 64.50000  66.01875   66.92500 62.79688  48.14000
 [ reached getOption("max.print") -- omitted 1 row ]

Clustering vector:
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
  4   4   4   4   4   4   4   4   4   3   3   3   3   3   3   4   4   4 
 21  22  23  24  25  29  30  31  32  33  34  35  36  39  40  41  42  43 
  3   3   3   3   3   4   4   4   4   4   4   1   1   1   1   3   3   4 
 44  45  46  47  48  49  54  55  56  57  58  59  60  61  62  63  64  65 
  4   4   3   3   3   3   3   3   3   3   2   2   4   4   4   4   4   4 
 66  67  68  69  70  71  72  73  77  78  79  80  81  82  83  84  85  86 
  4   4   4   4   4   4   2   2   3   3   3   3   3   3   3   3   3   3 
 87  90  91  92  93  94  95  96  97  98  99 100 101 102 104 106 107 108 
  3   2   2   4   4   4   3   3   3   3   3   3   3   2   3   3   3   3 
109 110 111 112 113 114 115 116 117 118 119 
  3   3   2   2   1   3   3   3   3   3   3 
 [ reached getOption("max.print") -- omitted 680 entries ]

Within cluster sum of squares by cluster:
[1]  632087993618 1988052753373     458666182    2050047364
 (between_SS / total_SS =  87.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"

The ratio between the sum of squares distance between cluster to the total sum of squares is 87.2%, meaning that most of the sum of squares distance comes from the distance between clusters. Thus, we can conclude that our data is properly clustered since the observations in the same cluster has a little distance or variations. The number of members on each cluster is not equally distributed.

We’ve already get the information about the cluster of each observation. Let’s join the vector cluster into the dataset.

df_clust <- data %>% select(-percentage_male) %>% na.omit() %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:40)

df_clust

5.3 Cluster Analysis

We will do analysis regarding the characteristic of each cluster and see if there is a difference or specific traits on each clusters. Since we have a lot of features (32), we might not be able to explore all of them.

df_clust %>% mutate(cluster = cluster) %>% ggplot(aes(experience_growth, base_total, 
    color = cluster)) + geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") + 
    theme_minimal() + theme(legend.position = "top")

There is a clear distinction on each clusters based on their experience_growth. Cluster 1 has the lowest experience growth while Cluster 2 has the highest experience growth. Cluster 3 and Cluster 4 has a quite close experience growth but still perfectly separable from each other. No apparent difference in base_total between clusters, but Cluster 4 has a high variance in the base total stats.

We might check on each clusters centroid:

df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% mutate_if(is.numeric, 
    .funs = "round", digits = 2) %>% select(1:19)

Some interesting finding that we can take from the centroid including:

Based on their ressistances

Cluster 1 has greater ressistance against bugs, dragon, ghost, and normal attack moves, meaning they can take more damage from dragon and other attacks. However, they are weak against fairy and steel attacks compared to other clusters, so watch out when your enemy use a Pokemon that has one of those two type of moves.
Cluster 2 has weak ressitance against fairy, ground, ice, ghost and fight attacks than other clusters. They are strong against fire, psychic and rocks.
Cluster 3 has no specific ressistance that is better than other clusters. However, they are weak against fire, like Cluster 1.
Cluster 4 has weak ressistance against bug, flying, ghost, ice . However, they are strong against grass and steel.

What are we supposed to do with this information? We can use them to help us consider the kind of Pokemon that we will use in accordance with what type of moves our opponent’s has. For example, if our opponent’s Pokemon move mostly consists of dragon attack, we may choose one of our Pokemon that from Cluster 1, since they are stronger against dragon attacks. Or if we can place a Pokemon from Cluster 3 to be our first Pokemon, to take damages and observe what our opponent’s move type is. We then switch our Pokemon with the better one that suits our opponent.

However, considering what Pokemon to use only based on our opponent’s moves might be not enough. We need to look at our own Pokemon’s battle stats.

Based on Pokemon’s base battle stats

df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster, 
    attack, defense, sp_attack, sp_defense, hp, speed, base_total) %>% mutate_if(is.numeric, 
    .funs = "round", digits = 2)

Pokemon’s base stats is indicated by their attack, defense, sp_attack, sp_defence, hp, speed, and base_total.

Cluster 1 overall has the lowest base_total stats. They are particularly the weakest in term of attack, speed, and sp_attack. They don’t have a particular stats that is better than other clusters. A group of Pokemon with low attack power.
Cluster 2 overall has the highest base_total and they are the best in all stats.
Cluster 3 is the balanced group, with no particular stats that is worse than other clusters even though their base_total is the second worse.
Cluster 4 has the worst defense, sp_defense, and hp. They are a group of Pokemon with low defense power, the opposite of Cluster 1.

Based on other numeric stats

df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster, 
    base_egg_steps, base_happiness, experience_growth, weight_kg, height_m, 
    capture_rate) %>% mutate_if(is.numeric, .funs = "round", digits = 2)

Cluster 1 is probably the smallest in term of weight and height. They has the highest base_happiness and easy to catch(high capture_rate) and hatch from egg(base_egg_steps). They are also the easiest to reach level 100, indicated by the low experience_growth.
Cluster 2 is the hardest to hatch from egg and start with less happiness than other clusters. They are also consists of larger Pokemon (high weight and height) and the hardest to reach level 100.
Cluster 3 and Cluster 4 is in the middle between Cluster 1 and Cluster 2, with Pokemon of Cluster 3 has slightly bigger size in term of their weight and height.

Based on the legendary status

How is the distribution of legendary Pokemon?

df_clust %>% group_by(cluster, is_legendary) %>% summarise(total = n())

Cluster 1 and Cluster 3 has no legendary Pokemon in it, while Cluster 2 has the most legendary Pokemon. That’s why they are better than the others in term of battle stats we previously examined.

6 PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

6.1 Dimensionality Reduction

Here we will make PCA from the df datasets. We will see the eigenvalues and the percentage of variances explained by each dimensions. The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set.

poke_pca <- PCA(df2 %>% select(-name), scale.unit = T, ncp = 31, graph = F, 
    quali.sup = 32)

summary(poke_pca)


Call:
PCA(X = df2 %>% select(-name), scale.unit = T, ncp = 31, quali.sup = 32,  
     graph = F) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               5.412   3.528   2.570   2.173   1.920   1.765
% of var.             17.457  11.382   8.290   7.009   6.195   5.694
                       Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
Variance               1.426   1.330   1.246   1.162   1.003   0.794
% of var.              4.601   4.292   4.018   3.748   3.234   2.562
                      Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
Variance               0.751   0.733   0.629   0.616   0.584   0.515
% of var.              2.421   2.363   2.029   1.987   1.885   1.661
                      Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
Variance               0.453   0.398   0.329   0.316   0.310   0.262
% of var.              1.460   1.282   1.060   1.020   1.001   0.845
                      Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
Variance               0.225   0.190   0.145   0.088   0.069   0.058
% of var.              0.727   0.613   0.466   0.284   0.223   0.188
                      Dim.31
Variance               0.000
% of var.              0.000
 [ reached getOption("max.print") -- omitted 1 row ]

Individuals (the 10 first)
                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
1                 |  4.218 | -2.274  0.122  0.291 |  1.373  0.068  0.106 |
2                 |  3.819 | -1.131  0.030  0.088 |  1.657  0.100  0.188 |
3                 |  5.108 |  2.051  0.100  0.161 |  2.214  0.178  0.188 |
4                 |  4.029 | -1.716  0.070  0.181 | -1.778  0.115  0.195 |
5                 |  3.504 | -0.395  0.004  0.013 | -1.451  0.076  0.171 |
                   Dim.3    ctr   cos2  
1                 -0.975  0.047  0.053 |
2                 -1.070  0.057  0.078 |
3                 -1.334  0.089  0.068 |
4                 -0.341  0.006  0.007 |
5                 -0.442  0.010  0.016 |
 [ reached getOption("max.print") -- omitted 5 rows ]

Variables (the 10 first)
                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
against_bug       | -0.025  0.011  0.001 |  0.312  2.756  0.097 |  0.061
against_dark      |  0.131  0.319  0.017 | -0.177  0.884  0.031 | -0.733
against_dragon    |  0.173  0.551  0.030 |  0.257  1.874  0.066 |  0.193
against_electric  | -0.080  0.119  0.006 | -0.031  0.028  0.001 | -0.112
against_fairy     |  0.144  0.385  0.021 |  0.392  4.361  0.154 |  0.502
                     ctr   cos2  
against_bug        0.145  0.004 |
against_dark      20.880  0.537 |
against_dragon     1.443  0.037 |
against_electric   0.486  0.012 |
against_fairy      9.797  0.252 |
 [ reached getOption("max.print") -- omitted 5 rows ]

Supplementary categories
                       Dist     Dim.1    cos2  v.test     Dim.2    cos2
no                |   0.468 |  -0.413   0.779 -15.926 |  -0.098   0.044
yes               |   4.828 |   4.261   0.779  15.926 |   1.010   0.044
                   v.test     Dim.3    cos2  v.test  
no                 -4.674 |   0.057   0.015   3.183 |
yes                 4.674 |  -0.587   0.015  -3.183 |

Let’s visualize the percentage of variances captured by each dimensions.

fviz_eig(poke_pca, ncp = 15, addlabels = T, main = "Variance explained by each dimensions")

50% of the variances can be explained by only using the first 5 dimensions, with the first dimensions can explain 17.5% of the total variances.

We can keep around 80% of the information from our data by using only 13 dimensions (thus, 58% dimensionality reduction). This mean that we can actually reduce the number of features on our dataset from 31 to just 13 numeric features.

We can extract the values of PC1 to PC13 from all of the observations and put it into a new data frame. This data frame can later be analyzed using supervised learning classification technique or other purposes.

df_pca <- data.frame(poke_pca$ind$coord[, 1:13]) %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:13)
df_pca

6.2 Individual and Variable Factor Map

6.2.1 Individual Observations Map

cos2 or the squared cosine value shows the importance of a principal component for a given observation (vector of original variables). The value of cos2 can help find the components that are important to interpret observations.

The individual observations map shows where each of the observations is positioned in term of PC1 and PC2. Using only the first 2 PCs, we can see that there are a lot of outliers in our data that has high cos2 in the PC1. Further analysis can be done to check them out.

fviz_pca_ind(poke_pca, habillage = 32, addEllipses = T)

From the plot, we can see that pokemon with legendary status has more varied PCs values, indicated by bigger ellipse sphere, and has higher PC1 scores compared to the non-legendary. However, it’s clear that from the PCA, our data cannot be represent with only 2 dimensions, since it only accomodate less than 30% of the variance and cannot represent our data.

Pokemon number 384 and 401 is quite an outlier, let’s see his data.

df2 %>% mutate(rn = row_number()) %>% select(name, legendary, c(1:31, 34)) %>% 
    filter(rownames(df2) == 384)

df2 %>% filter(legendary == "yes") %>% select(name, 19:31) %>% arrange(desc(base_total)) %>% 
    top_n(5, wt = base_total)

The Pokemon with index number of 384 is Rayquaza, which is the second best legendary Pokemon in term of base_total stats and has higher attack and bigger than Mewtwo.

6.2.2 Variable Factor Map

If the observations are represented by their projections, the variables are represented by their correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components. The closer to the center of the plot a variable is, the less important it is for the first two components.

fviz_pca_var(poke_pca, select.var = list(contrib = 31), col.var = "contrib", 
    gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

The plot above shows us that the variables are located inside the circle, meaning that we would need more that two components to represent our data perfectly. The distance between variables and the origin measures the quality of the variables on the factor map. The color indicates the contribution of each variables. The contributions of variables in accounting for the variability in a given principal component are expressed in percentage. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. Variables that do not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed to simplify the overall analysis.

We can also check the quality of representation or cos2 of each variables. A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

fviz_cos2(poke_pca, choice = "var", fill = "cos2") + scale_fill_viridis_c(option = "B") + 
    theme(legend.position = "top")

Variables that highly contributed to PC2 are against_flying, against_poison, against_ice, and against_ground, while the rest of the first 13 variables contribute more toward PC1, especially base_total, which has the highest contribution toward PC1. The against_ice has the lowest correlation with the principle components, while the against_ground is negatively correlated with PC2.

We can consider to remove the against_ice variables since it’s contribute a little to our PC1.

a <- dimdesc(poke_pca)
as.data.frame(a[[1]]$quanti)

6.3 Clustering with PCA

PCA can also be integrated with the result of the K-means Clustering to help visualize our data in a fewer dimensions than the original features.

fviz_cluster(object = km, data = df, labelsize = 0) + theme_minimal()

However, same problem like the individual observations map happened: we have to little dimensions to represent our data. On the above visualization, our cluster looks like intersecting each other because we don’t have enough dimensions to represent them.

We may add 1 more dimensions using plotly to see if our clusters is still clumped together.

plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black", 
    "red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"), 
    yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))

7 Conclusion

We can pull some conclusions regarding our dataset based on the previous cluster and principle component analysis:

We can separate our data into at least 4 clusters based on all of the numerical features, with more than 87% of the total sum of squares come from the distance of observations between clusters.
Cluster 2 has the unique traits that it has the most (if not all) of the legendary Pokemon, which make it the best overall in base_total battle stats.
We can reduce our dimensions from 31 features into just 13 dimensions and still retain more than 80% of the variances using PCA. The dimensionality reduction can be useful if we apply the new PCA for machine learning applications.
However, as we have seen, the dimensionality reduction is not enough for us to visualize the clustering of our data, indicated by overlapping of clusters if we only use the first 2 dimensions. Perhaps the result from the gap statistic method is true, that there is only 1 big cluster.