1 Intro

We will run a cluster analysis on the Pokemon dataset and check whether dimensionality reduction is feasible for it.

2 Setup and Import Data

Import libraries

library(dplyr)
library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(plotly)
options(scipen = 123)

Import dataset

pokemon <- read.csv('pokemon.csv')
str(pokemon)
#> 'data.frame':    801 obs. of  41 variables:
#>  $ abilities        : chr  "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
#>  $ against_bug      : num  1 1 1 0.5 0.5 0.25 1 1 1 1 ...
#>  $ against_dark     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ against_dragon   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ against_electric : num  0.5 0.5 0.5 1 1 2 2 2 2 1 ...
#>  $ against_fairy    : num  0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
#>  $ against_fight    : num  0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
#>  $ against_fire     : num  2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
#>  $ against_flying   : num  2 2 2 1 1 1 1 1 1 2 ...
#>  $ against_ghost    : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ against_grass    : num  0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
#>  $ against_ground   : num  1 1 1 2 2 0 1 1 1 0.5 ...
#>  $ against_ice      : num  2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
#>  $ against_normal   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ against_poison   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ against_psychic  : num  2 2 2 1 1 1 1 1 1 1 ...
#>  $ against_rock     : num  1 1 1 2 2 4 1 1 1 2 ...
#>  $ against_steel    : num  1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
#>  $ against_water    : num  0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
#>  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
#>  $ base_egg_steps   : int  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
#>  $ base_happiness   : int  70 70 70 70 70 70 70 70 70 70 ...
#>  $ base_total       : int  318 405 625 309 405 634 314 405 630 195 ...
#>  $ capture_rate     : chr  "45" "45" "45" "45" ...
#>  $ classfication    : chr  "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
#>  $ defense          : int  49 63 123 43 58 78 65 80 120 35 ...
#>  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
#>  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
#>  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
#>  $ japanese_name    : chr  "Fushigidaneフシギダネ" "Fushigisouフシギソウ" "Fushigibanaフシギバナ" "Hitokageヒトカゲ" ...
#>  $ name             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
#>  $ percentage_male  : num  88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
#>  $ pokedex_number   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
#>  $ sp_defense       : int  65 80 120 50 65 115 64 80 115 20 ...
#>  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
#>  $ type1            : chr  "grass" "grass" "grass" "fire" ...
#>  $ type2            : chr  "poison" "poison" "poison" "" ...
#>  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
#>  $ generation       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ is_legendary     : int  0 0 0 0 0 0 0 0 0 0 ...

Data Dictionary :

  • name : The English name of the Pokemon
  • japanese_name : The Original Japanese name of the Pokemon
  • pokedex_number : The entry number of the Pokemon in the National Pokedex
  • percentage_male : The percentage of the species that are male. Blank if the Pokemon is genderless.
  • type1 : The Primary Type of the Pokemon
  • type2 : The Secondary Type of the Pokemon
  • classfication : The Classification of the Pokemon as described by the Sun and Moon Pokedex (the column name is misspelled this way in the dataset)
  • height_m : Height of the Pokemon in metres
  • weight_kg : The Weight of the Pokemon in kilograms
  • capture_rate : Capture Rate of the Pokemon
  • base_egg_steps : The number of steps required to hatch an egg of the Pokemon
  • abilities : A stringified list of abilities that the Pokemon is capable of having
  • experience_growth : The Experience Growth of the Pokemon
  • base_happiness : Base Happiness of the Pokemon
  • against_? : Eighteen features that denote the amount of damage taken against an attack of a particular type
  • hp : The Base HP of the Pokemon
  • attack : The Base Attack of the Pokemon
  • defense : The Base Defense of the Pokemon
  • sp_attack : The Base Special Attack of the Pokemon
  • sp_defense : The Base Special Defense of the Pokemon
  • base_total : attack + defense + hp + sp_attack + sp_defense + speed
  • speed : The Base Speed of the Pokemon
  • generation : The numbered generation in which the Pokemon was first introduced
  • is_legendary : Denotes if the Pokemon is legendary.

3 Data Processing

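Assign the right data types:

  • ‘is_legendary’ : categorical (yes / no)
  • ‘capture_rate’ : numeric
  • ‘generation’ : categorical
  • ‘type1’ & ‘type2’ : categorical

One ‘capture_rate’ entry, "30 (Meteorite)255 (Core)", lists two values; we keep the first one (30).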

poke <- pokemon %>% 
  mutate(is_legendary = as.factor(if_else(is_legendary == 1, "yes", "no")),
         # one entry lists two capture rates; keep the first
         # (as.numeric() still warns about that string, which is harmless here)
         capture_rate = if_else(capture_rate == "30 (Meteorite)255 (Core)", 30, as.numeric(capture_rate)),
         generation = as.factor(generation),
         type1 = as.factor(type1),
         type2 = as.factor(type2))
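The ‘capture_rate’ conversion can be tried out on a small vector first. This is only a sketch in base R: the vector `raw` is made up for illustration and the helper name `parse_capture_rate` is hypothetical, not part of the analysis above. `suppressWarnings()` hides the coercion warning that `as.numeric()` raises for the one non-numeric entry.

```r
# Hypothetical helper: convert capture_rate strings to numbers and map the
# one two-value entry to its first value (30), as the pipeline above does.
parse_capture_rate <- function(x) {
  special <- x == "30 (Meteorite)255 (Core)"
  out <- suppressWarnings(as.numeric(x))  # the special string becomes NA here
  out[special] <- 30                      # then it is filled in explicitly
  out
}

raw <- c("45", "255", "30 (Meteorite)255 (Core)", "3")  # illustrative values
parsed <- parse_capture_rate(raw)
print(parsed)
```

If any other non-numeric string were present, it would stay NA and show up in the missing value check.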

Check missing values

colSums(is.na(poke))
#>         abilities       against_bug      against_dark    against_dragon 
#>                 0                 0                 0                 0 
#>  against_electric     against_fairy     against_fight      against_fire 
#>                 0                 0                 0                 0 
#>    against_flying     against_ghost     against_grass    against_ground 
#>                 0                 0                 0                 0 
#>       against_ice    against_normal    against_poison   against_psychic 
#>                 0                 0                 0                 0 
#>      against_rock     against_steel     against_water            attack 
#>                 0                 0                 0                 0 
#>    base_egg_steps    base_happiness        base_total      capture_rate 
#>                 0                 0                 0                 0 
#>     classfication           defense experience_growth          height_m 
#>                 0                 0                 0                20 
#>                hp     japanese_name              name   percentage_male 
#>                 0                 0                 0                98 
#>    pokedex_number         sp_attack        sp_defense             speed 
#>                 0                 0                 0                 0 
#>             type1             type2         weight_kg        generation 
#>                 0                 0                20                 0 
#>      is_legendary 
#>                 0

Dropping features

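We drop the ‘percentage_male’ feature because it has a high number of missing values (98), along with other unwanted features. We keep all numeric features plus each Pokemon’s name, types, and legendary status, and use na.omit() to remove the rows with missing ‘height_m’ or ‘weight_kg’ (20 missing values each). A second data frame, poclean2, holds only the numeric features for the cluster and PCA analysis.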

poclean <- poke %>% select_if(~is.numeric(.)) %>% 
                    select(-c(pokedex_number, percentage_male)) %>% 
                    mutate(legendary = poke$is_legendary, name = pokemon$name, type1=poke$type1, type2=poke$type2) %>% na.omit()

poclean2 <- poclean %>% select(-c(legendary, name, type1, type2))
colSums(is.na(poclean))
#>       against_bug      against_dark    against_dragon  against_electric 
#>                 0                 0                 0                 0 
#>     against_fairy     against_fight      against_fire    against_flying 
#>                 0                 0                 0                 0 
#>     against_ghost     against_grass    against_ground       against_ice 
#>                 0                 0                 0                 0 
#>    against_normal    against_poison   against_psychic      against_rock 
#>                 0                 0                 0                 0 
#>     against_steel     against_water            attack    base_egg_steps 
#>                 0                 0                 0                 0 
#>    base_happiness        base_total      capture_rate           defense 
#>                 0                 0                 0                 0 
#> experience_growth          height_m                hp         sp_attack 
#>                 0                 0                 0                 0 
#>        sp_defense             speed         weight_kg         legendary 
#>                 0                 0                 0                 0 
#>              name             type1             type2 
#>                 0                 0                 0

4 Exploratory Data Analysis

4.1 Clustering Possibility

Let's plot the legendary status of each Pokemon against other attributes.

ggplot(poclean, aes(base_egg_steps, base_total, color = legendary, size = experience_growth)) + 
    geom_point(alpha = 0.5) + theme_minimal()

From the plot above we can see a clear distinction: legendary Pokemon tend to have a high ‘base_total’, which is the combined value of their stats. The number of steps needed to hatch their eggs also separates legendary from non-legendary Pokemon, and the experience they need to grow is high as well.

Let's take another look: non-legendary Pokemon types vs base_total.

levels(poclean$type1)
#>  [1] "bug"      "dark"     "dragon"   "electric" "fairy"    "fighting"
#>  [7] "fire"     "flying"   "ghost"    "grass"    "ground"   "ice"     
#> [13] "normal"   "poison"   "psychic"  "rock"     "steel"    "water"
levels(poclean$type2)
#>  [1] ""         "bug"      "dark"     "dragon"   "electric" "fairy"   
#>  [7] "fighting" "fire"     "flying"   "ghost"    "grass"    "ground"  
#> [13] "ice"      "normal"   "poison"   "psychic"  "rock"     "steel"   
#> [19] "water"

You might notice an empty string among the type2 levels. It is not a missing value; it indicates that the Pokemon has no secondary type.
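If the empty level is confusing in plots or tables, it can be given an explicit label. The sketch below uses a small made-up factor rather than the real column, and the label "none" is an arbitrary choice.

```r
# Illustrative factor with the same kind of empty level as type2
type2 <- factor(c("poison", "", "flying", "", "poison"))

# Rename the empty level so "no secondary type" is visible in output
levels(type2)[levels(type2) == ""] <- "none"

print(table(type2))
```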

poclean %>% filter(legendary =='no') %>% group_by(type1, type2) %>% summarise(base_avg = mean(base_total)) %>% 
  ggplot(aes(type1, type2, fill = base_avg)) + scale_fill_viridis_c(option = "B") + 
    geom_tile() + 
  theme_minimal()

The plot above shows that Pokemon with a single type and/or normal type are generally weak (dark tiles indicate a low average base_total).

Lastly, let's see what makes a legendary Pokemon ‘legendary’. Are they strong?

p1 <- ggplot(poclean, aes(legendary, attack, fill = legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Attack")

p2 <- ggplot(poclean, aes(legendary, defense, fill = legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Defense")

p3 <- ggplot(poclean, aes(legendary, speed, fill = legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Speed")

p4 <- ggplot(poclean, aes(legendary, hp, fill = legendary)) + geom_boxplot() + 
    theme_minimal() + theme(legend.position = "bottom") + labs(title = "Health Point (HP)")

plot_grid(p1, p2, p3, p4)

One more plot just to be sure: are they really that good?

ggplot(poclean, aes(legendary, base_total, fill = legendary)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Base Total")

In the first set of plots, legendary Pokemon show better stats than non-legendary Pokemon, although some non-legendary Pokemon are comparable to legendary ones on a single stat. When we compare total stats, however, the distinction between legendary and non-legendary Pokemon is clear.

4.2 PCA Possibility

We know that base_total is the combined value of a Pokemon's stats (attack + defense + hp + sp_attack + sp_defense + speed), so it is clearly correlated with those features. Let's see whether other features are correlated with each other as well.
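A quick check on the first four rows of the str() output (Bulbasaur, Ivysaur, Venusaur, Charmander) suggests that base_total is the sum of six stats, speed included. The values are copied by hand from that output; on the real data the same comparison could be run on the corresponding columns of poclean.

```r
# First four rows of the str() output, copied by hand
first_rows <- data.frame(
  attack     = c(49, 62, 100, 52),
  defense    = c(49, 63, 123, 43),
  hp         = c(45, 60, 80, 39),
  sp_attack  = c(65, 80, 122, 60),
  sp_defense = c(65, 80, 120, 50),
  speed      = c(45, 60, 80, 65),
  base_total = c(318, 405, 625, 309)
)

six_stat_sum <- rowSums(first_rows[, c("attack", "defense", "hp",
                                       "sp_attack", "sp_defense", "speed")])
print(all(six_stat_sum == first_rows$base_total))
```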

We already separated the numeric values into a different data frame, so let's use that.

ggcorr(poclean2,hjust= 1)

As we suspected, the stats are strongly correlated with each other. The ‘against_’ features are correlated too, which makes sense: if a Pokemon type is strong against certain types, it is also weak against certain others. Based on this result we can conclude that the dimension of this dataset could be reduced using PCA.

5 Clustering

First, we need to choose the optimal number of clusters. One way to decide is domain knowledge (or asking an expert). If you ask people who are into Pokemon ‘how many types of Pokemon are there?’ you will get an answer, but the answer depends on how the question is asked. So rather than going around asking questions, we can use the Elbow Method.

5.1 Elbow Method

Choosing the number of clusters is otherwise arbitrary; anyone could come up with their own number. With the Elbow Method we plot the ‘Within Sum of Squares’ (WSS) against the number of clusters and pick the point after which adding more clusters no longer reduces WSS significantly.

What is the ‘within sum of squares’? WSS is the sum of squared distances from every observation to its cluster centroid, where the centroid is the centre point (the mean) of a cluster. A high WSS therefore means that a cluster holds many observations or that its observations lie far from the centroid. We want WSS to be low, but not at any price: in theory we could set the number of clusters equal to the number of observations, so that every observation is its own centroid and WSS is 0, which tells us nothing.
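To make the definition concrete, WSS can be computed by hand and compared with what kmeans reports in `tot.withinss`. This is a minimal sketch on the built-in iris data (used only for illustration, not the Pokemon data); the seed and the choice of 3 clusters are arbitrary.

```r
set.seed(42)                      # arbitrary seed so the sketch is repeatable
x <- as.matrix(iris[, 1:4])       # built-in data, illustration only
km <- kmeans(x, centers = 3, nstart = 10)

# Look up the centroid of each row's cluster, then sum the squared distances
assigned_centroids <- km$centers[km$cluster, ]
wss_manual <- sum((x - assigned_centroids)^2)

print(c(manual = wss_manual, kmeans = km$tot.withinss))
```

The two numbers agree, which confirms that the quantity on the elbow plot is a sum of squared distances, not of plain distances.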

fviz_nbclust(poclean2, kmeans, method = "wss", k.max = 8)

The plot above suggests that the optimal number of clusters is 4. When we increase the number of clusters to 5, WSS goes up. Because kmeans starts from random centroids and can stop at a local optimum, this bump most likely reflects an unlucky start rather than a property of the data; with the best possible partition, WSS never increases when a cluster is added. Going up to 7 clusters does not bring a meaningful decrease in WSS either.
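One way to smooth out such bumps is to let kmeans try several random starts for each number of clusters and keep the best one. The sketch below draws the same kind of curve by hand on the built-in iris data (illustration only); `nstart = 25` and the seed are arbitrary choices.

```r
set.seed(42)
x <- scale(iris[, 1:4])           # illustrative data, standardised

# Total within-cluster sum of squares for k = 1..8, best of 25 random starts each
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

print(round(wss, 1))
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")
```

As far as I know, fviz_nbclust forwards extra arguments to the clustering function, so adding `nstart = 25` to the call above should have a similar effect, but that is worth checking against the factoextra documentation.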

5.2 KMeans Clustering

kmeans <-  kmeans(poclean2, centers = 4)

We now have the clusters; let's combine them with our data so we can analyze them.

poclean2$cluster <- as.factor(kmeans$cluster)
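A caveat worth keeping in mind: k-means relies on Euclidean distance, so features with a large range (here, for example, experience_growth and base_egg_steps) can dominate the result when the data is not scaled. The toy example below is a sketch with made-up data, not the Pokemon data: one small-range feature carries the real grouping and one large-range feature is pure noise. The helper `agreement` is hypothetical.

```r
set.seed(1)
n <- 200
group <- rep(1:2, each = n / 2)
toy <- cbind(
  small_range = rnorm(n, mean = ifelse(group == 1, 0, 10), sd = 0.5),  # real grouping
  large_range = runif(n, min = 0, max = 1e6)                           # noise, huge scale
)

# Share of points whose cluster matches the true group (up to label swapping)
agreement <- function(cluster, truth) {
  a <- mean(cluster == truth)
  max(a, 1 - a)
}

raw_fit    <- kmeans(toy, centers = 2, nstart = 25)
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 25)

print(c(raw = agreement(raw_fit$cluster, group),
        scaled = agreement(scaled_fit$cluster, group)))
```

On the raw data the split follows the noisy large-range feature, while on the scaled data it recovers the true groups. Whether scaling changes the Pokemon clusters is something to test rather than assume.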

5.3 Cluster Analysis

We can average the features we want to analyze, grouped by cluster.

poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, base_egg_steps, experience_growth))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, height_m, weight_kg))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, attack, defense, speed, hp, sp_attack, sp_defense, base_total))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, (1:19)))

From the data above we can take a few key points (cluster numbers depend on the random start of kmeans, so they may differ between runs):

  • Pokemon in cluster 2 are stronger all around, and the biggest and tallest
  • Pokemon in cluster 3 are the easiest to level up and the lightest

There are many other points, for example:

  • Pokemon in cluster 4 are the worst against bug type
  • Pokemon in cluster 1 are the worst against fire type

6 PCA

With PCA we can reduce the dimension of our data by combining the original features into new ones, which helps when features are highly correlated. Put simply, if two or more features are strongly correlated we would normally drop one of them and lose the information it carries. PCA instead finds new axes that capture as much of the variance in the data as possible, and the data projected onto those axes become the new features.

PCA is sensitive to the range of each feature, so don't forget to scale your data.
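The effect of scaling can be seen on a toy example. The sketch below uses base R's prcomp as a stand-in for FactoMineR's PCA, on two made-up, independent features measured on very different scales; the names `metres` and `grams` are illustrative only.

```r
set.seed(7)
toy <- data.frame(
  metres = rnorm(100, mean = 1, sd = 0.5),
  grams  = rnorm(100, mean = 50000, sd = 20000)
)

unscaled <- prcomp(toy, scale. = FALSE)
scaled   <- prcomp(toy, scale. = TRUE)

# Proportion of variance attributed to the first component in each case
prop_unscaled <- unscaled$sdev[1]^2 / sum(unscaled$sdev^2)
prop_scaled   <- scaled$sdev[1]^2 / sum(scaled$sdev^2)
print(round(c(unscaled = prop_unscaled, scaled = prop_scaled), 3))
```

Without scaling, the first component is essentially just the large-range feature. With scaling, the two independent features share the variance roughly equally, which is the honest picture.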

poke_pca <- PCA(poclean2 %>% select(-cluster), scale.unit = T, ncp = 31)

summary(poke_pca)
#> 
#> Call:
#> PCA(X = poclean2 %>% select(-cluster), scale.unit = T, ncp = 31) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
#> Variance               5.747   3.540   2.572   2.179   1.921   1.765   1.385
#> % of var.             18.538  11.419   8.296   7.028   6.198   5.692   4.469
#> Cumulative % of var.  18.538  29.957  38.254  45.282  51.480  57.172  61.641
#>                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
#> Variance               1.324   1.249   1.103   0.998   0.793   0.756   0.666
#> % of var.              4.270   4.028   3.557   3.220   2.557   2.438   2.149
#> Cumulative % of var.  65.911  69.939  73.496  76.716  79.273  81.711  83.860
#>                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
#> Variance               0.634   0.586   0.522   0.456   0.427   0.397   0.326
#> % of var.              2.045   1.890   1.683   1.470   1.377   1.281   1.052
#> Cumulative % of var.  85.905  87.795  89.479  90.949  92.326  93.607  94.659
#>                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
#> Variance               0.321   0.305   0.257   0.224   0.189   0.145   0.088
#> % of var.              1.035   0.983   0.829   0.723   0.608   0.467   0.283
#> Cumulative % of var.  95.695  96.678  97.507  98.230  98.839  99.305  99.589
#>                       Dim.29  Dim.30  Dim.31
#> Variance               0.069   0.058   0.000
#> % of var.              0.223   0.188   0.000
#> Cumulative % of var.  99.812 100.000 100.000
#> 
#> Individuals (the 10 first)
#>                       Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
#> 1                 |  4.247 | -2.048  0.093  0.233 |  1.493  0.081  0.124 |
#> 2                 |  3.850 | -0.926  0.019  0.058 |  1.735  0.109  0.203 |
#> 3                 |  5.132 |  2.180  0.106  0.181 |  2.173  0.171  0.179 |
#> 4                 |  4.058 | -1.563  0.054  0.148 | -1.664  0.100  0.168 |
#> 5                 |  3.538 | -0.268  0.002  0.006 | -1.385  0.069  0.153 |
#> 6                 |  6.684 |  2.348  0.123  0.123 |  1.093  0.043  0.027 |
#> 7                 |  3.573 | -1.558  0.054  0.190 | -0.994  0.036  0.077 |
#> 8                 |  3.013 | -0.316  0.002  0.011 | -0.736  0.020  0.060 |
#> 9                 |  4.572 |  2.680  0.160  0.343 | -0.193  0.001  0.002 |
#> 10                |  5.183 | -4.493  0.450  0.751 |  0.813  0.024  0.025 |
#>                    Dim.3    ctr   cos2  
#> 1                 -1.005  0.050  0.056 |
#> 2                 -1.088  0.059  0.080 |
#> 3                 -1.318  0.087  0.066 |
#> 4                 -0.401  0.008  0.010 |
#> 5                 -0.488  0.012  0.019 |
#> 6                 -2.080  0.215  0.097 |
#> 7                  0.228  0.003  0.004 |
#> 8                  0.158  0.001  0.003 |
#> 9                 -0.186  0.002  0.002 |
#> 10                -0.516  0.013  0.010 |
#> 
#> Variables (the 10 first)
#>                      Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
#> against_bug       | -0.021  0.008  0.000 |  0.309  2.699  0.096 |  0.068  0.179
#> against_dark      |  0.123  0.262  0.015 | -0.181  0.928  0.033 | -0.727 20.578
#> against_dragon    |  0.170  0.500  0.029 |  0.247  1.717  0.061 |  0.200  1.558
#> against_electric  | -0.074  0.097  0.006 | -0.025  0.017  0.001 | -0.118  0.541
#> against_fairy     |  0.147  0.376  0.022 |  0.382  4.129  0.146 |  0.509 10.062
#> against_fight     |  0.146  0.369  0.021 | -0.329  3.049  0.108 |  0.712 19.731
#> against_fire      | -0.161  0.449  0.026 |  0.390  4.300  0.152 | -0.268  2.797
#> against_flying    | -0.251  1.094  0.063 |  0.702 13.934  0.493 | -0.080  0.250
#> against_ghost     |  0.152  0.403  0.023 | -0.116  0.381  0.013 | -0.755 22.185
#> against_grass     |  0.078  0.106  0.006 | -0.491  6.821  0.241 |  0.291  3.302
#>                     cos2  
#> against_bug        0.005 |
#> against_dark       0.529 |
#> against_dragon     0.040 |
#> against_electric   0.014 |
#> against_fairy      0.259 |
#> against_fight      0.507 |
#> against_fire       0.072 |
#> against_flying     0.006 |
#> against_ghost      0.571 |
#> against_grass      0.085 |
poke_pca$eig
#>                                     eigenvalue
#> comp 1  5.746837110817850202693080063909292221
#> comp 2  3.539974286840406936249792124726809561
#> comp 3  2.571830487017634059299098225892521441
#> comp 4  2.178718010262271409516188214183785021
#> comp 5  1.921484903732008087118288131023291498
#> comp 6  1.764573272713494844765591551549732685
#> comp 7  1.385259592335363887372068347758613527
#> comp 8  1.323641186321642848611190856900066137
#> comp 9  1.248792161663992539288869920710567385
#> comp 10 1.102685266225354565605698553554248065
#> comp 11 0.998045461858878213412538116244832054
#> comp 12 0.792737268365448222162683578062569723
#> comp 13 0.755854373349999386633157882897648960
#> comp 14 0.666237667336235350745710093178786337
#> comp 15 0.634020318960932582896816711581777781
#> comp 16 0.585897433631494712891196741111343727
#> comp 17 0.521782770639018345093518291832879186
#> comp 18 0.455754228971840480433286302286433056
#> comp 19 0.426882419754785302767885468711028807
#> comp 20 0.397226589530229823310492065502330661
#> comp 21 0.326187914240269494214885526162106544
#> comp 22 0.320953099687115550597837909663212486
#> comp 23 0.304815649070375838114443922677310184
#> comp 24 0.257060376717304195359758978156605735
#> comp 25 0.224150007665827322167473312219954096
#> comp 26 0.188605600325726707744422583346022293
#> comp 27 0.144673538992790395862897412371239625
#> comp 28 0.087832255380978246916967577817558777
#> comp 29 0.069143282958821272732308216291130520
#> comp 30 0.058343464631910930962011008205081453
#> comp 31 0.000000000000000000000000000003768007
#>                         percentage of variance
#> comp 1  18.53818422844467761478881584480404854
#> comp 2  11.41927189303357081939793715719133615
#> comp 3   8.29622737747623872905933239962905645
#> comp 4   7.02812261374926272594620968447998166
#> comp 5   6.19833839913551010170067456783726811
#> comp 6   5.69217184746288662466895402758382261
#> comp 7   4.46857933011407748580268162186257541
#> comp 8   4.26981027845691230027114215772598982
#> comp 9   4.02836181181933117301241509267129004
#> comp 10  3.55704924588824056286284758243709803
#> comp 11  3.21950148986734907552431650401558727
#> comp 12  2.55721699472725250146254438732285053
#> comp 13  2.43823991403225637242258017067797482
#> comp 14  2.14915376560075932488302896672394127
#> comp 15  2.04522683535784732811180219869129360
#> comp 16  1.88999172139191840003036304551642388
#> comp 17  1.68317022786780112753035609785001725
#> comp 18  1.47017493216722749949099124933127314
#> comp 19  1.37704006372511389422186312003759667
#> comp 20  1.28137609525880602490133242099545896
#> comp 21  1.05221907819441762299561560212168843
#> comp 22  1.03533257963585656469263085455168039
#> comp 23  0.98327628732379301901289636589353904
#> comp 24  0.82922702166872319651247380534186959
#> comp 25  0.72306454085750748728145254062837921
#> comp 26  0.60840516234105390669384405555319972
#> comp 27  0.46668883546061412648242594514158554
#> comp 28  0.28332985606767174813214182904630434
#> comp 29  0.22304284825426218263899613702960778
#> comp 30  0.18820472461906753713911655268020695
#> comp 31  0.00000000000000000000000000001215486
#>         cumulative percentage of variance
#> comp 1                           18.53818
#> comp 2                           29.95746
#> comp 3                           38.25368
#> comp 4                           45.28181
#> comp 5                           51.48014
#> comp 6                           57.17232
#> comp 7                           61.64090
#> comp 8                           65.91071
#> comp 9                           69.93907
#> comp 10                          73.49612
#> comp 11                          76.71562
#> comp 12                          79.27284
#> comp 13                          81.71108
#> comp 14                          83.86023
#> comp 15                          85.90546
#> comp 16                          87.79545
#> comp 17                          89.47862
#> comp 18                          90.94879
#> comp 19                          92.32583
#> comp 20                          93.60721
#> comp 21                          94.65943
#> comp 22                          95.69476
#> comp 23                          96.67804
#> comp 24                          97.50726
#> comp 25                          98.23033
#> comp 26                          98.83873
#> comp 27                          99.30542
#> comp 28                          99.58875
#> comp 29                          99.81180
#> comp 30                         100.00000
#> comp 31                         100.00000
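  • Component 1 holds 18.5% of the information (variance) in our data
  • Component 30 holds 0.19%
  • Component 31 holds practically 0%, which is consistent with base_total being an exact sum of other features and adding no new information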
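A component with (almost) zero variance is what we would expect when one feature is an exact linear combination of others, as base_total is of the base stats. A small sketch with made-up data and base R's prcomp (a stand-in for FactoMineR's PCA) shows the same effect:

```r
set.seed(3)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy$total <- toy$a + toy$b + toy$c   # exact linear combination, like base_total

pca <- prcomp(toy, scale. = TRUE)
eigenvalues <- pca$sdev^2
print(signif(eigenvalues, 3))        # the last value is zero up to floating-point error
```

Four features go in, but only three components carry any variance.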

With this, we can choose how much information we want to retain when we reduce the dimension.

Keeping 13 components retains 81.7% of the variance in our data while dropping the other 18 components.

Keeping 18 components retains 90.9% of the variance in our data while dropping the other 13 components.
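The number of components for a given target can also be read off programmatically. The sketch below hard-codes the "% of var." row from the summary output so that it is self-contained; with the fitted object, the cumulative column of poke_pca$eig holds the same information. The helper name `n_components_for` is hypothetical.

```r
# "% of var." row from summary(poke_pca), Dim.1 to Dim.31
pct_var <- c(18.538, 11.419, 8.296, 7.028, 6.198, 5.692, 4.469, 4.270,
             4.028, 3.557, 3.220, 2.557, 2.438, 2.149, 2.045, 1.890,
             1.683, 1.470, 1.377, 1.281, 1.052, 1.035, 0.983, 0.829,
             0.723, 0.608, 0.467, 0.283, 0.223, 0.188, 0.000)

# Hypothetical helper: smallest number of components reaching a target share
n_components_for <- function(target_pct) which(cumsum(pct_var) >= target_pct)[1]

print(c(at_80 = n_components_for(80), at_90 = n_components_for(90)))
```

This gives 13 components for an 80% target and 18 for a 90% target, matching the figures read from the table.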

These new features can be used for supervised learning.

df_pca <- data.frame(poke_pca$ind$coord[, 1:13])
df_pca$cluster <- kmeans$cluster
df_pca

7 Clustering with PCA

fviz_cluster(kmeans, poclean2 %>% select(-cluster))

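The plot above does not show a clear separation. That is because it is drawn in only 2 dimensions, and the first two components hold only about 30% of the variance according to the PCA summary. I doubt that plotting it in 3D will help, because we have 31 dimensions in total.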

8 Conclusion

  • We can separate Pokemon into 4 clusters, with a between-cluster sum of squares of more than 87% of the total sum of squares, meaning that the observations are close to their own centroid and far from the centroids of other clusters. The distinction between clusters is very clear.

  • It is impossible to plot our clusters without overlap because the data has so many dimensions. Even when we reduce it to 13 components, we still have 10 more dimensions than a 3D plot can show.

  • While reducing the dimensions from 31 to 13 does not help with visualization, it will help in a supervised machine learning process: it might increase accuracy, and it definitely reduces the computational cost.
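For reference, the between-cluster share of the total sum of squares can be computed from any kmeans fit. The sketch below uses the built-in iris data (illustration only), so its number is not the 87% quoted for the Pokemon clusters.

```r
set.seed(42)
x <- scale(iris[, 1:4])                     # illustrative data only
km <- kmeans(x, centers = 3, nstart = 25)

# Share of the total variation that is explained by the cluster assignment
ratio <- km$betweenss / km$totss
print(round(100 * ratio, 1))
```

A high ratio is easier to reach when a few large-range features dominate the distances, so it is best read together with a check on scaled data.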