We are going to perform cluster analysis on a Pokémon dataset, and we will also see whether dimensionality reduction is possible on this data.
Importing libraries
library(dplyr)
library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(plotly)
options(scipen = 123)
Import dataset
pokemon <- read.csv('pokemon.csv')
str(pokemon)
#> 'data.frame': 801 obs. of 41 variables:
#> $ abilities : chr "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
#> $ against_bug : num 1 1 1 0.5 0.5 0.25 1 1 1 1 ...
#> $ against_dark : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ against_dragon : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ against_electric : num 0.5 0.5 0.5 1 1 2 2 2 2 1 ...
#> $ against_fairy : num 0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
#> $ against_fight : num 0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
#> $ against_fire : num 2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
#> $ against_flying : num 2 2 2 1 1 1 1 1 1 2 ...
#> $ against_ghost : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ against_grass : num 0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
#> $ against_ground : num 1 1 1 2 2 0 1 1 1 0.5 ...
#> $ against_ice : num 2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
#> $ against_normal : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ against_poison : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ against_psychic : num 2 2 2 1 1 1 1 1 1 1 ...
#> $ against_rock : num 1 1 1 2 2 4 1 1 1 2 ...
#> $ against_steel : num 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
#> $ against_water : num 0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
#> $ attack : int 49 62 100 52 64 104 48 63 103 30 ...
#> $ base_egg_steps : int 5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
#> $ base_happiness : int 70 70 70 70 70 70 70 70 70 70 ...
#> $ base_total : int 318 405 625 309 405 634 314 405 630 195 ...
#> $ capture_rate : chr "45" "45" "45" "45" ...
#> $ classfication : chr "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
#> $ defense : int 49 63 123 43 58 78 65 80 120 35 ...
#> $ experience_growth: int 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
#> $ height_m : num 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
#> $ hp : int 45 60 80 39 58 78 44 59 79 45 ...
#> $ japanese_name : chr "Fushigidaneフシギダネ" "Fushigisouフシギソウ" "Fushigibanaフシギバナ" "Hitokageヒトカゲ" ...
#> $ name : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
#> $ percentage_male : num 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
#> $ pokedex_number : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ sp_attack : int 65 80 122 60 80 159 50 65 135 20 ...
#> $ sp_defense : int 65 80 120 50 65 115 64 80 115 20 ...
#> $ speed : int 45 60 80 65 80 100 43 58 78 45 ...
#> $ type1 : chr "grass" "grass" "grass" "fire" ...
#> $ type2 : chr "poison" "poison" "poison" "" ...
#> $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
#> $ generation : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ is_legendary : int 0 0 0 0 0 0 0 0 0 0 ...
Data Dictionary:
Assign the right data type to each feature:
- ‘is_legendary’ : bool
- ‘capture_rate’ : numeric
- ‘generation’ : categorical
- ‘type1’ & ‘type2’ : categorical
poke <- pokemon %>% mutate(is_legendary = as.factor(if_else(is_legendary == 1, "yes", "no")),
capture_rate = if_else(capture_rate == "30 (Meteorite)255 (Core)", 30, as.numeric(capture_rate)),
generation = as.factor(generation),
type1 = as.factor(type1),
type2 = as.factor(type2))
Check Missing Value
colSums(is.na(poke))
#> abilities against_bug against_dark against_dragon
#> 0 0 0 0
#> against_electric against_fairy against_fight against_fire
#> 0 0 0 0
#> against_flying against_ghost against_grass against_ground
#> 0 0 0 0
#> against_ice against_normal against_poison against_psychic
#> 0 0 0 0
#> against_rock against_steel against_water attack
#> 0 0 0 0
#> base_egg_steps base_happiness base_total capture_rate
#> 0 0 0 0
#> classfication defense experience_growth height_m
#> 0 0 0 20
#> hp japanese_name name percentage_male
#> 0 0 0 98
#> pokedex_number sp_attack sp_defense speed
#> 0 0 0 0
#> type1 type2 weight_kg generation
#> 0 0 20 0
#> is_legendary
#> 0
Dropping features
We will drop the ‘percentage_male’ feature because it has a high number of missing values, along with other unwanted features. We keep all numeric features plus each Pokémon’s name and legendary status, then separate the numeric columns for the cluster and PCA analysis.
poclean <- poke %>% select_if(~is.numeric(.)) %>%
select(-c(pokedex_number, percentage_male)) %>%
mutate(legendary = poke$is_legendary, name = pokemon$name, type1=poke$type1, type2=poke$type2) %>% na.omit()
poclean2 <- poclean %>% select(-c(legendary, name, type1, type2))
colSums(is.na(poclean))
#> against_bug against_dark against_dragon against_electric
#> 0 0 0 0
#> against_fairy against_fight against_fire against_flying
#> 0 0 0 0
#> against_ghost against_grass against_ground against_ice
#> 0 0 0 0
#> against_normal against_poison against_psychic against_rock
#> 0 0 0 0
#> against_steel against_water attack base_egg_steps
#> 0 0 0 0
#> base_happiness base_total capture_rate defense
#> 0 0 0 0
#> experience_growth height_m hp sp_attack
#> 0 0 0 0
#> sp_defense speed weight_kg legendary
#> 0 0 0 0
#> name type1 type2
#> 0 0 0
Let’s plot Pokémon legendary status against some other attributes.
ggplot(poclean, aes(base_egg_steps, base_total, color = legendary, size = experience_growth)) +
geom_point(alpha = 0.5) + theme_minimal()
From the plot above we can see a clear distinction: legendary Pokémon tend to have a high ‘base_total’, the combined value of their stats. The number of steps needed to hatch their eggs also separates legendary from non-legendary Pokémon clearly, and the experience they need to grow is also high.
Let’s take another look: non-legendary Pokémon types vs. base_total.
levels(poclean$type1)
#> [1] "bug" "dark" "dragon" "electric" "fairy" "fighting"
#> [7] "fire" "flying" "ghost" "grass" "ground" "ice"
#> [13] "normal" "poison" "psychic" "rock" "steel" "water"
levels(poclean$type2)
#> [1] "" "bug" "dark" "dragon" "electric" "fairy"
#> [7] "fighting" "fire" "flying" "ghost" "grass" "ground"
#> [13] "ice" "normal" "poison" "psychic" "rock" "steel"
#> [19] "water"
You might see an empty string in type2. That is not a missing value: it indicates that the particular Pokémon has no secondary type, not a failed imputation.
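A quick, illustrative check confirms this (the empty strings are real values, which is why colSums(is.na(.)) reported 0 for type2 above):
# Count Pokémon without a secondary type; "" is a legitimate level, not NA
sum(poclean$type2 == "")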
poclean %>% filter(legendary =='no') %>% group_by(type1, type2) %>% summarise(base_avg = mean(base_total)) %>%
ggplot(aes(type1, type2, fill = base_avg)) + scale_fill_viridis_c(option = "B") +
geom_tile() +
theme_minimal()
The plot above shows that Pokémon with a single type and/or the normal type are generally weak (indicated by dark-coloured blocks, meaning low average base stats).
Lastly, let’s see what makes a legendary Pokémon ‘legendary’. Are they strong?
p1 <- ggplot(poclean, aes(legendary, attack, fill = legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Attack")
p2 <- ggplot(poclean, aes(legendary, defense, fill = legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Defense")
p3 <- ggplot(poclean, aes(legendary, speed, fill = legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Speed")
p4 <- ggplot(poclean, aes(legendary, hp, fill = legendary)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Health Point (HP)")
plot_grid(p1, p2, p3, p4)
One more plot just to be sure: are they really that good?
ggplot(poclean, aes(legendary, base_total, fill = legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Base Total")Okay, so the first plot, legendary pokemon shows better stats value compared to non legendary pokemon, yes there are non legendary pokemon that are comparable with legendary pokemon when we compare it only to single stats. But, if we compare it with total stats, it shows clear distinction between non legendary vs legendary pokemons.
We know that base_total is the combined value of their strength (attack + defense + hp + sp_attack + sp_defense + speed); that alone makes it clear that base_total is correlated with those features. Let’s see whether other features correlate with each other.
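As a quick, illustrative sanity check (not part of the original analysis), we can verify that base_total really is the sum of the six battle stats:
# Verify base_total = hp + attack + defense + sp_attack + sp_defense + speed
poke %>%
  mutate(stat_sum = hp + attack + defense + sp_attack + sp_defense + speed) %>%
  summarise(all_match = all(stat_sum == base_total))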
We already separated the numeric values into a different data frame, so let’s use that.
ggcorr(poclean2, hjust = 1)
As we suspected, the stats correlate strongly with each other. The against-type features also show correlation, which makes sense: if a certain Pokémon type is strong against one type, it is also weak against some other type. Based on this result we can conclude that the dimensionality of this dataset could be reduced using PCA.
First, we need to choose the optimal number of clusters. To answer ‘how many is optimal’, we could use business knowledge (or ask an expert): if you ask people who are into Pokémon ‘how many types of Pokémon are there?’ you will get an answer, but depending on how you phrase the question you will also get different answers. So rather than going around asking questions, we can use the Elbow Method.
Choosing the number of clusters is otherwise arbitrary; anyone could come up with their own number. With the Elbow Method we plot the ‘Within Sum of Squares’ (WSS) against the number of clusters, and choose the number at which increasing it further no longer shows a significant reduction in WSS.
What is the ‘within sum of squares’? WSS is the sum of the distances of every observation to its cluster centroid. And what is a centroid? The centroid is the centre point of a cluster. A high WSS value therefore means either that a cluster contains many observations or that those observations lie far from their centroid. Logically we want the lowest WSS possible, but we could, theoretically, pick the number of observations as the number of clusters: every observation would then be its own cluster with a WSS of 0, because each acts as its own centroid. That is why we look for an elbow rather than the minimum.
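As a minimal sketch of what the elbow plot computes for each candidate k (using the same poclean2 data; note that k-means is random, so results vary without a seed):
# Total within-cluster sum of squares for a given number of clusters k
wss_for_k <- function(data, k) {
  kmeans(data, centers = k)$tot.withinss
}
# sapply(2:8, wss_for_k, data = poclean2) traces the elbow curve by hand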
fviz_nbclust(poclean2, kmeans, method = "wss", k.max = 8)
The plot above shows that the optimal number of clusters is 4. As you can see, when we increase the number of clusters to 5 the WSS actually goes up: even with fewer observations per cluster, the distances of observations to their centroid increase. And when we increase it to 7 clusters, we don’t get a meaningful decrease in WSS.
poke_km <- kmeans(poclean2, centers = 4)  # named poke_km to avoid masking base kmeans()
We now have the clusters; let’s combine them with our data so we can analyze them.
poclean2$cluster <- as.factor(poke_km$cluster)
We can average the features we want to analyze, grouped by cluster.
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, base_egg_steps, experience_growth))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, height_m, weight_kg))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, attack, defense, speed, hp, sp_attack, sp_defense, base_total))
poclean2 %>% group_by(cluster) %>% summarise_if(is.numeric, 'mean') %>% select(c(cluster, (1:19)))
From the data above we can take a few key points, for example:
- Pokémon in cluster 2 are all-around stronger, bigger, and taller.
- Pokémon in cluster 4 are the weakest against the bug type.
With PCA, we can reduce the dimensionality of our data by combining the existing features into new ones. This helps us deal with highly correlated features. To put it simply: if we have two or more strongly correlated features, without PCA we would have to remove all but one of them and lose whatever information the removed features carried. What PCA does instead is find new axes that capture the highest variance in the data, resulting in new features.
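A toy illustration of the idea (not part of the original analysis): with two nearly duplicate features, almost all the variance lands on the first principal component, so one new axis carries nearly all the information.
set.seed(42)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1))  # b is almost a copy of a
summary(prcomp(toy, scale. = TRUE))  # PC1 should explain ~99% of the variance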
PCA is sensitive to the range of the features, so don’t forget to scale your data.
poke_pca <- PCA(poclean2 %>% select(-cluster), scale.unit = T, ncp = 31)
summary(poke_pca)
#>
#> Call:
#> PCA(X = poclean2 %>% select(-cluster), scale.unit = T, ncp = 31)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 5.747 3.540 2.572 2.179 1.921 1.765 1.385
#> % of var. 18.538 11.419 8.296 7.028 6.198 5.692 4.469
#> Cumulative % of var. 18.538 29.957 38.254 45.282 51.480 57.172 61.641
#> Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
#> Variance 1.324 1.249 1.103 0.998 0.793 0.756 0.666
#> % of var. 4.270 4.028 3.557 3.220 2.557 2.438 2.149
#> Cumulative % of var. 65.911 69.939 73.496 76.716 79.273 81.711 83.860
#> Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
#> Variance 0.634 0.586 0.522 0.456 0.427 0.397 0.326
#> % of var. 2.045 1.890 1.683 1.470 1.377 1.281 1.052
#> Cumulative % of var. 85.905 87.795 89.479 90.949 92.326 93.607 94.659
#> Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
#> Variance 0.321 0.305 0.257 0.224 0.189 0.145 0.088
#> % of var. 1.035 0.983 0.829 0.723 0.608 0.467 0.283
#> Cumulative % of var. 95.695 96.678 97.507 98.230 98.839 99.305 99.589
#> Dim.29 Dim.30 Dim.31
#> Variance 0.069 0.058 0.000
#> % of var. 0.223 0.188 0.000
#> Cumulative % of var. 99.812 100.000 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> 1 | 4.247 | -2.048 0.093 0.233 | 1.493 0.081 0.124 |
#> 2 | 3.850 | -0.926 0.019 0.058 | 1.735 0.109 0.203 |
#> 3 | 5.132 | 2.180 0.106 0.181 | 2.173 0.171 0.179 |
#> 4 | 4.058 | -1.563 0.054 0.148 | -1.664 0.100 0.168 |
#> 5 | 3.538 | -0.268 0.002 0.006 | -1.385 0.069 0.153 |
#> 6 | 6.684 | 2.348 0.123 0.123 | 1.093 0.043 0.027 |
#> 7 | 3.573 | -1.558 0.054 0.190 | -0.994 0.036 0.077 |
#> 8 | 3.013 | -0.316 0.002 0.011 | -0.736 0.020 0.060 |
#> 9 | 4.572 | 2.680 0.160 0.343 | -0.193 0.001 0.002 |
#> 10 | 5.183 | -4.493 0.450 0.751 | 0.813 0.024 0.025 |
#> Dim.3 ctr cos2
#> 1 -1.005 0.050 0.056 |
#> 2 -1.088 0.059 0.080 |
#> 3 -1.318 0.087 0.066 |
#> 4 -0.401 0.008 0.010 |
#> 5 -0.488 0.012 0.019 |
#> 6 -2.080 0.215 0.097 |
#> 7 0.228 0.003 0.004 |
#> 8 0.158 0.001 0.003 |
#> 9 -0.186 0.002 0.002 |
#> 10 -0.516 0.013 0.010 |
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
#> against_bug | -0.021 0.008 0.000 | 0.309 2.699 0.096 | 0.068 0.179
#> against_dark | 0.123 0.262 0.015 | -0.181 0.928 0.033 | -0.727 20.578
#> against_dragon | 0.170 0.500 0.029 | 0.247 1.717 0.061 | 0.200 1.558
#> against_electric | -0.074 0.097 0.006 | -0.025 0.017 0.001 | -0.118 0.541
#> against_fairy | 0.147 0.376 0.022 | 0.382 4.129 0.146 | 0.509 10.062
#> against_fight | 0.146 0.369 0.021 | -0.329 3.049 0.108 | 0.712 19.731
#> against_fire | -0.161 0.449 0.026 | 0.390 4.300 0.152 | -0.268 2.797
#> against_flying | -0.251 1.094 0.063 | 0.702 13.934 0.493 | -0.080 0.250
#> against_ghost | 0.152 0.403 0.023 | -0.116 0.381 0.013 | -0.755 22.185
#> against_grass | 0.078 0.106 0.006 | -0.491 6.821 0.241 | 0.291 3.302
#> cos2
#> against_bug 0.005 |
#> against_dark 0.529 |
#> against_dragon 0.040 |
#> against_electric 0.014 |
#> against_fairy 0.259 |
#> against_fight 0.507 |
#> against_fire 0.072 |
#> against_flying 0.006 |
#> against_ghost 0.571 |
#> against_grass 0.085 |
poke_pca$eig
#> eigenvalue
#> comp 1 5.746837110817850202693080063909292221
#> comp 2 3.539974286840406936249792124726809561
#> comp 3 2.571830487017634059299098225892521441
#> comp 4 2.178718010262271409516188214183785021
#> comp 5 1.921484903732008087118288131023291498
#> comp 6 1.764573272713494844765591551549732685
#> comp 7 1.385259592335363887372068347758613527
#> comp 8 1.323641186321642848611190856900066137
#> comp 9 1.248792161663992539288869920710567385
#> comp 10 1.102685266225354565605698553554248065
#> comp 11 0.998045461858878213412538116244832054
#> comp 12 0.792737268365448222162683578062569723
#> comp 13 0.755854373349999386633157882897648960
#> comp 14 0.666237667336235350745710093178786337
#> comp 15 0.634020318960932582896816711581777781
#> comp 16 0.585897433631494712891196741111343727
#> comp 17 0.521782770639018345093518291832879186
#> comp 18 0.455754228971840480433286302286433056
#> comp 19 0.426882419754785302767885468711028807
#> comp 20 0.397226589530229823310492065502330661
#> comp 21 0.326187914240269494214885526162106544
#> comp 22 0.320953099687115550597837909663212486
#> comp 23 0.304815649070375838114443922677310184
#> comp 24 0.257060376717304195359758978156605735
#> comp 25 0.224150007665827322167473312219954096
#> comp 26 0.188605600325726707744422583346022293
#> comp 27 0.144673538992790395862897412371239625
#> comp 28 0.087832255380978246916967577817558777
#> comp 29 0.069143282958821272732308216291130520
#> comp 30 0.058343464631910930962011008205081453
#> comp 31 0.000000000000000000000000000003768007
#> percentage of variance
#> comp 1 18.53818422844467761478881584480404854
#> comp 2 11.41927189303357081939793715719133615
#> comp 3 8.29622737747623872905933239962905645
#> comp 4 7.02812261374926272594620968447998166
#> comp 5 6.19833839913551010170067456783726811
#> comp 6 5.69217184746288662466895402758382261
#> comp 7 4.46857933011407748580268162186257541
#> comp 8 4.26981027845691230027114215772598982
#> comp 9 4.02836181181933117301241509267129004
#> comp 10 3.55704924588824056286284758243709803
#> comp 11 3.21950148986734907552431650401558727
#> comp 12 2.55721699472725250146254438732285053
#> comp 13 2.43823991403225637242258017067797482
#> comp 14 2.14915376560075932488302896672394127
#> comp 15 2.04522683535784732811180219869129360
#> comp 16 1.88999172139191840003036304551642388
#> comp 17 1.68317022786780112753035609785001725
#> comp 18 1.47017493216722749949099124933127314
#> comp 19 1.37704006372511389422186312003759667
#> comp 20 1.28137609525880602490133242099545896
#> comp 21 1.05221907819441762299561560212168843
#> comp 22 1.03533257963585656469263085455168039
#> comp 23 0.98327628732379301901289636589353904
#> comp 24 0.82922702166872319651247380534186959
#> comp 25 0.72306454085750748728145254062837921
#> comp 26 0.60840516234105390669384405555319972
#> comp 27 0.46668883546061412648242594514158554
#> comp 28 0.28332985606767174813214182904630434
#> comp 29 0.22304284825426218263899613702960778
#> comp 30 0.18820472461906753713911655268020695
#> comp 31 0.00000000000000000000000000001215486
#> cumulative percentage of variance
#> comp 1 18.53818
#> comp 2 29.95746
#> comp 3 38.25368
#> comp 4 45.28181
#> comp 5 51.48014
#> comp 6 57.17232
#> comp 7 61.64090
#> comp 8 65.91071
#> comp 9 69.93907
#> comp 10 73.49612
#> comp 11 76.71562
#> comp 12 79.27284
#> comp 13 81.71108
#> comp 14 83.86023
#> comp 15 85.90546
#> comp 16 87.79545
#> comp 17 89.47862
#> comp 18 90.94879
#> comp 19 92.32583
#> comp 20 93.60721
#> comp 21 94.65943
#> comp 22 95.69476
#> comp 23 96.67804
#> comp 24 97.50726
#> comp 25 98.23033
#> comp 26 98.83873
#> comp 27 99.30542
#> comp 28 99.58875
#> comp 29 99.81180
#> comp 30 100.00000
#> comp 31 100.00000
With this, we can choose how much information we want to retain when reducing the dimensionality.
Keeping 13 components retains 81.7% of the variance in our data while dropping the other 18 components.
Keeping 18 components retains about 91% of the variance while dropping the other 13 components.
These new features can be used for supervised learning.
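A small sketch of picking the component count programmatically rather than by eye, using the cumulative-variance column of poke_pca$eig shown above:
# Smallest number of components whose cumulative variance reaches 80%
cum_var <- poke_pca$eig[, "cumulative percentage of variance"]
which(cum_var >= 80)[1]  # 13, matching the choice below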
df_pca <- data.frame(poke_pca$ind$coord[, 1:13])
df_pca$cluster <- poke_km$cluster
df_pca
fviz_cluster(poke_km, poclean2 %>% select(-cluster))
The plot above doesn’t show a clear separation. That is because it is only plotted in 2 dimensions; I doubt plotting it in 3D would help either, because we have 31 dimensions in total.
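If you want to check that intuition, a quick sketch with plotly (loaded at the top but otherwise unused here) plots the first three components in 3D; the Dim.1–Dim.3 column names come from poke_pca$ind$coord:
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3,
        color = ~as.factor(cluster),
        type = "scatter3d", mode = "markers")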
We can separate the Pokémon into 4 different clusters, with the between-cluster sum of squares accounting for more than 87% of the total sum of squares, meaning the observations are close to their own centroid and far from the other clusters’ centroids: a very clear distinction between clusters.
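That figure comes straight out of the fitted k-means object (illustrative; the exact value varies with the random initialization):
poke_km$betweenss / poke_km$totss  # proportion of total variance explained by the clustering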
It is impossible to visually represent our clusters without overlap because the dimensionality is so high; even after reducing it to 13, we are still 10 dimensions beyond what we can plot in 3D.
While reducing the dimensionality from 31 to 13 does not help with visualization, it will help in a supervised machine learning process: it might increase accuracy, and it will definitely help with computational cost.
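As a closing sketch of that supervised step (hypothetical: it assumes the randomForest package is installed and uses the legendary label from poclean as the target; the rows of df_pca line up with poclean because both come from the same na.omit()-ed data):
library(randomForest)
# Use the 13 retained components as predictors for the legendary label
train_df <- cbind(df_pca[, 1:13], legendary = poclean$legendary)
rf <- randomForest(legendary ~ ., data = train_df)
rf  # prints the OOB confusion matrix for legendary vs non-legendary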