library(ggplot2)
library(ggforce)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We’re going to do K-means clustering and a PCA dimensionality reduction of characters from the Pokemon games. The data set is taken from here: https://data.world/data-society/pokemon-with-stats.
This data set covers 898 Pokemon (1072 including alternate forms) and includes their number, name, first and second type, stat total, basic stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), generation, and legendary status.
pokemon <- read.csv("pokemon.csv")
head(pokemon, 3)
## number name type1 type2 total hp attack defense sp_attack sp_defense
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100 100
## speed generation legendary
## 1 45 1 FALSE
## 2 60 1 FALSE
## 3 80 1 FALSE
colSums(is.na(pokemon))
## number name type1 type2 total hp attack
## 0 0 0 0 0 0 0
## defense sp_attack sp_defense speed generation legendary
## 0 0 0 0 0 0
There are no columns with missing data.
# keep only numeric columns and drop the ID column
df <- pokemon %>% select_if(is.numeric) %>% select(-number)
head(df)
## total hp attack defense sp_attack sp_defense speed generation
## 1 318 45 49 49 65 65 45 1
## 2 405 60 62 63 80 80 60 1
## 3 525 80 82 83 100 100 80 1
## 4 625 80 100 123 122 120 80 1
## 5 525 80 82 83 100 100 80 1
## 6 309 39 52 43 60 50 65 1
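A side note before choosing k: k-means is distance-based, so total (which runs into the hundreds) will dominate columns like generation in the distance calculation. The analysis below clusters the raw values, as in the original; for comparison, a minimal sketch of a standardized copy (not used below):
# standardize every column to zero mean and unit variance (not used below;
# the analysis that follows clusters the raw values)
df_scaled <- as.data.frame(scale(df))
head(df_scaled, 3)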
fviz_nbclust(df, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
The silhouette method graph shows that the optimal number of clusters is 2.
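For reference, the average silhouette width plotted above can also be computed directly; a minimal sketch for k = 2, assuming the cluster package is available:
library(cluster)
km2 <- kmeans(df, centers = 2, nstart = 20)
# mean silhouette width across all points
mean(silhouette(km2$cluster, dist(df))[, "sil_width"])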
fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + labs(subtitle = "elbow method")
It seems that any option from 2 to 4 is plausible. However, 4 clusters looks like a good compromise: it gives a reasonable number of clusters, and beyond that point there is no significant further decline in the total within-cluster sum of squares.
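The quantity plotted by the elbow method is tot.withinss from kmeans; a quick sketch to read the raw numbers, using a hypothetical helper wss_for_k:
# total within-cluster sum of squares for a given k
wss_for_k <- function(k) kmeans(df, centers = k, nstart = 20)$tot.withinss
sapply(2:5, wss_for_k)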
fviz_nbclust(df, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
The gap statistic shows that the optimal number of clusters is 5.
Two of the methods (elbow and gap statistic) show that the optimal number of clusters is between 4 and 5. I will pick 5 cluster centers to maximise the between_SS / total_SS ratio.
km <- kmeans(df, centers = 5, nstart = 20)
km
## K-means clustering with 5 clusters of sizes 201, 396, 297, 131, 47
##
## Cluster means:
## total hp attack defense sp_attack sp_defense speed
## 1 399.5025 65.46269 71.07463 68.71642 64.16418 66.28856 64.06965
## 2 495.8005 80.16919 91.77273 85.75505 81.26010 82.90404 73.93939
## 3 288.2626 47.69697 51.36700 49.31987 45.14478 46.67677 48.05724
## 4 589.0153 85.77863 110.52672 96.84733 106.85496 94.23664 94.77099
## 5 706.7447 111.78723 136.23404 111.91489 129.08511 113.46809 104.25532
## generation
## 1 3.915423
## 2 4.575758
## 3 4.080808
## 4 4.343511
## 5 4.765957
##
## Clustering vector:
## [1] 3 1 2 4 2 3 1 2 4 4 2 3 1 2 4 2 3 3 1 1 3 3 1 2 3 1 2 4 3 3 1 1 3 1 3 1 3
## [38] 3 2 2 3 3 2 2 3 1 2 3 1 2 3 2 3 3 2 2 3 1 3 2 3 1 2 3 1 3 2 3 3 1 1 3 3 3
## [75] 3 1 1 3 2 3 2 1 4 3 1 2 3 1 2 4 3 1 2 2 3 1 2 3 2 3 3 1 1 2 2 1 1 2 2 3 3
## [112] 2 2 4 3 2 1 1 3 2 3 2 3 3 2 2 3 2 3 1 2 4 2 1 3 2 3 2 2 3 2 3 2 2 3 1 1 2
## [149] 2 1 3 2 2 1 2 1 1 2 4 3 1 3 2 3 2 2 2 2 2 2 2 2 4 2 3 2 4 2 2 3 3 3 2 2 2
## [186] 1 1 2 1 2 2 4 2 2 4 4 4 4 4 4 3 1 4 5 5 5 4 3 1 2 3 1 2 3 1 2 3 1 3 1 3 1
## [223] 3 1 2 3 2 3 3 3 3 1 3 2 3 1 2 4 2 3 1 1 2 3 3 2 1 3 1 1 3 1 2 2 1 2 2 1 3
## [260] 1 2 3 2 1 1 2 4 3 2 1 2 4 2 2 4 1 3 2 3 1 3 2 1 1 3 2 3 2 2 3 2 4 2 3 2 2
## [297] 2 3 3 2 3 1 1 2 2 4 4 4 3 1 4 5 5 5 4 3 1 2 4 3 1 2 4 3 1 2 4 3 1 3 3 1 1
## [334] 3 3 1 3 1 3 3 2 3 3 2 3 1 3 1 3 3 2 4 3 1 3 2 3 1 5 3 2 3 3 1 2 3 2 3 1 3
## [371] 1 1 2 1 2 3 1 2 4 3 1 2 3 2 4 1 1 1 1 1 3 2 3 2 4 1 2 3 2 4 2 3 2 1 3 3 2
## [408] 3 2 3 2 4 2 2 1 1 3 2 3 2 3 2 1 2 1 2 3 2 1 1 3 2 4 3 2 2 1 2 4 3 3 2 4 3
## [445] 1 2 1 2 2 2 3 3 1 4 5 3 1 4 5 4 4 4 4 5 4 5 5 5 5 5 5 5 4 4 4 4 4 3 1 2 3
## [482] 1 2 3 1 2 3 3 2 3 1 3 1 3 1 2 3 2 1 2 1 2 3 1 1 1 1 3 2 1 3 2 3 2 3 2 2 1
## [519] 2 1 2 4 2 2 3 2 3 3 2 3 2 3 3 3 1 2 3 1 4 5 1 3 2 4 3 2 3 2 3 2 2 3 2 1 3
## [556] 2 4 2 2 2 2 2 4 4 4 2 2 2 2 2 2 2 4 2 2 2 1 2 2 2 2 2 4 4 4 5 5 4 5 5 5 4
## [593] 2 4 4 4 4 5 4 3 1 2 3 1 2 3 1 2 3 1 3 1 2 3 1 3 2 3 2 3 2 3 2 3 1 2 3 2 3
## [630] 1 2 3 1 3 2 1 2 3 1 2 3 1 2 2 2 3 1 2 3 1 2 3 2 3 2 2 3 1 2 3 3 2 2 2 2 2
## [667] 3 2 1 2 2 3 3 2 1 2 1 4 3 2 2 3 2 3 2 3 1 2 3 1 2 3 2 3 1 2 3 2 1 3 2 3 2
## [704] 3 2 2 3 2 3 2 3 1 2 3 1 2 3 2 3 1 2 3 1 2 3 2 2 3 2 2 2 1 2 2 3 2 3 2 2 1
## [741] 2 1 2 2 2 3 1 4 1 4 4 4 4 4 4 4 4 5 5 4 4 5 5 5 4 4 4 4 4 3 1 2 3 1 2 3 1
## [778] 2 4 3 1 3 1 2 3 3 1 1 2 3 1 4 1 2 1 2 2 1 2 2 3 2 2 2 3 2 3 2 3 2 3 2 3 2
## [815] 3 2 3 2 1 2 1 2 2 2 1 2 3 2 4 2 3 2 3 3 3 3 2 2 2 2 3 2 3 2 5 5 2 4 5 4 5
## [852] 4 5 4 3 1 2 3 1 2 3 1 2 3 1 2 3 1 3 1 2 3 2 2 2 2 2 3 2 3 2 2 2 3 4 3 2 1
## [889] 2 3 2 3 2 3 1 3 2 3 2 3 3 2 2 2 2 3 2 3 2 1 2 4 1 2 2 2 1 2 2 2 2 3 1 4 4
## [926] 4 4 4 3 1 5 5 4 4 4 4 4 4 4 4 5 5 5 4 4 1 4 4 4 4 3 4 4 3 1 2 2 3 1 2 2 3
## [963] 1 2 2 3 2 3 1 2 2 3 3 2 2 3 2 3 2 3 2 3 2 2 3 2 3 1 2 2 3 2 2 2 2 3 2 2 2
## [1000] 3 2 3 2 2 2 3 2 2 3 2 3 2 3 1 2 2 3 1 2 2 2 1 2 2 2 2 3 2 2 2 1 3 2 2 2 2
## [1037] 2 2 1 1 3 2 2 2 2 2 2 2 2 3 1 4 5 5 5 5 5 5 1 4 4 4 4 4 4 4 4 4 4 2 5 5
##
## Within cluster sum of squares by cluster:
## [1] 676054.4 1697939.5 950123.7 723298.7 456603.6
## (between_SS / total_SS = 79.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The ratio of the between-cluster sum of squares to the total sum of squares is 79.2%. This means that most of the total variation comes from the separation between clusters, so the data is clustered reasonably well. However, the cluster sizes are far from uniform, ranging from 47 to 396 items.
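The 79.2% figure quoted in the printout can be recomputed directly from the components of the km object:
# between-cluster sum of squares as a share of the total sum of squares
km$betweenss / km$totss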
pokemon_clustered <- pokemon %>% bind_cols(cluster = as.factor(km$cluster))
head(pokemon_clustered, 3)
## number name type1 type2 total hp attack defense sp_attack sp_defense
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100 100
## speed generation legendary cluster
## 1 45 1 FALSE 3
## 2 60 1 FALSE 1
## 3 80 1 FALSE 2
# geom_mark_hull() additionally requires the concaveman package to be installed
pokemon_clustered %>%
  ggplot(aes(total, generation, color = cluster)) +
  geom_point(alpha = 0.5) +
  geom_mark_hull() +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")
## Warning: The concaveman package is required for geom_mark_hull
It’s clear that, regardless of generation, the total field was the defining feature for assignment to a specific cluster.
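To make this explicit, a quick boxplot of total by cluster (a small sketch using pokemon_clustered from above) shows that the clusters are essentially bands of total:
pokemon_clustered %>%
  ggplot(aes(cluster, total, fill = cluster)) +
  geom_boxplot(show.legend = FALSE) +
  theme_minimal()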
pokemon_clustered %>% group_by(cluster) %>% summarise_if(is.numeric, "mean")
## # A tibble: 5 × 10
## cluster number total hp attack defense sp_attack sp_defense speed
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 381. 400. 65.5 71.1 68.7 64.2 66.3 64.1
## 2 2 475. 496. 80.2 91.8 85.8 81.3 82.9 73.9
## 3 3 407. 288. 47.7 51.4 49.3 45.1 46.7 48.1
## 4 4 496. 589. 85.8 111. 96.8 107. 94.2 94.8
## 5 5 566. 707. 112. 136. 112. 129. 113. 104.
## # … with 1 more variable: generation <dbl>
We can see that the cluster means differ mainly in overall strength: the clusters are ordered by mean total, from the weakest (cluster 3, mean total ≈ 288) to the strongest (cluster 5, mean total ≈ 707), while mean generation varies little across clusters.
pokemon_clustered %>% group_by(cluster, legendary) %>% summarise(n = n())
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: cluster [5]
## cluster legendary n
## <fct> <lgl> <int>
## 1 1 FALSE 199
## 2 1 TRUE 2
## 3 2 FALSE 393
## 4 2 TRUE 3
## 5 3 FALSE 295
## 6 3 TRUE 2
## 7 4 FALSE 62
## 8 4 TRUE 69
## 9 5 FALSE 5
## 10 5 TRUE 42
From this analysis we can see that legendary Pokemon are concentrated in the two strongest clusters: cluster 5 is mostly legendary (42 of 47), cluster 4 is about half legendary (69 of 131), while clusters 1 to 3 contain almost none.
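The same pattern is easier to read as shares; a minimal sketch, assuming legendary was parsed as a logical column:
# share of legendary Pokemon in each cluster
pokemon_clustered %>%
  group_by(cluster) %>%
  summarise(legendary_share = mean(legendary))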
head(df)
## total hp attack defense sp_attack sp_defense speed generation
## 1 318 45 49 49 65 65 45 1
## 2 405 60 62 63 80 80 60 1
## 3 525 80 82 83 100 100 80 1
## 4 625 80 100 123 122 120 80 1
## 5 525 80 82 83 100 100 80 1
## 6 309 39 52 43 60 50 65 1
pokemon_pca <- prcomp(df, center = TRUE, scale. = TRUE)
summary(pokemon_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9290 1.0602 0.9980 0.89156 0.81936 0.65945 0.50763
## Proportion of Variance 0.4652 0.1405 0.1245 0.09936 0.08392 0.05436 0.03221
## Cumulative Proportion 0.4652 0.6056 0.7301 0.82950 0.91342 0.96778 0.99999
## PC8
## Standard deviation 0.008368
## Proportion of Variance 0.000010
## Cumulative Proportion 1.000000
fviz_eig(pokemon_pca, ncp = 15, addlabels = TRUE, main = "Variance explained by each dimension")
The first two dimensions explain only about 61% of the variance, which is not enough for a faithful 2-D visualization. We can, however, safely reduce the 8-dimensional data by half (to 4 dimensions), which retains about 83% of the variance.
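The reduction itself is just a matter of keeping the first four columns of the score matrix returned by prcomp; a minimal sketch:
# project the data onto the first four principal components
df_reduced <- as.data.frame(pokemon_pca$x[, 1:4])
head(df_reduced, 3)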
fviz_pca_var(pokemon_pca, select.var = list(contrib = 31), col.var = "contrib", gradient.cols = c("green", "white", "red"), repel = TRUE)
Variables located well inside the circle in the graph above tell us that we would need more than two components to represent our data correctly: we cannot faithfully visualize this data in a 2-dimensional graph.
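Since ggbiplot is loaded at the top but never used, here is a hedged sketch of what a 2-D biplot colored by the k-means clusters would look like, keeping the caveat above in mind (only about 61% of the variance is shown):
ggbiplot(pokemon_pca, groups = as.factor(km$cluster), ellipse = TRUE, alpha = 0.3) +
  theme_minimal()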