Load necessary libraries

library(ggplot2)
library(ggforce)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Introduction

We’re going to perform K-Means clustering and PCA dimensionality reduction on characters from the Pokemon games. The data set is taken from https://data.world/data-society/pokemon-with-stats.

This data set covers 898 Pokemon (1072 rows including alternate forms). For each one it records the number, name, first and second type, the stat total, the six base stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), the generation, and legendary status.

pokemon <- read.csv("pokemon.csv")
head(pokemon, 3)
##   number      name type1  type2 total hp attack defense sp_attack sp_defense
## 1      1 Bulbasaur Grass Poison   318 45     49      49        65         65
## 2      2   Ivysaur Grass Poison   405 60     62      63        80         80
## 3      3  Venusaur Grass Poison   525 80     82      83       100        100
##   speed generation legendary
## 1    45          1     FALSE
## 2    60          1     FALSE
## 3    80          1     FALSE

Data set structure:

  • name: The name of each Pokemon
  • type1: Each Pokemon has a type, which determines its weakness/resistance to attacks
  • type2: Some Pokemon are dual type and have a second one
  • total: Sum of all the stats that follow, a general guide to how strong a Pokemon is
  • hp: Hit points, or health; defines how much damage a Pokemon can withstand before fainting
  • attack: The base modifier for normal attacks (e.g. Scratch, Punch)
  • sp_attack: Special attack, the base modifier for special attacks (e.g. Fire Blast, Bubble Beam)
  • defense: The base damage resistance against normal attacks
  • sp_defense: Special defense, the base damage resistance against special attacks
  • speed: Determines which Pokemon attacks first each round
  • generation: The generation of games in which the Pokemon was first introduced
  • legendary: Some Pokemon are much rarer than others and are dubbed “legendary”

Checking for columns with missing data

colSums(is.na(pokemon))
##     number       name      type1      type2      total         hp     attack 
##          0          0          0          0          0          0          0 
##    defense  sp_attack sp_defense      speed generation  legendary 
##          0          0          0          0          0          0

No columns contain missing data.

Creating a data frame with numeric data for further clustering

df <- pokemon %>% select_if(~is.numeric(.)) %>% select(-c(number))
head(df)
##   total hp attack defense sp_attack sp_defense speed generation
## 1   318 45     49      49        65         65    45          1
## 2   405 60     62      63        80         80    60          1
## 3   525 80     82      83       100        100    80          1
## 4   625 80    100     123       122        120    80          1
## 5   525 80     82      83       100        100    80          1
## 6   309 39     52      43        60         50    65          1

Part 1. Clustering

Applying the Silhouette method to find the optimal number of clusters

fviz_nbclust(df, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

The silhouette plot shows that the optimal number of clusters is 2.
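
As a cross-check, the average silhouette width can be computed directly with the cluster package (which factoextra relies on); a minimal sketch:

library(cluster)
# Average silhouette width of a k-means solution with k clusters
avg_sil <- function(k) {
  km <- kmeans(df, centers = k, nstart = 20)
  mean(silhouette(km$cluster, dist(df))[, "sil_width"])
}
sapply(2:6, avg_sil)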

Applying the Elbow method to find the optimal number of clusters

fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + labs(subtitle = "Elbow method")

It seems that anything from 2 to 4 clusters is plausible. However, 4 clusters looks like a good compromise: it gives a reasonable number of clusters, and beyond that point there is no significant decline in the total within-cluster sum of squares.
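
For reference, the elbow curve can also be built by hand; a minimal sketch of what fviz_nbclust plots for this method:

# Total within-cluster sum of squares for k = 1..15
wss <- sapply(1:15, function(k) kmeans(df, centers = k, nstart = 20)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")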

Applying the Gap statistic to find the optimal number of clusters

fviz_nbclust(df, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

The Gap statistic shows that the optimal number of clusters is 5.
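
The convergence warnings come from kmeans stopping at its default limit of 10 iterations. A sketch that raises the limit by calling cluster::clusGap directly (to my understanding, this is what fviz_nbclust wraps for this method):

library(cluster)
# Pass a kmeans wrapper with a higher iter.max to avoid the convergence warnings
gap <- clusGap(df, FUNcluster = function(x, k) kmeans(x, k, nstart = 20, iter.max = 50), K.max = 15, B = 50)
fviz_gap_stat(gap)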

Two of the three methods (elbow and gap statistic) point to an optimal number of clusters between 4 and 5. I pick 5 cluster centers to maximise the between_SS / total_SS ratio (see the comparison sketch below).
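
A minimal sketch comparing the between_SS / total_SS ratio for both candidate values of k (the exact numbers vary slightly between runs because the k-means starts are random):

sapply(4:5, function(k) {
  km <- kmeans(df, centers = k, nstart = 20)
  km$betweenss / km$totss
})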

Assigning 5 clusters by K-Means
km <- kmeans(df, centers = 5, nstart = 20)
km
## K-means clustering with 5 clusters of sizes 201, 396, 297, 131, 47
## 
## Cluster means:
##      total        hp    attack   defense sp_attack sp_defense     speed
## 1 399.5025  65.46269  71.07463  68.71642  64.16418   66.28856  64.06965
## 2 495.8005  80.16919  91.77273  85.75505  81.26010   82.90404  73.93939
## 3 288.2626  47.69697  51.36700  49.31987  45.14478   46.67677  48.05724
## 4 589.0153  85.77863 110.52672  96.84733 106.85496   94.23664  94.77099
## 5 706.7447 111.78723 136.23404 111.91489 129.08511  113.46809 104.25532
##   generation
## 1   3.915423
## 2   4.575758
## 3   4.080808
## 4   4.343511
## 5   4.765957
## 
## Clustering vector:
##    [1] 3 1 2 4 2 3 1 2 4 4 2 3 1 2 4 2 3 3 1 1 3 3 1 2 3 1 2 4 3 3 1 1 3 1 3 1 3
##   [38] 3 2 2 3 3 2 2 3 1 2 3 1 2 3 2 3 3 2 2 3 1 3 2 3 1 2 3 1 3 2 3 3 1 1 3 3 3
##   [75] 3 1 1 3 2 3 2 1 4 3 1 2 3 1 2 4 3 1 2 2 3 1 2 3 2 3 3 1 1 2 2 1 1 2 2 3 3
##  [112] 2 2 4 3 2 1 1 3 2 3 2 3 3 2 2 3 2 3 1 2 4 2 1 3 2 3 2 2 3 2 3 2 2 3 1 1 2
##  [149] 2 1 3 2 2 1 2 1 1 2 4 3 1 3 2 3 2 2 2 2 2 2 2 2 4 2 3 2 4 2 2 3 3 3 2 2 2
##  [186] 1 1 2 1 2 2 4 2 2 4 4 4 4 4 4 3 1 4 5 5 5 4 3 1 2 3 1 2 3 1 2 3 1 3 1 3 1
##  [223] 3 1 2 3 2 3 3 3 3 1 3 2 3 1 2 4 2 3 1 1 2 3 3 2 1 3 1 1 3 1 2 2 1 2 2 1 3
##  [260] 1 2 3 2 1 1 2 4 3 2 1 2 4 2 2 4 1 3 2 3 1 3 2 1 1 3 2 3 2 2 3 2 4 2 3 2 2
##  [297] 2 3 3 2 3 1 1 2 2 4 4 4 3 1 4 5 5 5 4 3 1 2 4 3 1 2 4 3 1 2 4 3 1 3 3 1 1
##  [334] 3 3 1 3 1 3 3 2 3 3 2 3 1 3 1 3 3 2 4 3 1 3 2 3 1 5 3 2 3 3 1 2 3 2 3 1 3
##  [371] 1 1 2 1 2 3 1 2 4 3 1 2 3 2 4 1 1 1 1 1 3 2 3 2 4 1 2 3 2 4 2 3 2 1 3 3 2
##  [408] 3 2 3 2 4 2 2 1 1 3 2 3 2 3 2 1 2 1 2 3 2 1 1 3 2 4 3 2 2 1 2 4 3 3 2 4 3
##  [445] 1 2 1 2 2 2 3 3 1 4 5 3 1 4 5 4 4 4 4 5 4 5 5 5 5 5 5 5 4 4 4 4 4 3 1 2 3
##  [482] 1 2 3 1 2 3 3 2 3 1 3 1 3 1 2 3 2 1 2 1 2 3 1 1 1 1 3 2 1 3 2 3 2 3 2 2 1
##  [519] 2 1 2 4 2 2 3 2 3 3 2 3 2 3 3 3 1 2 3 1 4 5 1 3 2 4 3 2 3 2 3 2 2 3 2 1 3
##  [556] 2 4 2 2 2 2 2 4 4 4 2 2 2 2 2 2 2 4 2 2 2 1 2 2 2 2 2 4 4 4 5 5 4 5 5 5 4
##  [593] 2 4 4 4 4 5 4 3 1 2 3 1 2 3 1 2 3 1 3 1 2 3 1 3 2 3 2 3 2 3 2 3 1 2 3 2 3
##  [630] 1 2 3 1 3 2 1 2 3 1 2 3 1 2 2 2 3 1 2 3 1 2 3 2 3 2 2 3 1 2 3 3 2 2 2 2 2
##  [667] 3 2 1 2 2 3 3 2 1 2 1 4 3 2 2 3 2 3 2 3 1 2 3 1 2 3 2 3 1 2 3 2 1 3 2 3 2
##  [704] 3 2 2 3 2 3 2 3 1 2 3 1 2 3 2 3 1 2 3 1 2 3 2 2 3 2 2 2 1 2 2 3 2 3 2 2 1
##  [741] 2 1 2 2 2 3 1 4 1 4 4 4 4 4 4 4 4 5 5 4 4 5 5 5 4 4 4 4 4 3 1 2 3 1 2 3 1
##  [778] 2 4 3 1 3 1 2 3 3 1 1 2 3 1 4 1 2 1 2 2 1 2 2 3 2 2 2 3 2 3 2 3 2 3 2 3 2
##  [815] 3 2 3 2 1 2 1 2 2 2 1 2 3 2 4 2 3 2 3 3 3 3 2 2 2 2 3 2 3 2 5 5 2 4 5 4 5
##  [852] 4 5 4 3 1 2 3 1 2 3 1 2 3 1 2 3 1 3 1 2 3 2 2 2 2 2 3 2 3 2 2 2 3 4 3 2 1
##  [889] 2 3 2 3 2 3 1 3 2 3 2 3 3 2 2 2 2 3 2 3 2 1 2 4 1 2 2 2 1 2 2 2 2 3 1 4 4
##  [926] 4 4 4 3 1 5 5 4 4 4 4 4 4 4 4 5 5 5 4 4 1 4 4 4 4 3 4 4 3 1 2 2 3 1 2 2 3
##  [963] 1 2 2 3 2 3 1 2 2 3 3 2 2 3 2 3 2 3 2 3 2 2 3 2 3 1 2 2 3 2 2 2 2 3 2 2 2
## [1000] 3 2 3 2 2 2 3 2 2 3 2 3 2 3 1 2 2 3 1 2 2 2 1 2 2 2 2 3 2 2 2 1 3 2 2 2 2
## [1037] 2 2 1 1 3 2 2 2 2 2 2 2 2 3 1 4 5 5 5 5 5 5 1 4 4 4 4 4 4 4 4 4 4 2 5 5
## 
## Within cluster sum of squares by cluster:
## [1]  676054.4 1697939.5  950123.7  723298.7  456603.6
##  (between_SS / total_SS =  79.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The ratio of the between-cluster sum of squares to the total sum of squares is 79.2%, i.e. most of the total sum of squares comes from the distances between clusters, so the data is reasonably well clustered. However, the cluster sizes are not uniform, ranging from 47 to 396 Pokemon.

Adding information about assigned clusters to the initial data set

pokemon_clustered <- pokemon %>% bind_cols(cluster = as.factor(km$cluster))
head(pokemon_clustered, 3)
##   number      name type1  type2 total hp attack defense sp_attack sp_defense
## 1      1 Bulbasaur Grass Poison   318 45     49      49        65         65
## 2      2   Ivysaur Grass Poison   405 60     62      63        80         80
## 3      3  Venusaur Grass Poison   525 80     82      83       100        100
##   speed generation legendary cluster
## 1    45          1     FALSE       3
## 2    60          1     FALSE       1
## 3    80          1     FALSE       2

Analyzing clusters

pokemon_clustered %>% ggplot(aes(total, generation, color = cluster)) + geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") + theme_minimal() + theme(legend.position = "top")
## Warning: The concaveman package is required for geom_mark_hull

It’s clear that, regardless of generation, the total stat was the defining feature for assignment to a specific cluster.
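
To make that separation explicit, a small sketch with the same ggplot2 setup, plotting the distribution of total within each cluster:

pokemon_clustered %>% ggplot(aes(cluster, total, fill = cluster)) + geom_boxplot(alpha = 0.5) + scale_fill_brewer(palette = "Set1") + theme_minimal() + theme(legend.position = "none")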

Checking cluster centroids:
pokemon_clustered %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") 
## # A tibble: 5 × 10
##   cluster number total    hp attack defense sp_attack sp_defense speed
##   <fct>    <dbl> <dbl> <dbl>  <dbl>   <dbl>     <dbl>      <dbl> <dbl>
## 1 1         381.  400.  65.5   71.1    68.7      64.2       66.3  64.1
## 2 2         475.  496.  80.2   91.8    85.8      81.3       82.9  73.9
## 3 3         407.  288.  47.7   51.4    49.3      45.1       46.7  48.1
## 4 4         496.  589.  85.8  111.     96.8     107.        94.2  94.8
## 5 5         566.  707. 112.   136.    112.      129.       113.  104. 
## # … with 1 more variable: generation <dbl>

We can see that:

  • Cluster 5 has the highest mean total and leads on every other stat, followed by Cluster 4.
  • Cluster 2 sits in the middle.
  • Cluster 1 and Cluster 3 are the bottom ones, with Cluster 3 having the worst mean score for every stat (the sketch below makes this ordering explicit).
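
A short sketch that sorts the clusters by their mean total:

pokemon_clustered %>% group_by(cluster) %>% summarise(n = n(), mean_total = mean(total)) %>% arrange(desc(mean_total))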

Analyzing clusters based on legendary status

pokemon_clustered %>% group_by(cluster, legendary) %>% summarise(n = n()) 
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   cluster [5]
##    cluster legendary     n
##    <fct>   <lgl>     <int>
##  1 1       FALSE       199
##  2 1       TRUE          2
##  3 2       FALSE       393
##  4 2       TRUE          3
##  5 3       FALSE       295
##  6 3       TRUE          2
##  7 4       FALSE        62
##  8 4       TRUE         69
##  9 5       FALSE         5
## 10 5       TRUE         42

From this analysis we can see that:

  • Legendary status separates the two strongest clusters: in Cluster 5, 42 of the 47 Pokemon are legendary, while in Cluster 4 it is 69 of 131. Clusters 1–3 contain only 7 legendary Pokemon between them (see the share computation below).
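
A minimal dplyr sketch that turns these counts into per-cluster legendary shares:

pokemon_clustered %>% count(cluster, legendary) %>% group_by(cluster) %>% mutate(share = n / sum(n)) %>% ungroup()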

Part 1 summary

  1. We defined 5 clusters, with more than 79% of the total sum of squares coming from the distances between clusters, which suggests the data is reasonably well clustered.
  2. Cluster sizes are not uniform.
  3. Clusters are defined by the stat totals and legendary status.

Part 2. Principal component analysis

Running PCA on the df data set
head(df)
##   total hp attack defense sp_attack sp_defense speed generation
## 1   318 45     49      49        65         65    45          1
## 2   405 60     62      63        80         80    60          1
## 3   525 80     82      83       100        100    80          1
## 4   625 80    100     123       122        120    80          1
## 5   525 80     82      83       100        100    80          1
## 6   309 39     52      43        60         50    65          1
pokemon_pca <- prcomp(df, center = TRUE, scale. = TRUE)

summary(pokemon_pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9290 1.0602 0.9980 0.89156 0.81936 0.65945 0.50763
## Proportion of Variance 0.4652 0.1405 0.1245 0.09936 0.08392 0.05436 0.03221
## Cumulative Proportion  0.4652 0.6056 0.7301 0.82950 0.91342 0.96778 0.99999
##                             PC8
## Standard deviation     0.008368
## Proportion of Variance 0.000010
## Cumulative Proportion  1.000000

Visualization of the variance percentage explained by each dimension
fviz_eig(pokemon_pca, ncp = 15, addlabels = TRUE, main = "Variance explained by each dimension")

The first two dimensions explain only about 60% of the variance, which is not enough to visualize the data faithfully. We can, however, safely reduce the 8-dimensional data by half (to 4 dimensions), which retains about 83% of the explained variance.
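
The reduction itself is a one-liner, since prcomp already stores the rotated scores in pokemon_pca$x; a minimal sketch keeping the first four principal components:

# Scores on the first four principal components (~83% of the variance)
df_reduced <- as.data.frame(pokemon_pca$x[, 1:4])
head(df_reduced, 3)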

Visualization of PCA in a variable factor map

fviz_pca_var(pokemon_pca, select.var = list(contrib = 31), col.var = "contrib", gradient.cols = c("green", "white", "red"), repel = TRUE)

Variables that fall well inside the circle in the graph above are poorly represented by the first two components, which tells us that we would need more than two components to represent our data correctly. We cannot faithfully visualize the data in a 2-dimensional graph.
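
To quantify how well each variable is represented, a short sketch using factoextra’s get_pca_var(); cos2 values close to 1 over the first two dimensions would indicate a variable that the 2-D map represents well:

# Quality of representation of each variable on the first two dimensions
var <- get_pca_var(pokemon_pca)
rowSums(var$cos2[, 1:2])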

Part 2 summary

  1. It’s possible to reduce the 8-dimensional data to 4 dimensions while retaining about 83% of the variance.
  2. Even after dimensionality reduction, two or three principal components are not enough to faithfully visualize the clustering of our data.