IS7036 Unsupervised Learning | Pokemon Clustering

1. Introduction

In this project we explore the well-known Pokemon dataset. The original Pokémon is a role-playing game built around assembling a small team of monsters to battle other monsters in a quest to become the best. Anyone who has played Pokemon will have noticed that Pokemon fall into certain tropes: Pikachu, Plusle, Pachirisu, and Dedenne are all cute Electric-type Pokemon; there are a variety of “big monster” Pokemon like Rhydon, Nidoking, Tyranitar, and Aggron; and there are many other classes one might notice.

The algorithmic process of finding unknown groupings in data is called clustering. Clustering is generally unsupervised: we do not know in advance what the clusters should be, although we might have opinions on what they should look like. In this project we try to identify groups of Pokemon with similar statistics and partition the full dataset into clusters.

The dataset covers 721 Pokemon; the file has 800 rows because alternate forms such as Mega Evolutions are listed separately. For each entry it records the number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. The important variables are as follows:

ID : ID for each pokemon
Name : Name of each pokemon
Type 1 : each Pokemon has a type, which determines its weakness/resistance to attacks
Type 2 : some Pokemon are dual type and have a second type
Total : sum of the six stats that follow; a general guide to how strong a Pokemon is
HP : hit points, or health; defines how much damage a Pokemon can withstand before fainting
Attack : the base modifier for normal attacks (e.g. Scratch, Punch)
Defense : the base damage resistance against normal attacks
SP Atk : special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
SP Def : the base damage resistance against special attacks
Speed : determines which pokemon attacks first each round

Through this project we aim to:

Analyse the different statistics of the Pokemon
Group the Pokemon based on important statistics using k-means clustering
Analyse the clusters through visualizations
Use a decision tree to verify whether a Pokemon falls into the right cluster
Identify the strongest and weakest Pokemon

Packages Required

library(ggplot2)
library(gplots)
library(ROCR)
library(dplyr)
library(tidyverse)
library(highcharter)
library(reshape2)
library(factoextra)
library(scales)
library(rpart)
library(rpart.plot)

Importing the Dataset

pokemon_data <- read.csv("/Users/sindhuherle/Documents/Data mining/Pokemon.csv")

2. Exploratory Data Analysis

Before analyzing, let us examine the dataset using the head() and str() functions.

head(pokemon_data,10)

##    X.                      Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1   1                 Bulbasaur  Grass Poison   318 45     49      49      65
## 2   2                   Ivysaur  Grass Poison   405 60     62      63      80
## 3   3                  Venusaur  Grass Poison   525 80     82      83     100
## 4   3     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5   4                Charmander   Fire          309 39     52      43      60
## 6   5                Charmeleon   Fire          405 58     64      58      80
## 7   6                 Charizard   Fire Flying   534 78     84      78     109
## 8   6 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 9   6 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 10  7                  Squirtle  Water          314 44     48      65      50
##    Sp..Def Speed Generation Legendary
## 1       65    45          1     False
## 2       80    60          1     False
## 3      100    80          1     False
## 4      120    80          1     False
## 5       50    65          1     False
## 6       65    80          1     False
## 7       85   100          1     False
## 8       85   100          1     False
## 9      115   100          1     False
## 10      64    43          1     False

str(pokemon_data)

## 'data.frame':    800 obs. of  13 variables:
##  $ X.        : int  1 2 3 3 4 5 6 6 6 7 ...
##  $ Name      : chr  "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
##  $ Type.1    : chr  "Grass" "Grass" "Grass" "Grass" ...
##  $ Type.2    : chr  "Poison" "Poison" "Poison" "Poison" ...
##  $ Total     : int  318 405 525 625 309 405 534 634 634 314 ...
##  $ HP        : int  45 60 80 80 39 58 78 78 78 44 ...
##  $ Attack    : int  49 62 82 100 52 64 84 130 104 48 ...
##  $ Defense   : int  49 63 83 123 43 58 78 111 78 65 ...
##  $ Sp..Atk   : int  65 80 100 122 60 80 109 130 159 50 ...
##  $ Sp..Def   : int  65 80 100 120 50 65 85 85 115 64 ...
##  $ Speed     : int  45 60 80 80 65 80 100 100 100 43 ...
##  $ Generation: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Legendary : chr  "False" "False" "False" "False" ...

Let us also check if there are any missing values in the dataset

colSums(is.na(pokemon_data))

##         X.       Name     Type.1     Type.2      Total         HP     Attack 
##          0          0          0          0          0          0          0 
##    Defense    Sp..Atk    Sp..Def      Speed Generation  Legendary 
##          0          0          0          0          0          0

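The dataset has no NA values. Note, however, that Type.2 is an empty string for single-type Pokemon (see Charmander in the head() output), and is.na() does not count an empty string as missing. Basic insights can be obtained by exploring the data through visualizations.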
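Blank strings can be counted separately from NA values. A minimal sketch, using a small made-up data frame (`toy` and its values are illustrative, not the real data):

```r
# Toy stand-in for the Pokemon data; values are illustrative only
toy <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Squirtle"),
  Type.2 = c("Poison", "", ""),
  stringsAsFactors = FALSE
)

# is.na() does not flag empty strings
na_count <- sum(is.na(toy$Type.2))

# Count the blanks explicitly
blank_count <- sum(toy$Type.2 == "")

# Optionally recode blanks to NA so that later NA checks see them
toy$Type.2[toy$Type.2 == ""] <- NA
recoded_na_count <- sum(is.na(toy$Type.2))
```

Whether to recode blanks as NA or keep them as a "no second type" category depends on how Type.2 is used later; for the clustering below it is not used at all.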

2.1 Data distributions

Distribution of Pokemon by Type 1 and Type 2

# One panel per Type 1; the bars within a panel count the Type 2 values
pokemon.plot2 <- ggplot(pokemon_data, aes(Type.2)) + 
    geom_bar(aes(fill = as.factor(Type.2))) +
    scale_fill_discrete(name = "Type 2") +
    labs(x = "Type 2", y = "Count", title = "Distr. of Type 1 and Type 2") +
    facet_wrap(~Type.1) +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pokemon.plot2

2.2 Boxplots

Box plots show the five-number summary of a set of data: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum. Before we move on to clustering, let's visualize the Pokemon stats HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed through boxplots.

boxplot(pokemon_data[6:11])

From the boxplot, we can see that all the variables have outliers, with HP and Defense having the most pronounced ones.
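The outliers drawn by boxplot() can also be counted numerically with boxplot.stats(), which uses the same 1.5 × IQR rule. A minimal sketch on a made-up data frame (`toy_stats` is hypothetical; on the real data the same call could be applied to pokemon_data[6:11]):

```r
# Toy stand-in for two stat columns; values are illustrative only
toy_stats <- data.frame(
  HP     = c(40, 45, 50, 55, 60, 65, 70, 255),  # one extreme value
  Attack = c(40, 45, 50, 55, 60, 65, 70, 75)    # no extreme values
)

# boxplot.stats()$out holds the points lying beyond the whiskers
outlier_counts <- sapply(toy_stats, function(x) length(boxplot.stats(x)$out))
outlier_counts
```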

2.3 Correlation heatmap

A correlation heatmap displays a correlation matrix as a grid of colored cells, one cell per pair of variables. A correlation coefficient can take any value from -1 to 1. The heatmap helps us find relationships between the variables.

## select only numeric columns
pokemon_numeric <- select_if(pokemon_data, is.numeric)   


## correlation matrix
cormat <- round(cor(pokemon_numeric),2)
head(cormat)

##           X. Total   HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X.      1.00  0.12 0.10   0.10    0.09    0.09    0.09  0.01       0.98
## Total   0.12  1.00 0.62   0.74    0.61    0.75    0.72  0.58       0.05
## HP      0.10  0.62 1.00   0.42    0.24    0.36    0.38  0.18       0.06
## Attack  0.10  0.74 0.42   1.00    0.44    0.40    0.26  0.38       0.05
## Defense 0.09  0.61 0.24   0.44    1.00    0.22    0.51  0.02       0.04
## Sp..Atk 0.09  0.75 0.36   0.40    0.22    1.00    0.51  0.47       0.04

# basic correlation plot
melted_cormat <- melt(cormat)
head(melted_cormat)

##      Var1 Var2 value
## 1      X.   X.  1.00
## 2   Total   X.  0.12
## 3      HP   X.  0.10
## 4  Attack   X.  0.10
## 5 Defense   X.  0.09
## 6 Sp..Atk   X.  0.09

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill = value)) +
  geom_tile()

### to plot the heatmap
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri

##            X. Total   HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X.          1  0.12 0.10   0.10    0.09    0.09    0.09  0.01       0.98
## Total      NA  1.00 0.62   0.74    0.61    0.75    0.72  0.58       0.05
## HP         NA    NA 1.00   0.42    0.24    0.36    0.38  0.18       0.06
## Attack     NA    NA   NA   1.00    0.44    0.40    0.26  0.38       0.05
## Defense    NA    NA   NA     NA    1.00    0.22    0.51  0.02       0.04
## Sp..Atk    NA    NA   NA     NA      NA    1.00    0.51  0.47       0.04
## Sp..Def    NA    NA   NA     NA      NA      NA    1.00  0.26       0.03
## Speed      NA    NA   NA     NA      NA      NA      NA  1.00      -0.02
## Generation NA    NA   NA     NA      NA      NA      NA    NA       1.00

# Melt the correlation matrix

melted_cormat <- melt(upper_tri, na.rm = TRUE)
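The melted upper triangle can then be drawn as a heatmap. The sketch below is one possible way to do it, not necessarily the plot the authors produced: it uses the built-in mtcars data as a stand-in, base R's as.table() in place of melt(), and a blue-white-red color scale chosen for illustration.

```r
# Built-in mtcars data stands in for the numeric Pokemon columns
cormat_demo <- round(cor(mtcars[, c("mpg", "hp", "wt", "qsec")]), 2)

# Keep the upper triangle (including the diagonal), blank out the rest
cormat_demo[lower.tri(cormat_demo)] <- NA

# Long format: one row per (Var1, Var2) pair, dropping the masked cells
long_demo <- na.omit(as.data.frame(as.table(cormat_demo)))
names(long_demo) <- c("Var1", "Var2", "value")

# Draw the heatmap only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long_demo, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         midpoint = 0, limits = c(-1, 1),
                         name = "Correlation") +
    theme_minimal()
  print(p)
}
```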

Correlation Insights

Total is the sum of the six stats, so it correlates strongly with each of them (0.58 to 0.75)
Speed is moderately correlated with Attack (0.38) and Special Attack (0.47)
Defense is almost uncorrelated with Speed (0.02)
Attack is positively correlated with all other variables

3. Clustering

For clustering we are choosing the following pokemon statistics

HP
Attack
Defense
Special Attack
Sp Defense
Speed

Before clustering, we need to determine the optimal number of clusters. K-means seeks to minimize the total within-cluster sum of squares, that is, the squared distances between the observations and the center of their cluster. To find the optimal number of clusters, we can use the elbow method or the silhouette method.
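The quantity behind the elbow method can be computed by hand with base R. The sketch below runs kmeans() for several values of k on synthetic data (three made-up blobs, not the Pokemon stats) and collects the total within-cluster sum of squares for each k:

```r
set.seed(42)

# Synthetic data: three well-separated 2-D blobs (illustrative only)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve flattens after k = 3, which matches the number of blobs generated.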

pokemon <- pokemon_data %>% select(6:11)

# The sums of squares here are in the millions, so use comma-separated labels
fviz_nbclust(pokemon, kmeans, method = "wss", k.max = 15) +
    scale_y_continuous(labels = comma) +
    labs(subtitle = "Elbow method")

fviz_nbclust(pokemon, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

Both methods suggest an optimal number of clusters somewhere between 2 and 4. Weighing these results, let's take k = 4, which groups the Pokemon into 4 clusters based on their statistics.
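The silhouette criterion can also be computed directly, using silhouette() from the cluster package. The sketch below is an illustration on synthetic blobs (not the Pokemon stats); it averages the silhouette width for each k and picks the k with the largest average:

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Synthetic data: three well-separated 2-D blobs (illustrative only)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..6 (undefined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
```

A higher average width means observations sit closer to their own cluster than to the nearest other cluster.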

# Select number of clusters
k <- 4

# Build model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# View the resulting model
km.pokemon

## K-means clustering with 4 clusters of sizes 288, 283, 114, 115
## 
## Cluster means:
##         HP    Attack   Defense   Sp..Atk  Sp..Def     Speed
## 1 79.18056  81.31944  69.19097  82.01042 77.53125  80.10417
## 2 50.29682  54.03180  51.62898  47.90459 49.15548  49.74912
## 3 89.20175 121.09649  92.73684 120.45614 97.67544 100.44737
## 4 71.30435  92.91304 121.42609  63.89565 88.23478  52.36522
## 
## Clustering vector:
##   [1] 2 1 1 3 2 1 1 3 3 2 1 1 3 2 2 1 2 2 1 1 2 2 1 3 2 1 2 1 2 1 2 1 2 4 2 2 1
##  [38] 2 2 1 2 1 2 1 2 1 2 1 2 1 1 2 4 2 1 2 1 2 1 2 1 2 1 2 3 2 2 1 2 1 1 3 2 1
##  [75] 4 2 1 1 2 1 2 4 4 1 1 2 4 4 2 1 2 2 1 2 1 2 1 2 4 2 1 1 3 4 2 1 2 4 2 1 2
## [112] 1 2 4 1 4 2 2 4 2 4 1 1 1 3 2 1 2 1 2 1 1 1 1 1 1 4 3 1 2 1 3 1 2 2 1 1 3
## [149] 1 2 4 2 4 1 3 1 3 3 3 2 1 3 3 3 3 3 2 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2 1 2
## [186] 1 2 2 2 2 1 2 1 2 2 1 3 4 2 1 4 1 2 2 1 2 2 1 1 2 4 1 4 1 1 1 2 1 1 2 4 1
## [223] 4 4 4 2 1 1 4 4 4 1 4 1 2 1 2 4 2 1 2 2 1 2 1 4 2 1 3 1 2 4 1 1 2 2 4 2 2
## [260] 2 1 1 3 3 4 2 1 3 3 3 3 3 2 1 1 3 2 1 3 3 2 1 1 3 2 1 2 1 2 2 1 2 2 2 2 1
## [297] 2 2 1 2 1 2 1 2 2 1 3 2 1 2 1 2 1 3 2 1 2 2 2 1 2 1 2 4 2 2 2 4 2 4 2 4 4
## [334] 4 2 1 1 2 1 3 1 1 1 1 1 2 1 2 1 3 1 1 2 1 3 4 2 1 2 2 2 1 2 1 2 1 3 1 1 1
## [371] 1 2 1 2 1 2 4 2 4 2 4 2 1 1 4 2 1 3 2 4 1 1 1 3 2 2 1 3 2 1 1 2 4 4 4 2 2
## [408] 4 3 3 2 4 4 3 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 2 4 4 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 1 2 1 2 1 4 4 2 4 4 4 1 2 4 1 2 1 2 1 2 1 1 2 1 2 1 3 1 1 2 1
## [482] 2 2 1 2 4 2 2 2 1 4 2 1 3 3 1 2 3 3 2 4 2 4 2 1 1 2 1 2 2 1 3 1 4 4 4 4 3
## [519] 3 1 1 4 1 4 1 1 1 3 4 4 1 1 1 1 1 1 1 4 3 3 3 3 3 3 3 3 4 1 3 3 3 3 3 3 2
## [556] 1 1 2 1 1 2 1 1 2 1 2 2 4 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 4 4 2 1 2 1 1 4
## [593] 2 4 4 2 1 1 4 1 2 4 1 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 4 2 4 1 2 4 4 4 1 3
## [630] 2 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 1 2 4 2 1 2 1 1 2 1 2 4 2 4 4 2 1 1
## [667] 2 1 2 2 1 2 1 3 2 1 1 2 1 1 2 1 4 2 4 2 4 4 2 1 2 4 1 4 2 1 3 2 1 3 3 3 3
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 4 4 2 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 2 1
## [741] 2 1 2 1 1 2 1 1 2 4 3 4 2 1 2 1 2 1 2 4 2 4 2 1 2 1 2 4 2 1 1 1 1 4 2 1 3
## [778] 1 2 1 2 2 2 2 4 4 4 4 2 4 2 1 3 3 3 4 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 871531.3 513476.6 408271.9 455965.7
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
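kmeans() starts from randomly chosen centers, so the cluster numbers can be permuted from one run to the next even when the groups themselves are the same. A minimal sketch of how fixing the random seed makes the labels repeatable; the seed value 123 is arbitrary and the built-in iris data is a stand-in for the Pokemon stats:

```r
toy <- as.matrix(iris[, 1:4])

# Same seed before each call gives the same random starts
set.seed(123)
run_a <- kmeans(toy, centers = 3, nstart = 20)

set.seed(123)
run_b <- kmeans(toy, centers = 3, nstart = 20)

# TRUE: both runs assign identical cluster labels
same_labels <- identical(run_a$cluster, run_b$cluster)
```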

From the cluster vector, we can see how the different rows have been categorized into the clusters. However, for ease of further analysis, we are now converting this cluster vector into a dataframe.

km.pokemon.table <- data.frame(km.pokemon$size, km.pokemon$centers)
km.pokemon.df <- data.frame(Cluster = km.pokemon$cluster, pokemon)
# head of df
head(km.pokemon.df)

##   Cluster HP Attack Defense Sp..Atk Sp..Def Speed
## 1       2 45     49      49      65      65    45
## 2       1 60     62      63      80      80    60
## 3       1 80     82      83     100     100    80
## 4       3 80    100     123     122     120    80
## 5       2 39     52      43      60      50    65
## 6       1 58     64      58      80      65    80

Before we move on to a visual analysis of the clusters, we check the quality of our partition. The quality of a k-means partition is measured by the percentage of the TSS “explained” by the partition, using the following formula:

(BSS/TSS) × 100 %

where BSS and TSS stand for Between Sum of Squares and Total Sum of Squares, respectively. The higher the percentage, the better the score (and thus the quality) because it means that BSS is large and/or WSS is small.

#Quality of partition
(BSS <- km.pokemon$betweenss)

## [1] 2039482

(TSS <- km.pokemon$totss)

## [1] 4288727

BSS / TSS * 100

## [1] 47.55448
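The total sum of squares splits into a between-cluster part and a within-cluster part, TSS = BSS + WSS, which is why a larger BSS goes hand in hand with a smaller WSS. As a quick arithmetic check, the sketch below re-enters the rounded values printed in the output above (they are copied by hand, so the identity holds only up to rounding):

```r
# Within-cluster sums of squares, copied from the km.pokemon printout
wss_by_cluster <- c(871531.3, 513476.6, 408271.9, 455965.7)

# Between-cluster and total sums of squares, as printed
bss <- 2039482
tss <- 4288727

# BSS + WSS should reproduce TSS up to rounding of the printed values
rounding_gap <- abs(bss + sum(wss_by_cluster) - tss)

# Percentage of the total explained by the partition
pct_explained <- bss / (bss + sum(wss_by_cluster)) * 100
```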

With k = 4, the partition explains 47.6% of the total sum of squares. This percentage generally rises as k increases, so lower values of k gave a lower percentage, as expected. Increasing k to 5 raised it to 51.2%, but we also noticed more overlap between the clusters, so we decided to keep k = 4.

We now visualize the clusters to better understand how the Pokemon have been grouped.

library(cluster)
library(fpc)
fviz_cluster(km.pokemon, data = pokemon, geom = "point")

If we plot the number of Pokemon in each cluster, we can see that most Pokemon fall into clusters 1 and 2, with cluster 1 being the largest.

#Count of clusters
ggplot(data = km.pokemon.df, aes(y = Cluster)) +
  geom_bar(fill = "lightblue") +
  ggtitle("Count of pokemons by cluster") +
  theme(plot.title = element_text(hjust = 0.5))

# Cluster comparison across the different variables
# Color by cluster number so that all 4 clusters get a color
pairs(pokemon, col = km.pokemon$cluster)

We are going to focus mainly on 2 variables - Attack and Speed for a comprehensive analysis of our clusters.

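# Treat Cluster as a category so each cluster gets a distinct color
ggplot(km.pokemon.df) + 
  geom_point(aes(x = Cluster, y = Attack, color = factor(Cluster))) +
  labs(color = "Cluster")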

From this, we can see that the Pokemon with the highest Attack values are in cluster 3, while cluster 2 has the lowest Attack values. Similarly, when we plot Speed, we can observe that cluster 3 has the Pokemon with the highest Speed values, while clusters 2 and 4 overlap substantially on this variable.

ggplot(km.pokemon.df) + 
  geom_point(aes(x = Cluster, y = Speed, color = factor(Cluster))) +
  labs(color = "Cluster")

ggplot(km.pokemon.df) +
  geom_point(aes(x = Cluster, y = HP, color = factor(Cluster))) +
  labs(color = "Cluster")

Examining the clusters, we conclude that cluster 3 contains the strongest Pokemon: it has the highest mean HP, Attack, Sp. Atk, Sp. Def, and Speed, and is second only to cluster 4 on Defense. Cluster 2 has the lowest mean on every attribute and so contains the weakest Pokemon.
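The ranking of clusters per attribute can be read directly off the cluster means printed by km.pokemon. The sketch below re-enters those means by hand (in a live session, km.pokemon$centers would hold the same matrix) and finds the leading and trailing cluster for each stat:

```r
# Cluster means copied from the km.pokemon printout
centers <- rbind(
  c(79.18056,  81.31944,  69.19097,  82.01042, 77.53125,  80.10417),
  c(50.29682,  54.03180,  51.62898,  47.90459, 49.15548,  49.74912),
  c(89.20175, 121.09649,  92.73684, 120.45614, 97.67544, 100.44737),
  c(71.30435,  92.91304, 121.42609,  63.89565, 88.23478,  52.36522)
)
colnames(centers) <- c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")
rownames(centers) <- 1:4

# Cluster with the highest / lowest mean for each stat
top_cluster    <- apply(centers, 2, function(x) unname(which.max(x)))
bottom_cluster <- apply(centers, 2, function(x) unname(which.min(x)))

top_cluster     # Defense is led by cluster 4, every other stat by cluster 3
bottom_cluster  # cluster 2 is lowest on every stat
```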

4. Decision Tree

We use a decision tree to predict which cluster a Pokémon belongs to, based on its stats.

Split the dataset into training and testing data (80:20)
Pick the cluster column for decision making
Build the model using the training dataset
Plot the decision tree using “rpart.plot”
Verify with the testing data whether a Pokémon falls into the right cluster by traversing the decision tree

trainingIndex <- sample(nrow(km.pokemon.df), nrow(km.pokemon.df) * 0.80)
Train <- km.pokemon.df[trainingIndex, ]
Test <- km.pokemon.df[-trainingIndex, ]

# train the model using training dataset
dectree <- rpart(Cluster~., data = Train, method = 'class')
prp(dectree, type=3, main= "Probabilities per class")

We check this decision tree against the test data. That is, we pick a Pokemon from the test data below.

head(Test,5)

##    Cluster HP Attack Defense Sp..Atk Sp..Def Speed
## 1        2 45     49      49      65      65    45
## 3        1 80     82      83     100     100    80
## 7        1 78     84      78     109      85   100
## 12       1 79     83     100      85     105    78
## 14       2 45     30      35      20      20    45

From the above data, let's pick Pokemon 1 and traverse the tree to see if it falls into the right cluster.

First node - Sp..Def >= 58: take the right branch
Second node - Defense < 90: take the left branch
Third node - Attack < 56: take the left branch
Fourth node - Speed < 76: take the left branch, which ends at a leaf

The decision tree assigns Pokemon 1 to cluster 2, which matches the cluster recorded in the test data above.
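Checking one Pokemon by hand can be extended to the whole test set with predict() and a confusion matrix. The sketch below shows the idea on the built-in iris data, with Species playing the role of Cluster; the names `train`, `test`, `fit`, and the seed are illustrative and not taken from the analysis above:

```r
library(rpart)

set.seed(2024)

# iris stands in for km.pokemon.df; Species plays the role of Cluster
idx   <- sample(nrow(iris), size = floor(0.8 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a classification tree and predict the class of every test row
fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Rows: actual class, columns: predicted class
confusion <- table(Actual = test$Species, Predicted = pred)

# Share of test rows whose predicted class matches the actual one
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Applied to the Pokemon data, the same pattern would use Cluster (ideally converted to a factor) as the response and Test as newdata.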

5. Conclusion

From our analysis of the dataset through EDA, clustering, and a decision tree, we come to a conclusion on the strongest and weakest Pokemon.

Strongest Pokemon

Weakest Pokemon