1. Introduction

In this project we explore the well-known Pokemon dataset. Pokémon is a role-playing game built around assembling a small team of monsters to battle other monsters in a quest to become the best. Anyone who has played Pokemon has noticed that Pokemon fall into certain tropes: Pikachu, Plusle, Pachirisu, and Dedenne are all cute Electric-type Pokemon; there are a variety of “big monster” Pokemon like Rhydon, Nidoking, Tyranitar, and Aggron; and there are many other classes one might notice.

The algorithmic process of finding unknown groupings in data is called clustering. Clustering is generally unsupervised: we do not know in advance what the clusters should be, although we might have opinions on what they should look like. In this project we try to identify groups of Pokemon with similar statistics and partition the data into clusters.

The dataset covers 721 Pokemon (800 rows once Mega Evolutions and other alternate forms are counted) and records their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. The important variables are as follows:

  • ID : ID for each Pokemon (column X. after import)
  • Name : name of each Pokemon
  • Type 1 : each Pokemon has a type, which determines its weakness/resistance to attacks
  • Type 2 : some Pokemon are dual-type and have a second type
  • Total : sum of all stats that come after this; a general guide to how strong a Pokemon is
  • HP : hit points, or health; defines how much damage a Pokemon can withstand before fainting
  • Attack : the base modifier for normal attacks (e.g. Scratch, Punch)
  • Defense : the base damage resistance against normal attacks
  • Sp. Atk : special attack, the base modifier for special attacks (e.g. Fire Blast, Bubble Beam)
  • Sp. Def : special defense, the base damage resistance against special attacks
  • Speed : determines which Pokemon attacks first each round
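
As a quick check of the Total definition, the sketch below rebuilds the first five rows of the head() output shown in section 2 as a small hand-typed data frame and confirms that Total equals the sum of the six base stats. The name `sample_rows` and the simplified column names are ours, for illustration only; on the full data the same comparison could be run on `pokemon_data` directly.

```r
# First five rows of the dataset, typed in by hand from the head() output
sample_rows <- data.frame(
  Name    = c("Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur", "Charmander"),
  Total   = c(318, 405, 525, 625, 309),
  HP      = c(45, 60, 80, 80, 39),
  Attack  = c(49, 62, 82, 100, 52),
  Defense = c(49, 63, 83, 123, 43),
  Sp.Atk  = c(65, 80, 100, 122, 60),
  Sp.Def  = c(65, 80, 100, 120, 50),
  Speed   = c(45, 60, 80, 80, 65)
)

stat_cols <- c("HP", "Attack", "Defense", "Sp.Atk", "Sp.Def", "Speed")

# Total should equal the row-wise sum of the six base stats
stat_sum <- rowSums(sample_rows[, stat_cols])
total_matches <- stat_sum == sample_rows$Total
all(total_matches)
```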

Through this project we aim to:

  • Analyse the different statistics of the Pokemon
  • Group the Pokemon by their key statistics using k-means clustering
  • Analyse the clusters through visualizations
  • Use a decision tree to verify whether each Pokemon falls into the right cluster

Packages Required

library(ggplot2)
library(gplots)
library(ROCR)
library(dplyr)
library(tidyverse)
library(highcharter)
library(reshape2)
library(factoextra)
library(scales)

Importing the Dataset

pokemon_data <- read.csv("/Users/sindhuherle/Documents/Data mining/Pokemon.csv")

2. Exploratory Data Analysis

Before analyzing, let us examine the dataset using the head() and str() functions.

head(pokemon_data,10)
##    X.                      Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1   1                 Bulbasaur  Grass Poison   318 45     49      49      65
## 2   2                   Ivysaur  Grass Poison   405 60     62      63      80
## 3   3                  Venusaur  Grass Poison   525 80     82      83     100
## 4   3     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5   4                Charmander   Fire          309 39     52      43      60
## 6   5                Charmeleon   Fire          405 58     64      58      80
## 7   6                 Charizard   Fire Flying   534 78     84      78     109
## 8   6 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 9   6 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 10  7                  Squirtle  Water          314 44     48      65      50
##    Sp..Def Speed Generation Legendary
## 1       65    45          1     False
## 2       80    60          1     False
## 3      100    80          1     False
## 4      120    80          1     False
## 5       50    65          1     False
## 6       65    80          1     False
## 7       85   100          1     False
## 8       85   100          1     False
## 9      115   100          1     False
## 10      64    43          1     False
str(pokemon_data)
## 'data.frame':    800 obs. of  13 variables:
##  $ X.        : int  1 2 3 3 4 5 6 6 6 7 ...
##  $ Name      : chr  "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
##  $ Type.1    : chr  "Grass" "Grass" "Grass" "Grass" ...
##  $ Type.2    : chr  "Poison" "Poison" "Poison" "Poison" ...
##  $ Total     : int  318 405 525 625 309 405 534 634 634 314 ...
##  $ HP        : int  45 60 80 80 39 58 78 78 78 44 ...
##  $ Attack    : int  49 62 82 100 52 64 84 130 104 48 ...
##  $ Defense   : int  49 63 83 123 43 58 78 111 78 65 ...
##  $ Sp..Atk   : int  65 80 100 122 60 80 109 130 159 50 ...
##  $ Sp..Def   : int  65 80 100 120 50 65 85 85 115 64 ...
##  $ Speed     : int  45 60 80 80 65 80 100 100 100 43 ...
##  $ Generation: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Legendary : chr  "False" "False" "False" "False" ...

Let us also check whether there are any missing values in the dataset.

colSums(is.na(pokemon_data))
##         X.       Name     Type.1     Type.2      Total         HP     Attack 
##          0          0          0          0          0          0          0 
##    Defense    Sp..Atk    Sp..Def      Speed Generation  Legendary 
##          0          0          0          0          0          0

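No column contains NA values, so the dataset looks clean in that respect. Note, however, that is.na() does not flag empty strings: Type.2 is blank for single-type Pokemon (see Charmander in the head() output above), so those rows have no second type even though they are not counted as missing. Basic insights can now be obtained by exploring the data through visualizations.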
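
Because is.na() only counts NA, a blank Type.2 goes unnoticed by the check above. A minimal sketch of how blanks could be counted and, if desired, recoded as NA, using a hand-typed excerpt of the head() output (`type_sample` is an illustrative name, not part of the original analysis):

```r
# Excerpt of the head() output: single-type Pokemon have an empty Type.2
type_sample <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Charmeleon", "Charizard", "Squirtle"),
  Type.1 = c("Grass", "Fire", "Fire", "Fire", "Water"),
  Type.2 = c("Poison", "", "", "Flying", ""),
  stringsAsFactors = FALSE
)

# is.na() sees no missing values, because "" is a valid string
na_count <- colSums(is.na(type_sample))

# Count empty strings per column instead
blank_count <- colSums(type_sample == "")

# Optionally recode blanks as NA so that later checks pick them up
type_sample$Type.2[type_sample$Type.2 == ""] <- NA
na_after <- sum(is.na(type_sample$Type.2))
```

Alternatively, passing na.strings = c("", "NA") to read.csv() would read the blanks as NA at import time.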

2.1 Data distributions

Distribution of Pokemon by Type 1 and Type 2

# x axis: Type 2 (one bar per type); each facet is one Type 1
pokemon.plot2 <- ggplot(pokemon_data, aes(Type.2)) + 
    geom_bar(aes(fill = as.factor(Type.2))) +
    scale_fill_discrete(name = "Type 2") +
    labs(x = "Type 2", y = "Count", title = "Distr. of Type 1 and Type 2") +
    facet_wrap(~Type.1) +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pokemon.plot2

2.2 Boxplots

Box plots show the five-number summary of a set of data: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum. In R's boxplot(), points lying more than 1.5 times the interquartile range beyond the box are drawn individually as outliers. Before we move on to clustering, let's visualize the Pokemon stats (HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed) through boxplots.

boxplot(pokemon_data[6:11])

From the boxplot, we can see that all the variables have outliers; HP and Defense stand out the most.
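
The outliers can also be read out numerically rather than from the plot. The sketch below uses base R's fivenum() and boxplot.stats(), which apply the same 1.5 x IQR rule that boxplot() uses by default. The vector `hp_example` is made-up illustrative data, not values from the dataset, and `count_outliers` is a hypothetical helper.

```r
# Made-up stat values with one obviously extreme point
hp_example <- c(35, 45, 50, 60, 65, 70, 80, 255)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(hp_example)

# Points that boxplot() would draw individually as outliers
outliers <- boxplot.stats(hp_example)$out

# Illustrative helper that counts outliers in every column of a data frame
count_outliers <- function(df) {
  sapply(df, function(col) length(boxplot.stats(col)$out))
}

outlier_counts <- count_outliers(data.frame(a = hp_example, b = 1:8))
```

On the real data, count_outliers(pokemon_data[6:11]) would give the per-stat counts behind the boxplot.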

2.3 Correlation heatmap

A correlation heatmap displays a correlation matrix as a grid of colored cells, one cell per pair of variables, with the color encoding the correlation coefficient. The coefficient ranges from -1 to 1, and the heatmap helps us see the relationships between the variables at a glance.

## select only numeric columns
pokemon_numeric <- select_if(pokemon_data, is.numeric)   


## correlation matrix
cormat <- round(cor(pokemon_numeric),2)
head(cormat)
##           X. Total   HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X.      1.00  0.12 0.10   0.10    0.09    0.09    0.09  0.01       0.98
## Total   0.12  1.00 0.62   0.74    0.61    0.75    0.72  0.58       0.05
## HP      0.10  0.62 1.00   0.42    0.24    0.36    0.38  0.18       0.06
## Attack  0.10  0.74 0.42   1.00    0.44    0.40    0.26  0.38       0.05
## Defense 0.09  0.61 0.24   0.44    1.00    0.22    0.51  0.02       0.04
## Sp..Atk 0.09  0.75 0.36   0.40    0.22    1.00    0.51  0.47       0.04
# basic correlation plot
melted_cormat <- melt(cormat)
head(melted_cormat)
##      Var1 Var2 value
## 1      X.   X.  1.00
## 2   Total   X.  0.12
## 3      HP   X.  0.10
## 4  Attack   X.  0.10
## 5 Defense   X.  0.09
## 6 Sp..Atk   X.  0.09
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill = value)) +
  geom_tile()

### to plot the heatmap
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri
##            X. Total   HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X.          1  0.12 0.10   0.10    0.09    0.09    0.09  0.01       0.98
## Total      NA  1.00 0.62   0.74    0.61    0.75    0.72  0.58       0.05
## HP         NA    NA 1.00   0.42    0.24    0.36    0.38  0.18       0.06
## Attack     NA    NA   NA   1.00    0.44    0.40    0.26  0.38       0.05
## Defense    NA    NA   NA     NA    1.00    0.22    0.51  0.02       0.04
## Sp..Atk    NA    NA   NA     NA      NA    1.00    0.51  0.47       0.04
## Sp..Def    NA    NA   NA     NA      NA      NA    1.00  0.26       0.03
## Speed      NA    NA   NA     NA      NA      NA      NA  1.00      -0.02
## Generation NA    NA   NA     NA      NA      NA      NA    NA       1.00
# Melt the correlation matrix

melted_cormat <- melt(upper_tri, na.rm = TRUE)
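
The melted upper triangle can then be drawn as a heatmap. The following is a sketch of one way to do that, not necessarily the plot the original report produced. It uses a small hand-typed correlation matrix (`cor_example`, with values copied from the output above) so that it runs on its own, and the colour choices are arbitrary.

```r
library(ggplot2)
library(reshape2)

# Small symmetric correlation matrix, values copied from the cormat output
cor_example <- matrix(
  c(1.00, 0.42, 0.24,
    0.42, 1.00, 0.44,
    0.24, 0.44, 1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("HP", "Attack", "Defense"), c("HP", "Attack", "Defense"))
)

# Keep the upper triangle (including the diagonal) and drop the rest
cor_example[lower.tri(cor_example)] <- NA
melted_example <- melt(cor_example, na.rm = TRUE)

# Diverging colour scale centred on 0 and fixed to the full [-1, 1] range
heatmap_plot <- ggplot(melted_example, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = value), size = 3) +
  coord_fixed() +
  labs(x = NULL, y = NULL) +
  theme_minimal()
heatmap_plot
```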

Correlation Insights

  • Total is the sum of the six base stats, so it correlates strongly with each of them (0.58 to 0.75)
  • Speed correlates most with Special Attack (0.47) and Attack (0.38), though both relationships are only moderate
  • Defense is essentially uncorrelated with Speed (0.02)
  • Attack is positively correlated with every other stat (0.26 to 0.44)

3. Clustering

For clustering we use the following Pokemon statistics:

  • HP
  • Attack
  • Defense
  • Special Attack
  • Special Defense
  • Speed

Before clustering, we first need to determine the optimal number of clusters. k-means seeks to minimize the total within-cluster sum of squares, that is, to make the observations in each cluster as close as possible to their cluster centre. To find the optimal number of clusters, we can use the elbow method or the silhouette method.
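
To make the quantity being minimized concrete, the sketch below computes the total within-cluster sum of squares by hand and compares it with the `tot.withinss` component that kmeans() reports. The data are simulated (two made-up stats, two well-separated groups), so the numbers say nothing about the Pokemon dataset.

```r
set.seed(42)

# Simulated data: two groups of 50 points in two dimensions
sim_data <- rbind(
  matrix(rnorm(100, mean = 50, sd = 5), ncol = 2),
  matrix(rnorm(100, mean = 100, sd = 5), ncol = 2)
)

km_sim <- kmeans(sim_data, centers = 2, nstart = 20)

# For each point, look up the centre of its own cluster
own_centres <- km_sim$centers[km_sim$cluster, , drop = FALSE]

# Sum of squared Euclidean distances from each point to its own centre
manual_wss <- sum((sim_data - own_centres)^2)

# Should agree with the value reported by kmeans()
c(manual = manual_wss, reported = km_sim$tot.withinss)
```

The elbow method plots this quantity for increasing k and looks for the point where adding another cluster stops reducing it by much.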

pokemon <- pokemon_data %>% select(6:11)

# The within-cluster sum of squares is in the millions for this data, so label the axis in millions
fviz_nbclust(pokemon, kmeans, method = "wss", k.max = 15) + 
    scale_y_continuous(labels = number_format(scale = 1e-6, accuracy = 0.1, big.mark = ",", suffix = " mil.")) + 
    labs(subtitle = "Elbow method")

fviz_nbclust(pokemon, kmeans, method = "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

Both methods suggest that the optimal number of clusters is somewhere between 2 and 4. Weighing these results, we take k = 4, that is, we group the Pokemon into 4 clusters based on their statistics.
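
Since the choice of k is read off two plots, it can help to look at the numbers behind them. The sketch below uses silhouette() from the cluster package (an extra dependency that the report does not load) on simulated data; `avg_sil_width` is an illustrative helper name.

```r
library(cluster)

set.seed(1)

# Simulated data: three groups of 40 points in two dimensions
sim_data <- rbind(
  matrix(rnorm(80, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(80, mean = 6,  sd = 1), ncol = 2),
  matrix(rnorm(80, mean = 12, sd = 1), ncol = 2)
)
dist_mat <- dist(sim_data)

# Average silhouette width of a k-means solution with k clusters
avg_sil_width <- function(k) {
  km <- kmeans(sim_data, centers = k, nstart = 20)
  sil <- silhouette(km$cluster, dist_mat)
  mean(sil[, "sil_width"])
}

k_values <- 2:6
sil_by_k <- sapply(k_values, avg_sil_width)
names(sil_by_k) <- k_values

# k with the highest average silhouette width
best_k <- k_values[which.max(sil_by_k)]
```

Applied to the six Pokemon stats, the same loop would show how close the candidate values of k are to each other, which is useful when the plots leave the choice open.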

# Select number of clusters
k <- 4

# Build model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 288, 283, 114, 115
## 
## Cluster means:
##         HP    Attack   Defense   Sp..Atk  Sp..Def     Speed
## 1 79.18056  81.31944  69.19097  82.01042 77.53125  80.10417
## 2 50.29682  54.03180  51.62898  47.90459 49.15548  49.74912
## 3 89.20175 121.09649  92.73684 120.45614 97.67544 100.44737
## 4 71.30435  92.91304 121.42609  63.89565 88.23478  52.36522
## 
## Clustering vector:
##   [1] 2 1 1 3 2 1 1 3 3 2 1 1 3 2 2 1 2 2 1 1 2 2 1 3 2 1 2 1 2 1 2 1 2 4 2 2 1
##  [38] 2 2 1 2 1 2 1 2 1 2 1 2 1 1 2 4 2 1 2 1 2 1 2 1 2 1 2 3 2 2 1 2 1 1 3 2 1
##  [75] 4 2 1 1 2 1 2 4 4 1 1 2 4 4 2 1 2 2 1 2 1 2 1 2 4 2 1 1 3 4 2 1 2 4 2 1 2
## [112] 1 2 4 1 4 2 2 4 2 4 1 1 1 3 2 1 2 1 2 1 1 1 1 1 1 4 3 1 2 1 3 1 2 2 1 1 3
## [149] 1 2 4 2 4 1 3 1 3 3 3 2 1 3 3 3 3 3 2 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2 1 2
## [186] 1 2 2 2 2 1 2 1 2 2 1 3 4 2 1 4 1 2 2 1 2 2 1 1 2 4 1 4 1 1 1 2 1 1 2 4 1
## [223] 4 4 4 2 1 1 4 4 4 1 4 1 2 1 2 4 2 1 2 2 1 2 1 4 2 1 3 1 2 4 1 1 2 2 4 2 2
## [260] 2 1 1 3 3 4 2 1 3 3 3 3 3 2 1 1 3 2 1 3 3 2 1 1 3 2 1 2 1 2 2 1 2 2 2 2 1
## [297] 2 2 1 2 1 2 1 2 2 1 3 2 1 2 1 2 1 3 2 1 2 2 2 1 2 1 2 4 2 2 2 4 2 4 2 4 4
## [334] 4 2 1 1 2 1 3 1 1 1 1 1 2 1 2 1 3 1 1 2 1 3 4 2 1 2 2 2 1 2 1 2 1 3 1 1 1
## [371] 1 2 1 2 1 2 4 2 4 2 4 2 1 1 4 2 1 3 2 4 1 1 1 3 2 2 1 3 2 1 1 2 4 4 4 2 2
## [408] 4 3 3 2 4 4 3 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 2 4 4 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 1 2 1 2 1 4 4 2 4 4 4 1 2 4 1 2 1 2 1 2 1 1 2 1 2 1 3 1 1 2 1
## [482] 2 2 1 2 4 2 2 2 1 4 2 1 3 3 1 2 3 3 2 4 2 4 2 1 1 2 1 2 2 1 3 1 4 4 4 4 3
## [519] 3 1 1 4 1 4 1 1 1 3 4 4 1 1 1 1 1 1 1 4 3 3 3 3 3 3 3 3 4 1 3 3 3 3 3 3 2
## [556] 1 1 2 1 1 2 1 1 2 1 2 2 4 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 4 4 2 1 2 1 1 4
## [593] 2 4 4 2 1 1 4 1 2 4 1 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 4 2 4 1 2 4 4 4 1 3
## [630] 2 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 1 2 4 2 1 2 1 1 2 1 2 4 2 4 4 2 1 1
## [667] 2 1 2 2 1 2 1 3 2 1 1 2 1 1 2 1 4 2 4 2 4 4 2 1 2 4 1 4 2 1 3 2 1 3 3 3 3
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 4 4 2 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 2 1
## [741] 2 1 2 1 1 2 1 1 2 4 3 4 2 1 2 1 2 1 2 4 2 4 2 1 2 1 2 4 2 1 1 1 1 4 2 1 3
## [778] 1 2 1 2 2 2 2 4 4 4 4 2 4 2 1 3 3 3 4 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 871531.3 513476.6 408271.9 455965.7
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
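
The components listed at the end of the printout can be used directly. The sketch below shows how cluster labels might be attached back to the rows, how the cluster sizes are tabulated, and where the between_SS / total_SS figure comes from. It runs on a tiny made-up data frame (`toy`), so the values differ from the output above.

```r
set.seed(123)

# Made-up data: six rows, two stats
toy <- data.frame(
  Name   = c("A", "B", "C", "D", "E", "F"),
  Attack = c(50, 52, 48, 120, 118, 125),
  Speed  = c(45, 50, 47, 100, 105, 98)
)

km_toy <- kmeans(toy[, c("Attack", "Speed")], centers = 2, nstart = 20)

# Attach each row's cluster label to the original data
toy$cluster <- km_toy$cluster

# Cluster sizes (same counts as km_toy$size)
cluster_sizes <- table(toy$cluster)

# Share of the total sum of squares that lies between clusters
explained <- km_toy$betweenss / km_toy$totss

# Per-cluster mean of each stat, equivalent to km_toy$centers
cluster_means <- aggregate(cbind(Attack, Speed) ~ cluster, data = toy, FUN = mean)
```

For the model above, the analogous calls would use km.pokemon and pokemon_data. The printed ratio of 47.6 % means that a little under half of the total sum of squares is accounted for by the four clusters.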