This project explores the well-known Pokemon dataset. The original Pokémon is a role-playing game based around building a small team of monsters to battle other monsters in a quest to become the best. Anyone who has played Pokemon has noticed that Pokemon fall into certain tropes: Pikachu, Plusle, Pachirisu, and Dedenne are all cute Electric-type Pokemon; there are a variety of “big monster” Pokemon like Rhydon, Nidoking, Tyranitar, and Aggron; and there are many other classes one might notice.
The algorithmic process of finding unknown groupings in data is called clustering. Clustering is generally unsupervised: we do not know in advance what the clusters should be, although we might have opinions on what they should look like. In this project we try to group Pokemon into clusters based on similarities in their statistics.
This data set covers 721 Pokemon and records their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. Alternate forms such as Mega Evolutions have their own rows, so the table has 800 observations. The important variables of the dataset are as follows:
Through this project we aim to:
Packages Required
library(ggplot2)
library(gplots)
library(ROCR)
library(dplyr)
library(tidyverse)
library(highcharter)
library(reshape2)
library(factoextra)
library(scales)
library(rpart)
library(rpart.plot)
Importing the Dataset
pokemon_data <- read.csv("/Users/sindhuherle/Documents/Data mining/Pokemon.csv")
Before analyzing, let us examine the dataset using the head() and str() functions.
head(pokemon_data,10)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## 7 6 Charizard Fire Flying 534 78 84 78 109
## 8 6 CharizardMega Charizard X Fire Dragon 634 78 130 111 130
## 9 6 CharizardMega Charizard Y Fire Flying 634 78 104 78 159
## 10 7 Squirtle Water 314 44 48 65 50
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
## 7 85 100 1 False
## 8 85 100 1 False
## 9 115 100 1 False
## 10 64 43 1 False
str(pokemon_data)
## 'data.frame': 800 obs. of 13 variables:
## $ X. : int 1 2 3 3 4 5 6 6 6 7 ...
## $ Name : chr "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
## $ Type.1 : chr "Grass" "Grass" "Grass" "Grass" ...
## $ Type.2 : chr "Poison" "Poison" "Poison" "Poison" ...
## $ Total : int 318 405 525 625 309 405 534 634 634 314 ...
## $ HP : int 45 60 80 80 39 58 78 78 78 44 ...
## $ Attack : int 49 62 82 100 52 64 84 130 104 48 ...
## $ Defense : int 49 63 83 123 43 58 78 111 78 65 ...
## $ Sp..Atk : int 65 80 100 122 60 80 109 130 159 50 ...
## $ Sp..Def : int 65 80 100 120 50 65 85 85 115 64 ...
## $ Speed : int 45 60 80 80 65 80 100 100 100 43 ...
## $ Generation: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Legendary : chr "False" "False" "False" "False" ...
Let us also check whether there are any missing values in the dataset.
colSums(is.na(pokemon_data))
## X. Name Type.1 Type.2 Total HP Attack
## 0 0 0 0 0 0 0
## Defense Sp..Atk Sp..Def Speed Generation Legendary
## 0 0 0 0 0 0
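The dataset contains no NA values. Note, however, that single-type Pokemon such as Charmander have an empty string in Type.2, which is.na() does not count as missing. Basic insights can be obtained by exploring the data through visualizations.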
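Because Type.2 is read as a character column, blank entries pass the is.na() check unnoticed. The following is a minimal sketch of how blank strings could be counted alongside NA values. The toy data frame and the names `toy`, `na_counts`, and `blank_counts` are assumptions made for illustration; they are not part of the original analysis.

```r
# Toy data frame mimicking the shape of the Pokemon table (illustrative values)
toy <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Squirtle"),
  Type.1 = c("Grass", "Fire", "Water"),
  Type.2 = c("Poison", "", ""),
  stringsAsFactors = FALSE
)

# is.na() reports nothing, because "" is a valid (non-NA) string
na_counts <- colSums(is.na(toy))

# Count empty or whitespace-only entries in each character column
blank_counts <- sapply(toy, function(col) sum(is.character(col) & trimws(col) == ""))

print(na_counts)
print(blank_counts)
```

If blanks should be treated as missing from the start, passing `na.strings = c("", "NA")` to read.csv() is one option; whether that is desirable here depends on how Type.2 is used later.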
Pokemon distribution of Type 1 and Type 2
# Count of each secondary type (x axis), one panel per primary type
pokemon.plot2 <- ggplot(pokemon_data, aes(Type.2)) +
  geom_bar(aes(fill = as.factor(Type.2))) +
  scale_fill_discrete(name = "Type 2") +
  labs(x = "Type 2", y = "Count", title = "Distr. of Type 1 and Type 2") +
  facet_wrap(~Type.1) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pokemon.plot2
A box plot shows the five-number summary of a data set: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum. Before moving on to clustering, let's visualize the Pokemon stats HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed through boxplots.
boxplot(pokemon_data[6:11])
The boxplot shows that every variable has outliers, with HP and Defense showing the most extreme ones.
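The points that boxplot() draws beyond the whiskers can also be listed numerically. The sketch below uses base R's boxplot.stats(), which flags values lying more than 1.5 times the hinge spread beyond the hinges. The data frame `stats` and its values are made up for illustration and are not taken from the Pokemon data.

```r
# Illustrative values only; each column contains one deliberately extreme value
stats <- data.frame(
  HP      = c(45, 60, 80, 39, 58, 78, 44, 59, 79, 255),
  Defense = c(49, 63, 83, 43, 58, 78, 65, 80, 100, 230)
)

# boxplot.stats()$out returns the values drawn as outlier points by boxplot()
outlier_values <- lapply(stats, function(x) boxplot.stats(x)$out)

# Number of outliers per column
outlier_counts <- sapply(outlier_values, length)

print(outlier_values)
print(outlier_counts)
```

Applied to `pokemon_data[6:11]`, the same two calls would give the outlying values and their counts for each stat, which makes a statement about which variables have the most outliers easy to verify.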
A correlation heatmap shows a correlation matrix as a grid of colored cells, one cell per pair of variables. Each correlation coefficient lies between -1 and 1 and summarizes the strength and direction of the linear relationship between the two variables.
## select only numeric columns
pokemon_numeric <- select_if(pokemon_data, is.numeric)
## correlation matrix
cormat <- round(cor(pokemon_numeric),2)
head(cormat)
## X. Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X. 1.00 0.12 0.10 0.10 0.09 0.09 0.09 0.01 0.98
## Total 0.12 1.00 0.62 0.74 0.61 0.75 0.72 0.58 0.05
## HP 0.10 0.62 1.00 0.42 0.24 0.36 0.38 0.18 0.06
## Attack 0.10 0.74 0.42 1.00 0.44 0.40 0.26 0.38 0.05
## Defense 0.09 0.61 0.24 0.44 1.00 0.22 0.51 0.02 0.04
## Sp..Atk 0.09 0.75 0.36 0.40 0.22 1.00 0.51 0.47 0.04
# basic correlation plot
melted_cormat <- melt(cormat)
head(melted_cormat)
## Var1 Var2 value
## 1 X. X. 1.00
## 2 Total X. 0.12
## 3 HP X. 0.10
## 4 Attack X. 0.10
## 5 Defense X. 0.09
## 6 Sp..Atk X. 0.09
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill = value)) +
geom_tile()
### plot the heatmap
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
## X. Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X. 1 0.12 0.10 0.10 0.09 0.09 0.09 0.01 0.98
## Total NA 1.00 0.62 0.74 0.61 0.75 0.72 0.58 0.05
## HP NA NA 1.00 0.42 0.24 0.36 0.38 0.18 0.06
## Attack NA NA NA 1.00 0.44 0.40 0.26 0.38 0.05
## Defense NA NA NA NA 1.00 0.22 0.51 0.02 0.04
## Sp..Atk NA NA NA NA NA 1.00 0.51 0.47 0.04
## Sp..Def NA NA NA NA NA NA 1.00 0.26 0.03
## Speed NA NA NA NA NA NA NA 1.00 -0.02
## Generation NA NA NA NA NA NA NA NA 1.00
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
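Once only one triangle of the matrix is kept, the variable pairs can be ranked by the strength of their correlation. The sketch below does this in base R on synthetic data, using as.data.frame(as.table()) as a stand-in for melt(). The variables in `toy` are randomly generated for illustration and only borrow column names from the dataset.

```r
set.seed(1)
n <- 50
a <- rnorm(n)
toy <- data.frame(
  Attack = a,
  Total  = a + rnorm(n, sd = 0.3),  # constructed to correlate strongly with Attack
  Speed  = rnorm(n)                 # independent noise
)

cormat_toy <- round(cor(toy), 2)

# Keep only the strict upper triangle so each pair appears once
cormat_toy[lower.tri(cormat_toy, diag = TRUE)] <- NA

# Long format: one row per (Var1, Var2) pair
pairs_df <- as.data.frame(as.table(cormat_toy), stringsAsFactors = FALSE)
names(pairs_df) <- c("Var1", "Var2", "value")
pairs_df <- pairs_df[!is.na(pairs_df$value), ]

# Strongest relationships first, regardless of sign
pairs_df <- pairs_df[order(-abs(pairs_df$value)), ]
print(pairs_df)
```

Masking the diagonal as well (diag = TRUE) drops the trivial self-correlations of 1, which would otherwise dominate the ranking.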
Correlation Insights
For clustering we use the six base statistics in columns 6 to 11: HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed.
Before running the cluster analysis, we need to determine the optimal number of clusters. K-means seeks to minimize the total within-cluster sum of squares, that is, the squared distance between each observation and the centre of its cluster. To choose the number of clusters, we can use the elbow method or the silhouette method.
pokemon <- pokemon_data %>% select(6:11)
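# The within-cluster sum of squares is in the millions here, so label the axis in millions
fviz_nbclust(pokemon, kmeans, method = "wss", k.max = 15) +
  scale_y_continuous(labels = number_format(scale = 10^(-6), accuracy = 0.1, big.mark = ",", suffix = " mil.")) +
  labs(subtitle = "Elbow method")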
fviz_nbclust(pokemon, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
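The quantity behind the elbow plot is the total within-cluster sum of squares for each candidate k. The sketch below reproduces that curve in base R on synthetic data with three well-separated groups. The matrix `toy`, the range `k_values`, and the seed are assumptions chosen for illustration.

```r
set.seed(42)
# Three well-separated synthetic groups in two dimensions (illustrative only)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 8,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 16, sd = 1), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing WSS substantially
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve drops sharply up to k = 3 and flattens afterwards. Real data such as the Pokemon stats rarely show so clean a bend, which is why a range of plausible k values is common.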
Both methods suggest that the optimal number of clusters lies somewhere between 2 and 4. Weighing these results, we take k = 4, which groups the Pokemon into four clusters based on their statistics.
# Select number of clusters
k <- 4
# Build model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 288, 283, 114, 115
##
## Cluster means:
## HP Attack Defense Sp..Atk Sp..Def Speed
## 1 79.18056 81.31944 69.19097 82.01042 77.53125 80.10417
## 2 50.29682 54.03180 51.62898 47.90459 49.15548 49.74912
## 3 89.20175 121.09649 92.73684 120.45614 97.67544 100.44737
## 4 71.30435 92.91304 121.42609 63.89565 88.23478 52.36522
##
## Clustering vector:
## [1] 2 1 1 3 2 1 1 3 3 2 1 1 3 2 2 1 2 2 1 1 2 2 1 3 2 1 2 1 2 1 2 1 2 4 2 2 1
## [38] 2 2 1 2 1 2 1 2 1 2 1 2 1 1 2 4 2 1 2 1 2 1 2 1 2 1 2 3 2 2 1 2 1 1 3 2 1
## [75] 4 2 1 1 2 1 2 4 4 1 1 2 4 4 2 1 2 2 1 2 1 2 1 2 4 2 1 1 3 4 2 1 2 4 2 1 2
## [112] 1 2 4 1 4 2 2 4 2 4 1 1 1 3 2 1 2 1 2 1 1 1 1 1 1 4 3 1 2 1 3 1 2 2 1 1 3
## [149] 1 2 4 2 4 1 3 1 3 3 3 2 1 3 3 3 3 3 2 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2 1 2
## [186] 1 2 2 2 2 1 2 1 2 2 1 3 4 2 1 4 1 2 2 1 2 2 1 1 2 4 1 4 1 1 1 2 1 1 2 4 1
## [223] 4 4 4 2 1 1 4 4 4 1 4 1 2 1 2 4 2 1 2 2 1 2 1 4 2 1 3 1 2 4 1 1 2 2 4 2 2
## [260] 2 1 1 3 3 4 2 1 3 3 3 3 3 2 1 1 3 2 1 3 3 2 1 1 3 2 1 2 1 2 2 1 2 2 2 2 1
## [297] 2 2 1 2 1 2 1 2 2 1 3 2 1 2 1 2 1 3 2 1 2 2 2 1 2 1 2 4 2 2 2 4 2 4 2 4 4
## [334] 4 2 1 1 2 1 3 1 1 1 1 1 2 1 2 1 3 1 1 2 1 3 4 2 1 2 2 2 1 2 1 2 1 3 1 1 1
## [371] 1 2 1 2 1 2 4 2 4 2 4 2 1 1 4 2 1 3 2 4 1 1 1 3 2 2 1 3 2 1 1 2 4 4 4 2 2
## [408] 4 3 3 2 4 4 3 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 2 4 4 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 1 2 1 2 1 4 4 2 4 4 4 1 2 4 1 2 1 2 1 2 1 1 2 1 2 1 3 1 1 2 1
## [482] 2 2 1 2 4 2 2 2 1 4 2 1 3 3 1 2 3 3 2 4 2 4 2 1 1 2 1 2 2 1 3 1 4 4 4 4 3
## [519] 3 1 1 4 1 4 1 1 1 3 4 4 1 1 1 1 1 1 1 4 3 3 3 3 3 3 3 3 4 1 3 3 3 3 3 3 2
## [556] 1 1 2 1 1 2 1 1 2 1 2 2 4 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 4 4 2 1 2 1 1 4
## [593] 2 4 4 2 1 1 4 1 2 4 1 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 4 2 4 1 2 4 4 4 1 3
## [630] 2 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 1 2 4 2 1 2 1 1 2 1 2 4 2 4 4 2 1 1
## [667] 2 1 2 2 1 2 1 3 2 1 1 2 1 1 2 1 4 2 4 2 4 4 2 1 2 4 1 4 2 1 3 2 1 3 3 3 3
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 4 4 2 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 2 1
## [741] 2 1 2 1 1 2 1 1 2 4 3 4 2 1 2 1 2 1 2 4 2 4 2 1 2 1 2 4 2 1 1 1 1 4 2 1 3
## [778] 1 2 1 2 2 2 2 4 4 4 4 2 4 2 1 3 3 3 4 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 871531.3 513476.6 408271.9 455965.7
## (between_SS / total_SS = 47.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The clustering vector shows the cluster assigned to each row. For ease of further analysis, we attach this vector to the statistics in a data frame.
km.pokemon.table <- data.frame(km.pokemon$size, km.pokemon$centers)
km.pokemon.df <- data.frame(Cluster = km.pokemon$cluster, pokemon)
# head of df
head(km.pokemon.df)
## Cluster HP Attack Defense Sp..Atk Sp..Def Speed
## 1 2 45 49 49 65 65 45
## 2 1 60 62 63 80 80 60
## 3 1 80 82 83 100 100 80
## 4 3 80 100 123 122 120 80
## 5 2 39 52 43 60 50 65
## 6 1 58 64 58 80 65 80
Before moving on to a visual analysis of the clusters, we check the quality of the partition. The quality of a k-means partition can be measured as the percentage of the TSS “explained” by the partition:
(BSS/TSS) × 100 %
where BSS and TSS stand for Between Sum of Squares and Total Sum of Squares, respectively. Since TSS = BSS + WSS, where WSS is the total Within Sum of Squares, a higher percentage means that BSS is large relative to WSS, i.e. the clusters are compact and well separated.
#Quality of partition
(BSS <- km.pokemon$betweenss)
## [1] 2039482
(TSS <- km.pokemon$totss)
## [1] 4288727
BSS / TSS * 100
## [1] 47.55448
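The components reported by kmeans() can be recomputed by hand, which makes the meaning of the ratio concrete. The sketch below does so on synthetic two-group data and checks the decomposition TSS = BSS + WSS. The names `toy`, `km`, and the `*_manual` variables are assumptions made for illustration.

```r
set.seed(7)
# Two synthetic groups in two dimensions (illustrative only)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 1), ncol = 2)
)
km <- kmeans(toy, centers = 2, nstart = 10)

# TSS: squared distances of all points to the overall mean
tss_manual <- sum(scale(toy, center = TRUE, scale = FALSE)^2)

# WSS: squared distances of each point to the centre of its own cluster
wss_manual <- sum((toy - km$centers[km$cluster, ])^2)

# BSS follows from the decomposition TSS = BSS + WSS
bss_manual <- tss_manual - wss_manual

quality <- bss_manual / tss_manual * 100
print(c(manual = quality, kmeans = km$betweenss / km$totss * 100))
```

Because WSS can only shrink as more centres are added, this percentage tends to rise with k, so it is better used to compare partitions of similar size than to pick k on its own.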
With k = 4, the partition explains 47.6% of the total sum of squares, which we consider a reasonable quality of partition. Decreasing k lowered this percentage. Increasing k to 5 raised it to 51.2%, but we also noticed more overlap between the clusters, so we decided to keep k = 4.
We now visualize the clusters to better understand how the Pokemon have been grouped.
library(cluster)
library(fpc)
fviz_cluster(km.pokemon, data = pokemon, geom = "point")
Plotting the number of Pokemon in each cluster shows that clusters 1 and 2 are the largest, with cluster 1 (288 Pokemon) slightly ahead of cluster 2 (283).
#Count of clusters
ggplot(data = km.pokemon.df, aes(y = Cluster)) +
geom_bar(fill = "lightblue") +
ggtitle("Count of pokemons by cluster") +
theme(plot.title = element_text(hjust = 0.5))
#Cluster comparison with the different variables
# Colour points by cluster number; a 3-colour palette would leave cluster 4 undrawn
pairs(pokemon, col = km.pokemon$cluster)
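We focus mainly on Attack and Speed, with a brief look at HP, to compare the clusters.
# Treat Cluster as a factor so each cluster gets a distinct colour instead of a gradient
ggplot(km.pokemon.df) +
  geom_point(aes(x = Cluster, y = Attack, color = factor(Cluster))) +
  labs(color = "Cluster")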
This shows that the Pokemon with the highest Attack values are in cluster 3, while cluster 2 holds those with the lowest. Similarly, plotting Speed shows that cluster 3 has the Pokemon with the highest Speed values, while clusters 2 and 4 overlap substantially on this variable.
ggplot(km.pokemon.df) +
  geom_point(aes(x = Cluster, y = Speed, color = factor(Cluster))) +
  labs(color = "Cluster")
ggplot(km.pokemon.df) +
  geom_point(aes(x = Cluster, y = HP, color = factor(Cluster))) +
  labs(color = "Cluster")
Examining the clusters, we conclude that cluster 3 holds the strongest Pokemon: it has the highest mean for every attribute except Defense, where cluster 4 leads. Cluster 2 has the lowest mean for every attribute and therefore holds the weakest Pokemon.
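A statement about which cluster is strongest can be checked directly against the cluster means printed by km.pokemon earlier. The sketch below copies those printed means into a data frame and finds, for each stat, the cluster with the highest and the lowest mean. The names `centers`, `top_cluster`, and `low_cluster` are assumptions made for illustration; in the live session `km.pokemon$centers` holds the same numbers.

```r
# Cluster means copied from the km.pokemon output shown earlier
centers <- data.frame(
  Cluster = 1:4,
  HP      = c(79.18056, 50.29682, 89.20175, 71.30435),
  Attack  = c(81.31944, 54.03180, 121.09649, 92.91304),
  Defense = c(69.19097, 51.62898, 92.73684, 121.42609),
  Sp..Atk = c(82.01042, 47.90459, 120.45614, 63.89565),
  Sp..Def = c(77.53125, 49.15548, 97.67544, 88.23478),
  Speed   = c(80.10417, 49.74912, 100.44737, 52.36522)
)

# For each stat, the cluster with the highest and the lowest mean
top_cluster <- sapply(centers[-1], function(m) centers$Cluster[which.max(m)])
low_cluster <- sapply(centers[-1], function(m) centers$Cluster[which.min(m)])

print(rbind(highest = top_cluster, lowest = low_cluster))
```

With these means, cluster 3 leads every stat except Defense, where cluster 4 leads, and cluster 2 is lowest throughout. Since k-means labels depend on the random start, a rerun may assign different numbers to the same groups.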
We use a decision tree to predict which cluster a Pokémon belongs to, based on its statistics.
# 80/20 train/test split (the split is random; call set.seed() first to make it reproducible)
trainingIndex <- sample(nrow(km.pokemon.df ),nrow(km.pokemon.df )*0.80)
Train <- km.pokemon.df [trainingIndex,]
Test <- km.pokemon.df [-trainingIndex,]
# train the model using training dataset
dectree <- rpart(Cluster~., data = Train, method = 'class')
prp(dectree, type=3, main= "Probabilities per class")
We check this decision tree against the test data by picking a Pokemon from the test rows shown below.
head(Test,5)
## Cluster HP Attack Defense Sp..Atk Sp..Def Speed
## 1 2 45 49 49 65 65 45
## 3 1 80 82 83 100 100 80
## 7 1 78 84 78 109 85 100
## 12 1 79 83 100 85 105 78
## 14 2 45 30 35 20 20 45
From the above data, let's pick Pokemon 1 and traverse the tree to see whether it falls into the right cluster.
Following the decision tree, Pokemon 1 falls into cluster 2, which matches the cluster recorded in the test data above.
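Checking a single row by hand can be complemented by scoring every held-out row at once. The sketch below fits an rpart classification tree on synthetic data and summarizes its test predictions in a confusion matrix. The data frame `toy`, the seed, and the names `fit`, `pred`, `confusion`, and `accuracy` are assumptions made for illustration and stand in for `km.pokemon.df`, `Train`, `Test`, and `dectree`.

```r
library(rpart)

set.seed(123)
# Synthetic stand-in for the clustered data: two stats and a cluster label
n <- 200
toy <- data.frame(
  Attack = c(rnorm(n / 2, mean = 50, sd = 8), rnorm(n / 2, mean = 110, sd = 8)),
  Speed  = c(rnorm(n / 2, mean = 50, sd = 8), rnorm(n / 2, mean = 100, sd = 8))
)
toy$Cluster <- factor(rep(c(1, 2), each = n / 2))

# 80/20 train/test split
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit a classification tree on the training rows
fit <- rpart(Cluster ~ ., data = train, method = "class")

# Predict the cluster of every held-out row and compare with the true label
pred <- predict(fit, newdata = test, type = "class")
confusion <- table(Actual = test$Cluster, Predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

In the analysis above, the equivalent call would be `predict(dectree, newdata = Test, type = "class")`, compared against `Test$Cluster`. The off-diagonal cells of the table then show which clusters the tree confuses with each other.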
From our analysis of the dataset through EDA, clustering, and a decision tree, we reach a conclusion on the strongest and weakest Pokemon.
Strongest Pokemon
Weakest Pokemon