In this project we explore the well-known Pokemon dataset. The original Pokémon is a role-playing game built around assembling a small team of monsters to battle other monsters in a quest to become the best. Anyone who has played Pokemon has noticed that Pokemon fall into certain tropes: Pikachu, Plusle, Pachirisu, and Dedenne are all cute Electric-type Pokemon; there are a variety of “big monster” Pokemon such as Rhydon, Nidoking, Tyranitar, and Aggron; and there are many other classes one might notice.
The algorithmic process of finding unknown groupings, or clusters, in data is called clustering. Clustering is generally unsupervised: we do not know in advance what the clusters should be, although we might have opinions on what they should look like. In this project we try to identify groups of Pokemon with similar statistics and partition the data into clusters accordingly.
The dataset covers 721 Pokemon and records their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. Alternate forms such as Mega Evolutions are stored as separate rows, which is why the data frame loaded below has 800 observations. The important variables of the dataset are as follows:
Through this project we aim to:
Packages Required
library(ggplot2)
library(gplots)
library(ROCR)
library(dplyr)
library(tidyverse)
library(highcharter)
library(reshape2)
library(factoextra)
library(scales)
Importing the Dataset
pokemon_data <- read.csv("/Users/sindhuherle/Documents/Data mining/Pokemon.csv")
Before analyzing, let us examine the dataset using the head() and str() functions.
head(pokemon_data,10)
##    X.                      Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1   1                 Bulbasaur  Grass Poison   318 45     49      49      65
## 2   2                   Ivysaur  Grass Poison   405 60     62      63      80
## 3   3                  Venusaur  Grass Poison   525 80     82      83     100
## 4   3     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5   4                Charmander   Fire          309 39     52      43      60
## 6   5                Charmeleon   Fire          405 58     64      58      80
## 7   6                 Charizard   Fire Flying   534 78     84      78     109
## 8   6 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 9   6 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 10  7                  Squirtle  Water          314 44     48      65      50
##    Sp..Def Speed Generation Legendary
## 1       65    45          1     False
## 2       80    60          1     False
## 3      100    80          1     False
## 4      120    80          1     False
## 5       50    65          1     False
## 6       65    80          1     False
## 7       85   100          1     False
## 8       85   100          1     False
## 9      115   100          1     False
## 10      64    43          1     False
str(pokemon_data)
## 'data.frame': 800 obs. of 13 variables:
## $ X. : int 1 2 3 3 4 5 6 6 6 7 ...
## $ Name : chr "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
## $ Type.1 : chr "Grass" "Grass" "Grass" "Grass" ...
## $ Type.2 : chr "Poison" "Poison" "Poison" "Poison" ...
## $ Total : int 318 405 525 625 309 405 534 634 634 314 ...
## $ HP : int 45 60 80 80 39 58 78 78 78 44 ...
## $ Attack : int 49 62 82 100 52 64 84 130 104 48 ...
## $ Defense : int 49 63 83 123 43 58 78 111 78 65 ...
## $ Sp..Atk : int 65 80 100 122 60 80 109 130 159 50 ...
## $ Sp..Def : int 65 80 100 120 50 65 85 85 115 64 ...
## $ Speed : int 45 60 80 80 65 80 100 100 100 43 ...
## $ Generation: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Legendary : chr "False" "False" "False" "False" ...
Let us also check whether there are any missing values in the dataset.
colSums(is.na(pokemon_data))
## X. Name Type.1 Type.2 Total HP Attack
## 0 0 0 0 0 0 0
## Defense Sp..Atk Sp..Def Speed Generation Legendary
## 0 0 0 0 0 0
The dataset contains no NA values. Basic insights into the data can be obtained through visualizations.
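One caveat about the check above: `is.na()` does not flag empty strings, and the `head()` output shows a blank `Type.2` for single-type Pokemon such as Charmander. The sketch below counts blanks alongside NA values. The data frame `toy` is a hypothetical stand-in built from a few of the rows printed earlier, not the full dataset, and recoding blanks to NA is one possible convention rather than a required step.

```r
# Hypothetical stand-in for pokemon_data, using four rows from the head() output
toy <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Charizard", "Squirtle"),
  Type.1 = c("Grass", "Fire", "Fire", "Water"),
  Type.2 = c("Poison", "", "Flying", ""),
  stringsAsFactors = FALSE
)

# is.na() reports nothing because "" is a valid (non-missing) string
na_counts <- colSums(is.na(toy))

# Counting empty strings reveals the single-type rows
blank_counts <- colSums(toy == "")

# Optionally recode blanks as NA so later summaries treat them as missing
toy$Type.2[toy$Type.2 == ""] <- NA
na_after <- sum(is.na(toy$Type.2))
```

On the real data, the same `colSums(pokemon_data == "")` call would show how many Pokemon have no second type.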
Pokemon distribution of Type 1 and Type 2
# One facet per Type 1; bars count the Type 2 values within it
pokemon.plot2 <- ggplot(pokemon_data, aes(Type.2)) +
  geom_bar(aes(fill = as.factor(Type.2))) +
  scale_fill_discrete(name = "Type 2") +
  labs(x = "Type 2", y = "Count", title = "Distr. of Type 1 and Type 2") +
  facet_wrap(~Type.1) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pokemon.plot2
Box plots show the five-number summary of a set of data: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum. Before we move on to clustering, let's visualize the Pokemon stats HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed through box plots.
boxplot(pokemon_data[6:11])
From the box plot, we can see that all the variables have outliers, with HP and Defense showing the highest ones.
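To back up a visual reading of the box plot with numbers, base R's `boxplot.stats()` returns the points that a box plot draws beyond its whiskers. The sketch below uses a small hypothetical data frame `stats` with made-up values, so the counts say nothing about the real dataset.

```r
# Hypothetical stand-in for two of the stat columns in pokemon_data[6:11]
stats <- data.frame(
  HP     = c(45, 60, 80, 39, 58, 78, 44, 255),
  Attack = c(49, 62, 82, 52, 64, 84, 48, 50)
)

# boxplot.stats() returns the values lying more than 1.5 times the
# box height beyond the hinges (the default coef = 1.5)
outliers <- lapply(stats, function(x) boxplot.stats(x)$out)

# Number of flagged values per variable
outlier_counts <- sapply(outliers, length)
outlier_counts
```

Applied to the real data, `lapply(pokemon_data[6:11], function(x) boxplot.stats(x)$out)` would list the flagged values for each stat.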
A correlation heatmap displays a correlation matrix as a 2D grid of colored cells, one cell per pair of variables. A correlation coefficient can take any value from -1 to 1. The heatmap helps us find relationships between the variables.
# Select only numeric columns
pokemon_numeric <- select_if(pokemon_data, is.numeric)
# Correlation matrix, rounded to two decimals
cormat <- round(cor(pokemon_numeric), 2)
head(cormat)
## X. Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X. 1.00 0.12 0.10 0.10 0.09 0.09 0.09 0.01 0.98
## Total 0.12 1.00 0.62 0.74 0.61 0.75 0.72 0.58 0.05
## HP 0.10 0.62 1.00 0.42 0.24 0.36 0.38 0.18 0.06
## Attack 0.10 0.74 0.42 1.00 0.44 0.40 0.26 0.38 0.05
## Defense 0.09 0.61 0.24 0.44 1.00 0.22 0.51 0.02 0.04
## Sp..Atk 0.09 0.75 0.36 0.40 0.22 1.00 0.51 0.47 0.04
# Basic correlation plot
melted_cormat <- melt(cormat)
head(melted_cormat)
## Var1 Var2 value
## 1 X. X. 1.00
## 2 Total X. 0.12
## 3 HP X. 0.10
## 4 Attack X. 0.10
## 5 Defense X. 0.09
## 6 Sp..Atk X. 0.09
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile()
# To plot the heatmap
# Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat) {
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
## X. Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## X. 1 0.12 0.10 0.10 0.09 0.09 0.09 0.01 0.98
## Total NA 1.00 0.62 0.74 0.61 0.75 0.72 0.58 0.05
## HP NA NA 1.00 0.42 0.24 0.36 0.38 0.18 0.06
## Attack NA NA NA 1.00 0.44 0.40 0.26 0.38 0.05
## Defense NA NA NA NA 1.00 0.22 0.51 0.02 0.04
## Sp..Atk NA NA NA NA NA 1.00 0.51 0.47 0.04
## Sp..Def NA NA NA NA NA NA 1.00 0.26 0.03
## Speed NA NA NA NA NA NA NA 1.00 -0.02
## Generation NA NA NA NA NA NA NA NA 1.00
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
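The melted upper triangle is the input for the heatmap itself, which is not shown above. A minimal sketch of how such a plot could be drawn follows, assuming `ggplot2` and `reshape2` are installed. The 3 x 3 matrix is a hypothetical subset using values from the printed correlation matrix, and the color scale and labels are illustrative choices, not the original figure.

```r
library(ggplot2)
library(reshape2)

# Hypothetical 3 x 3 subset of the correlation matrix printed above
cormat <- matrix(c(1.00, 0.42, 0.24,
                   0.42, 1.00, 0.44,
                   0.24, 0.44, 1.00),
                 nrow = 3,
                 dimnames = list(c("HP", "Attack", "Defense"),
                                 c("HP", "Attack", "Defense")))

# Keep the upper triangle only, then reshape to long format
cormat[lower.tri(cormat)] <- NA
melted_cormat <- melt(cormat, na.rm = TRUE)

# Tile plot with a diverging color scale centered on zero
heatmap_plot <- ggplot(melted_cormat, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = value), size = 3) +
  theme_minimal() +
  coord_fixed() +
  labs(x = NULL, y = NULL)

heatmap_plot
```

Fixing the scale limits at -1 and 1 keeps the colors comparable across plots, even when the observed correlations are all positive.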
Correlation Insights
For clustering we use the following Pokemon statistics, which are columns 6 to 11 of the dataset: HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed.
Before we do the cluster analysis, we first need to determine the optimal number of clusters. K-means seeks to minimize the total within-cluster sum of squares, meaning that observations in the same cluster are as close to each other as possible. To find the optimal number of clusters, we can use the elbow method or the silhouette method.
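To make the within-cluster sum of squares concrete, the sketch below recomputes it by hand and compares it with what `kmeans()` reports. It runs on simulated two-dimensional data rather than the Pokemon stats, so `x` and its two groups are purely illustrative.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)

# Squared Euclidean distance from each point to its assigned cluster center
assigned_centers <- km$centers[km$cluster, , drop = FALSE]
sq_dist <- rowSums((x - assigned_centers)^2)

# Summing per cluster gives withinss; summing everything gives tot.withinss
wss_by_cluster <- tapply(sq_dist, km$cluster, sum)
total_wss <- sum(sq_dist)

all.equal(total_wss, km$tot.withinss)
```

The elbow method plots this total for a range of k values and looks for the point where adding another cluster stops reducing it by much.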
pokemon <- pokemon_data %>% select(6:11)
# Elbow method; the WSS totals are in the millions, so label the axis in millions
fviz_nbclust(pokemon, kmeans, method = "wss", k.max = 15) +
  scale_y_continuous(labels = number_format(scale = 1e-6, big.mark = ",", suffix = " mil.")) +
  labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(pokemon, kmeans, method = "silhouette", k.max = 15) +
  labs(subtitle = "Silhouette method")
Both methods suggest that the optimal number of clusters lies somewhere between 2 and 4. Taking both into account, let's use k = 4, which groups the Pokemon into 4 clusters based on their statistics.
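For readers who want to see what the two criteria measure, here is a base-R sketch that computes them directly instead of through `fviz_nbclust()`. The helper `avg_silhouette()` is a hypothetical function written for this illustration, and the data are simulated with three groups, so the chosen k says nothing about the Pokemon data.

```r
# Hypothetical helper: average silhouette width for a given clustering
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  widths <- sapply(seq_len(nrow(x)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # usual convention for singleton clusters
    # a: mean distance to the other members of the point's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(widths)
}

set.seed(42)
# Simulated data with three groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2),
           matrix(rnorm(60, mean = 12), ncol = 2))

# Elbow criterion: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Silhouette criterion: average width for k = 2..6 (undefined for k = 1)
ks <- 2:6
sil <- sapply(ks, function(k) avg_silhouette(x, kmeans(x, centers = k, nstart = 20)$cluster))

best_k <- ks[which.max(sil)]
```

The silhouette method picks the k with the highest average width, while the elbow method relies on a visual judgment of where `wss` flattens, which is why the two can disagree.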
# Select number of clusters
k <- 4
# Build model with k clusters: km.pokemon
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 288, 283, 114, 115
##
## Cluster means:
## HP Attack Defense Sp..Atk Sp..Def Speed
## 1 79.18056 81.31944 69.19097 82.01042 77.53125 80.10417
## 2 50.29682 54.03180 51.62898 47.90459 49.15548 49.74912
## 3 89.20175 121.09649 92.73684 120.45614 97.67544 100.44737
## 4 71.30435 92.91304 121.42609 63.89565 88.23478 52.36522
##
## Clustering vector:
## [1] 2 1 1 3 2 1 1 3 3 2 1 1 3 2 2 1 2 2 1 1 2 2 1 3 2 1 2 1 2 1 2 1 2 4 2 2 1
## [38] 2 2 1 2 1 2 1 2 1 2 1 2 1 1 2 4 2 1 2 1 2 1 2 1 2 1 2 3 2 2 1 2 1 1 3 2 1
## [75] 4 2 1 1 2 1 2 4 4 1 1 2 4 4 2 1 2 2 1 2 1 2 1 2 4 2 1 1 3 4 2 1 2 4 2 1 2
## [112] 1 2 4 1 4 2 2 4 2 4 1 1 1 3 2 1 2 1 2 1 1 1 1 1 1 4 3 1 2 1 3 1 2 2 1 1 3
## [149] 1 2 4 2 4 1 3 1 3 3 3 2 1 3 3 3 3 3 2 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2 1 2
## [186] 1 2 2 2 2 1 2 1 2 2 1 3 4 2 1 4 1 2 2 1 2 2 1 1 2 4 1 4 1 1 1 2 1 1 2 4 1
## [223] 4 4 4 2 1 1 4 4 4 1 4 1 2 1 2 4 2 1 2 2 1 2 1 4 2 1 3 1 2 4 1 1 2 2 4 2 2
## [260] 2 1 1 3 3 4 2 1 3 3 3 3 3 2 1 1 3 2 1 3 3 2 1 1 3 2 1 2 1 2 2 1 2 2 2 2 1
## [297] 2 2 1 2 1 2 1 2 2 1 3 2 1 2 1 2 1 3 2 1 2 2 2 1 2 1 2 4 2 2 2 4 2 4 2 4 4
## [334] 4 2 1 1 2 1 3 1 1 1 1 1 2 1 2 1 3 1 1 2 1 3 4 2 1 2 2 2 1 2 1 2 1 3 1 1 1
## [371] 1 2 1 2 1 2 4 2 4 2 4 2 1 1 4 2 1 3 2 4 1 1 1 3 2 2 1 3 2 1 1 2 4 4 4 2 2
## [408] 4 3 3 2 4 4 3 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 2 4 4 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 1 2 1 2 1 4 4 2 4 4 4 1 2 4 1 2 1 2 1 2 1 1 2 1 2 1 3 1 1 2 1
## [482] 2 2 1 2 4 2 2 2 1 4 2 1 3 3 1 2 3 3 2 4 2 4 2 1 1 2 1 2 2 1 3 1 4 4 4 4 3
## [519] 3 1 1 4 1 4 1 1 1 3 4 4 1 1 1 1 1 1 1 4 3 3 3 3 3 3 3 3 4 1 3 3 3 3 3 3 2
## [556] 1 1 2 1 1 2 1 1 2 1 2 2 4 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 4 4 2 1 2 1 1 4
## [593] 2 4 4 2 1 1 4 1 2 4 1 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 4 2 4 1 2 4 4 4 1 3
## [630] 2 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 1 2 4 2 1 2 1 1 2 1 2 4 2 4 4 2 1 1
## [667] 2 1 2 2 1 2 1 3 2 1 1 2 1 1 2 1 4 2 4 2 4 4 2 1 2 4 1 4 2 1 3 2 1 3 3 3 3
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 4 4 2 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 2 1
## [741] 2 1 2 1 1 2 1 1 2 4 3 4 2 1 2 1 2 1 2 4 2 4 2 1 2 1 2 4 2 1 1 1 1 4 2 1 3
## [778] 1 2 1 2 2 2 2 4 4 4 4 2 4 2 1 3 3 3 4 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 871531.3 513476.6 408271.9 455965.7
## (between_SS / total_SS = 47.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
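The components listed at the end of the printed model can be used directly. The sketch below shows how the `between_SS / total_SS` ratio, the cluster sizes, and the cluster means relate to those components. It runs on a hypothetical random data frame `stats` with the same six column names, so its numbers will not match the output above.

```r
set.seed(42)

# Hypothetical stand-in for the six-column stats table
stats <- as.data.frame(matrix(rnorm(300, mean = 70, sd = 20), ncol = 6))
names(stats) <- c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")

km <- kmeans(stats, centers = 4, nstart = 20, iter.max = 50)

# Share of the total sum of squares explained by the clustering
# (the percentage shown in the printed model)
explained <- km$betweenss / km$totss

# Cluster sizes recomputed from the label vector
sizes <- table(km$cluster)

# Attach labels to the data and recompute the per-cluster means
stats$cluster <- factor(km$cluster)
cluster_means <- aggregate(. ~ cluster, data = stats, FUN = mean)
cluster_means
```

With the real model, `pokemon_data$cluster <- km.pokemon$cluster` would attach each Pokemon's cluster label to its row, which makes it possible to list the members of each cluster by name.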