The main aim here is to learn some visualisation techniques using the Pokémon dataset by Alberto Barradas. There are some neat little things we can do with it. We’ll then try to use the different Pokémon attributes to classify / predict the likeliest type of a Pokémon, if that turns out to be possible at all.
The table below lists each variable in the dataset, its class, and its first few values:
library(dplyr)  # %>% and select()
library(knitr)  # kable()
# stringsAsFactors = TRUE keeps text columns as factors, which has not been
# the read.csv() default since R 4.0
pokemon <- read.csv("Pokemon.csv", stringsAsFactors = TRUE)
# going to reference this pokemonCluster later on
pokemonCluster <- pokemon %>%
  select(Name:Type.2, Total:Speed)
# Below code is the equivalent of str(pokemon) but with better formatting
data.frame(variable = names(pokemon),
           class = sapply(pokemon, class),
           first_values = sapply(
             pokemon, function(x) paste0(head(x), collapse = ", ")),
           row.names = NULL) %>%
  kable()
| variable | class | first_values |
|---|---|---|
| X. | integer | 1, 2, 3, 3, 4, 5 |
| Name | factor | Bulbasaur, Ivysaur, Venusaur, VenusaurMega Venusaur, Charmander, Charmeleon |
| Type.1 | factor | Grass, Grass, Grass, Grass, Fire, Fire |
| Type.2 | factor | Poison, Poison, Poison, Poison, , |
| Total | integer | 318, 405, 525, 625, 309, 405 |
| HP | integer | 45, 60, 80, 80, 39, 58 |
| Attack | integer | 49, 62, 82, 100, 52, 64 |
| Defense | integer | 49, 63, 83, 123, 43, 58 |
| Sp..Atk | integer | 65, 80, 100, 122, 60, 80 |
| Sp..Def | integer | 65, 80, 100, 120, 50, 65 |
| Speed | integer | 45, 60, 80, 80, 65, 80 |
| Generation | integer | 1, 1, 1, 1, 1, 1 |
| Legendary | factor | False, False, False, False, False, False |
We should probably clean up the Pokémon names in the dataset, as the ‘Mega’ varieties repeat the base name, e.g. “VenusaurMega Venusaur” as seen above. There are others with squashed names later on, such as “PumpkabooAverage Size”, so we’ll clean those up too.
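To make the idea concrete before touching the real data, here is a minimal sketch of the two clean-up steps on a toy vector. The three names are typed in by hand from the patterns described above, and the regular expressions are a simplified illustration rather than the exact ones used on the full dataset.

```r
# Toy names copied from the patterns described above; not read from the CSV
raw_names <- c("VenusaurMega Venusaur", "PumpkabooAverage Size", "Bulbasaur")

# Drop everything before "Mega " so only the Mega form's own name is left
step1 <- gsub(".*Mega ", "Mega ", raw_names)

# Put a space after a known base name when it is glued to the next word
step2 <- gsub("^Pumpkaboo", "Pumpkaboo ", step1)

print(step2)
```

Names that match neither pattern, like "Bulbasaur", pass through unchanged.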
# Clean names that are supposed to start with "Mega".
# Matching "Mega " case-sensitively (with the trailing space) leaves names
# such as "Yanmega" and "Meganium" untouched
pokemon$Name <- gsub(".*Mega ", "Mega ", pokemon$Name)
# Separate names a little bit more
for (name in c(
  "Deoxys", "Wormadam", "Pumpkaboo", "Gourgeist", "Aegislash",
  "Meowstic", "Tornadus", "Thundurus", "Landorus", "Kyurem",
  "Keldeo", "Meloetta", "Darmanitan", "Giratina", "Shaymin",
  "Rotom", "Kyogre", "Groudon"
)) {
  pokemon$Name <- gsub(
    paste0("^", name), paste0(name, " "), pokemon$Name)
}
# Fix back names in previous list that now look like "Rotom " and "Kyurem "
for (name in c("Kyurem", "Rotom", "Kyogre", "Groudon")) {
  pokemon$Name <- gsub(
    paste0("^", name, " $"), name, pokemon$Name)
}
kable(head(pokemon))
| X. | Name | Type.1 | Type.2 | Total | HP | Attack | Defense | Sp..Atk | Sp..Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | Mega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | Charmeleon | Fire | | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
Next, we’re going to replicate what the Seaborn kernel in Python by Andrew Gelé does, just to show off some cool visualisations. It also shows that visualisations created in Python can be reproduced in R.
Firstly, a scatter plot / histogram combination. The left plot is the generic version that R can produce using the ggplot2 and ggExtra packages, and the right plot is a fancier, more colourful version. Both can be constructed with fairly simple code.
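The plotting code relies on a `corr_eqn()` helper whose definition is not shown in this post. The sketch below is a hypothetical stand-in, assuming the helper simply returns a text label with the Pearson correlation. The `digits` argument and the label format are my own assumptions.

```r
# Hypothetical stand-in for the corr_eqn() helper; the original is not shown
corr_eqn <- function(x, y, digits = 2) {
  # Pearson correlation, ignoring pairs with missing values
  r <- cor(x, y, use = "complete.obs")
  paste0("r = ", round(r, digits))
}

# Toy check on two perfectly correlated vectors
label <- corr_eqn(c(1, 2, 3, 4), c(2, 4, 6, 8))
print(label)
```

Returning a plain string keeps the label usable directly as the `label` of a text layer without needing `parse = TRUE`.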
library(ggplot2)    # plotting
library(ggExtra)    # ggMarginal()
library(gridExtra)  # grid.arrange()
# Firstly, drop unnecessary X., Legendary, Generation, Total columns
pokemonSeaborn <- pokemon %>%
  select(Name:Type.2, HP:Speed)
# corr_eqn() is a helper that is not shown in this post; it is expected to
# return a text label describing the HP / Attack correlation
corrLabel <- corr_eqn(pokemonSeaborn$HP, pokemonSeaborn$Attack)
# Then we can make a simple scatterplot / histogram combination.
# annotate() draws the label once, rather than once per row as geom_text() would
p <- ggplot(pokemonSeaborn, aes(HP, Attack)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  annotate("text", x = 190, y = 190, label = corrLabel)
p <- ggMarginal(p, type = "histogram")
# Let's make it fancy, blue, and also a box plot in the margins.
pColour <- ggplot(pokemonSeaborn, aes(HP, Attack)) +
  geom_point(colour = 'blue') +
  geom_smooth(method = lm, se = FALSE) +
  theme_classic() +
  annotate("text", x = 190, y = 190, label = corrLabel)
pColour <- ggMarginal(pColour, type = "boxplot", fill = 'blue')
grid.arrange(p, pColour, ncol = 2)
We can take a look at a box plot of a single variable easily enough using the ggplot2 package. An example is shown on the left for the HP attribute. We can also show box plots of all 6 attributes at once, which provides a decent visualisation. The HP attribute, for example, is much less varied across the entire Pokémon set, while the Attack, Sp. Atk and Speed attributes vary a bit more.
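A visual impression like this can be backed up with numbers by computing a spread measure per attribute. The sketch below does so on a small made-up data frame (the values are illustrative, not taken from the Pokémon data), using the interquartile range and the standard deviation.

```r
# Toy stats table (made-up values, not the Pokémon data)
stats <- data.frame(
  HP     = c(45, 60, 80, 39, 58),
  Attack = c(49, 62, 100, 52, 130),
  Speed  = c(45, 60, 80, 65, 120)
)

# Interquartile range and standard deviation for each column
spread <- data.frame(
  attribute = names(stats),
  iqr = sapply(stats, IQR),
  sd  = sapply(stats, sd),
  row.names = NULL
)
print(spread)
```

The same two `sapply()` calls should work on the numeric columns of the real table, assuming they are selected first.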
library(reshape2)  # melt()
pBoxplot <- ggplot(pokemonSeaborn, aes(x = '', y = HP)) +
  stat_boxplot(geom = 'errorbar') +
  xlab("") +
  geom_boxplot(fill = 'red')
# melt() stacks the six numeric attributes into variable / value pairs
pBoxPlots <- ggplot(data = melt(pokemonSeaborn), aes(x = variable, y = value)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(aes(fill = variable))
grid.arrange(pBoxplot, pBoxPlots, ncol = 2, widths = 1:2)
How about if we want insights on how Pokémon attributes compare across types? We can do that. The Seaborn package in Python uses swarm plots, which can be manipulated easily enough. In R, however, the corresponding swarm plot method (using the beeswarm package) generates plots with non-overlapping data points, and the resulting visualisations can become quite cluttered.
My original thought was simply to try a jitter plot and observe what happens. As shown, this splits up the data points by both type and attribute, but their horizontal positions are shifted randomly. A better way is to plot the data points for each attribute along vertical lines, one line per type, placed side by side within each attribute, as shown in the second plot. This way the data points within a type can overlap! Although there is a lot of information in this visualisation, it still looks pretty cool!
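The difference between the two approaches is easier to see on a tiny example. The sketch below uses a made-up long-format data frame (the values and the `type` column name are illustrative) to contrast random jitter with a fixed dodge per type.

```r
library(ggplot2)

# Toy long-format data (made-up values): two attributes, three types
toy <- data.frame(
  variable = rep(c("HP", "Attack"), each = 6),
  value    = c(45, 60, 80, 39, 58, 78, 49, 62, 82, 52, 64, 84),
  type     = rep(c("Grass", "Fire", "Water"), times = 4)
)

# Jitter: every point gets its own random horizontal offset
pJitter <- ggplot(toy, aes(x = variable, y = value, colour = type)) +
  geom_jitter(width = 0.3, height = 0)

# Dodge: every type gets one fixed column within each attribute
pDodge <- ggplot(toy, aes(x = variable, y = value, colour = type)) +
  geom_point(position = position_dodge(width = 0.8))
```

Printing `pJitter` and `pDodge` draws the two versions. The dodged one is reproducible without setting a seed, since no randomness is involved.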
library(RColorBrewer)  # brewer.pal()
# Firstly, reshape to long format: one row per Pokémon and attribute
pokemonSwarm <- pokemonSeaborn %>%
  melt(id.vars = c("Name", "Type.1", "Type.2"))
# Define colours by extending those in the colorbrewer Set 1 palette
# Can also try Set3, Accent or Paired for cool effects
colourCount <- length(unique(pokemonSwarm$Type.1))
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
# The Author (see top) recommended these colours instead
colours <- c("#8ED752", "#F95643", "#53AFFE", "#C3D221", "#BBBDAF",
             "#AD5CA2", "#F8E64E", "#F0CA42", "#F9AEFE", "#A35449",
             "#FB61B4", "#CDBD72", "#7673DA", "#66EBFF", "#8B76FF",
             "#8E6856", "#C3C1D7", "#75A4F9")
ggplot(pokemonSwarm, aes(x = variable, y = value)) +
  geom_jitter(aes(colour = Type.1)) +
  scale_color_manual(values = getPalette(colourCount)) +
  xlab("Attribute") +
  ylab("Value") +
  ggtitle("Pokemon Stats by Type 1") +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(pokemonSwarm, aes(x = variable, y = value, color = Type.1)) +
  geom_point(na.rm = TRUE, position = position_dodge(width = 0.8), size = 2) +
  theme_bw() +
  scale_color_manual(values = getPalette(colourCount)) +
  #scale_color_manual(values = colours) +
  xlab("Attribute") +
  ylab("Value") +
  ggtitle("Pokemon Stats by Type 1") +
  labs(color = ' Type 1') +
  theme(plot.title = element_text(hjust = 0.5))
The above two plots summarise the Seaborn tutorial comparison, but what if we wanted to further explore the individual attributes by type? We could just do a box plot of a single attribute such as Attack, Defense or Speed, split by type. Plain box plots are boring though, so let’s try different styles!
library(ggthemes)  # geom_tufteboxplot()
# Standard box plot of Attack by type
pokemonSwarmAttack <- pokemonSwarm %>%
  filter(variable == "Attack")
pBoxPlots <- ggplot(data = pokemonSwarmAttack, aes(x = Type.1, y = value)) +
  geom_boxplot(aes(fill = Type.1), lwd = 1.5) +
  xlab("Type 1") +
  ylab("Attack") +
  theme_bw() +
  theme(legend.position = "none")
pBoxPlots
# Tufte-style box plot of Defense by type
pokemonSwarmDefense <- pokemonSwarm %>%
  filter(variable == "Defense")
pBoxPlots <- ggplot(data = pokemonSwarmDefense, aes(x = Type.1, y = value)) +
  geom_tufteboxplot(aes(colour = Type.1, fill = Type.1), lwd = 2) +
  xlab("Type 1") +
  ylab("Defense") +
  theme_light() +
  theme(legend.position = "none")
pBoxPlots
# Violin plot of Speed by type
pokemonSwarmSpeed <- pokemonSwarm %>%
  filter(variable == "Speed")
pBoxPlots <- ggplot(data = pokemonSwarmSpeed, aes(x = Type.1, y = value)) +
  geom_violin(aes(fill = Type.1), lwd = 1.5) +
  xlab("Type 1") +
  ylab("Speed") +
  theme_classic() +
  theme(legend.position = "none")
pBoxPlots
There are a couple of interesting insights to note from these plots:
The violin plot also shows that there are 2 regions of Flying types, either super fast or just plain average. We can then for example explore this insight and study which Pokémon are fast and which are slow.
pokemonFliers <- pokemon %>%
  filter(Type.1 == 'Flying')
kable(pokemonFliers)
| X. | Name | Type.1 | Type.2 | Total | HP | Attack | Defense | Sp..Atk | Sp..Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 641 | Tornadus Incarnate Forme | Flying | | 580 | 79 | 115 | 70 | 125 | 80 | 111 | 5 | True |
| 641 | Tornadus Therian Forme | Flying | | 580 | 79 | 100 | 80 | 110 | 90 | 121 | 5 | True |
| 714 | Noibat | Flying | Dragon | 245 | 40 | 30 | 35 | 45 | 40 | 55 | 6 | False |
| 715 | Noivern | Flying | Dragon | 535 | 85 | 70 | 80 | 97 | 80 | 123 | 6 | False |
So it looks like Noibat is the slow culprit; however, it evolves into Noivern later on, which is a lot faster. Tornadus is of Legendary status, and is a fast flier in both of its formes from the looks of it.
Mary Vikhreva’s analysis uses Python to study whether two different dimension reduction techniques can differentiate Pokémon types or not. The first technique is the well-known Principal Component Analysis (PCA), which uses linear combinations of the variables to capture as much of the variance in the dataset as possible, and can provide insights into high-dimensional datasets. The second technique is t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a highly effective method for visualising multidimensional data in 2 dimensions. While the technique is highly effective for visualisation, it is trickier to interpret, as it is a nonlinear method that tries to preserve local neighbourhoods while still revealing global structure such as clusters. Think about this in terms of modelling blood flow: you’d need to capture both the high flows in arteries and the small flows in capillaries at the same time.
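As a quick illustration of what PCA produces, here is a minimal sketch using base R’s `prcomp()` on the built-in `iris` measurements, which stand in for the Pokémon stats purely for demonstration. It shows the share of variance captured by each component and the coordinates used for a 2-D scatterplot.

```r
# Built-in iris measurements stand in for the Pokémon stats here
X <- iris[, 1:4]

# Centre and scale each column, then rotate onto the principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(varExplained, 3))

# Coordinates of the first few rows on PC1 and PC2
print(head(pca$x[, 1:2]))
```

The components come out ordered by variance, so plotting PC1 against PC2 shows the two directions along which the data varies most.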
Below is the same comparison, performed in R instead. This requires the ‘Rtsne’ package, which provides the Rtsne() function used below. We’ll use all of the attributes except Generation to try to predict Pokémon Type 1. Firstly we need to standardise all of the data, since the Total attribute is on a larger scale than the rest of the variables. This way all of the variables will have mean 0 and standard deviation 1.
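The standardisation step can be checked on a small example. The sketch below applies base R’s `scale()` to two made-up columns on different scales and confirms that each ends up with mean 0 and standard deviation 1.

```r
# Toy columns on very different scales (made-up values)
toy <- data.frame(Total = c(318, 405, 525, 625), HP = c(45, 60, 80, 90))

# scale() subtracts each column's mean and divides by its standard deviation
toyScaled <- scale(toy)

print(round(colMeans(toyScaled), 10))  # column means after scaling
print(apply(toyScaled, 2, sd))         # column standard deviations after scaling
```

Without this step, a column like Total would dominate any distance-based method simply because its numbers are larger.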
We can then fit 2 models, one using PCA and the other using t-SNE. Let’s firstly plot all of the Pokémon and see if anything stands out overall. Shown below are the scatterplots from both models, coloured first by Type 1 and then by Legendary status. It looks like both models produce a cluster that stands out. Are these the Legendary Pokémon? Let’s find out… It looks like they are, and the t-SNE model separates the Legendary Pokémon more clearly than the PCA model by the looks of it. Bear in mind that LegendaryID is itself one of the input features, so some separation is to be expected. Hence we can at least use t-SNE to pick out the Legendary Pokémon in our dataset.
library(Rtsne)  # Rtsne()
# Standardization stuff
# factor() makes the integer coding work whether or not the column was read as a factor
pokemon$Type1ID <- as.integer(factor(pokemon$Type.1))
pokemon$LegendaryID <- as.integer(factor(pokemon$Legendary))
pokemon$GenerationID <- pokemon$Generation # Preserve integer value of Gen
features <- c('Total', 'HP', 'Attack', 'Defense', 'Sp..Atk',
              'Sp..Def', 'Speed', 'LegendaryID')
pokemon[, features] <- scale(pokemon[features])
set.seed(789)
pokemonTSNE <- Rtsne(pokemon[, features], eta = 500, check_duplicates = FALSE)
pokemonPCA <- prcomp(pokemon[, features], scale. = TRUE)
predictPCA <- predict(pokemonPCA, pokemon[, features])
pokemonPCAcols <- cbind(pokemon, as.data.frame(predictPCA)[, 1:2])
# PCA, coloured by Type 1
ggplot(pokemonPCAcols, aes(x = PC1, y = PC2)) +
  geom_point(aes(fill = Type.1),
             colour = 'black',
             pch = 21,
             size = 2) +
  theme_bw()
# PCA, coloured by Legendary status
ggplot(pokemonPCAcols, aes(x = PC1, y = PC2)) +
  geom_point(aes(fill = Legendary),
             colour = 'black',
             pch = 21,
             size = 2) +
  theme_bw()
# Easy fix to relate the TSNE components back to the data frame
pokemonTSNEcols <- cbind(pokemon, as.data.frame(pokemonTSNE$Y))
# t-SNE, coloured by Type 1
ggplot(pokemonTSNEcols, aes(x = V1, y = V2)) +
  geom_point(aes(fill = Type.1),
             colour = 'black',
             pch = 21,
             size = 2) +
  theme_bw()
# t-SNE, coloured by Legendary status
ggplot(pokemonTSNEcols, aes(x = V1, y = V2)) +
  geom_point(aes(fill = Legendary),
             colour = 'black',
             pch = 21,
             size = 2) +
  theme_bw()
Let’s plot 5 different Pokémon types against each other with the results from the t-SNE model. We should probably remove the Legendary Pokémon from the dataset first though, as the t-SNE layout depends on whether or not there are Legendary Pokémon in the dataset.
library(grid)  # textGrob() and gpar()
pokemon <- pokemon[pokemon$Legendary == "False", ]
set.seed(789)
pokemonTSNE <- Rtsne(pokemon[, features], eta = 500, check_duplicates = FALSE)
# removing the Generation variable destroys predictability of TSNE model
#pokemonTSNE <- Rtsne(subset(pokemon[,features], select=-c(Generation)), eta=500, check_duplicates=FALSE)
pokemonTSNEcols <- cbind(pokemon, as.data.frame(pokemonTSNE$Y))
# Get a set of types by name
# Get a set of type by ID
# Calculate how many types there are in the set
types1 <- unique(pokemon$Type.1)
types1IDs <- unique(pokemon$Type1ID)
numTypes1 <- length(types1)
# Make a grid and a place to store plots
start <- 10 # Used to plot different pokemon comparisons, can go up to numTypes1 - g
g <- 5 # grid size
p <- list()
colours <- c("red", "orange", "yellow", "green", "blue")
# Construct plots
for (row in 1:g) {
  for (col in 1:g) {
    Xi <- pokemonTSNEcols[which(pokemon$Type1ID == (start + row)), ]
    Xj <- pokemonTSNEcols[which(pokemon$Type1ID == (start + col)), ]
    # Take the type names from the filtered rows themselves, so the title
    # always matches the Type1ID being plotted
    plotTitle <- paste(Xi$Type.1[1], 'Vs', Xj$Type.1[1])
    p[[(row - 1) * g + col]] <- ggplot(Xi, aes(V1, V2)) +
      geom_point(colour = 'black',
                 fill = colours[row],
                 pch = 21,
                 size = 2) +
      geom_point(data = Xj,
                 aes(V1, V2),
                 colour = 'black',
                 fill = colours[col],
                 pch = 21,
                 size = 2) +
      theme_classic() +
      xlab("") +
      ylab("") +
      ggtitle(plotTitle) +
      theme(plot.title = element_text(hjust = 0.5))
  }
}
# Print them out
grid.arrange(grobs = p,
             ncol = g,
             top = textGrob('TSNE - Comparing different types of Pokémon',
                            gp = gpar(fontsize = 25))
)
It appears we can’t make any predictions about Pokémon types here. The plots themselves look cool though.
Let’s do one last thing, and group the Pokémon into 4 different clusters using k-means. The first plot shows the clusters found on the PCA projection, and the second shows those found on the t-SNE projection. We’ve removed the Legendary Pokémon for the time being to focus on the larger part of the dataset.
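For anyone who hasn’t used k-means in R before, here is a minimal sketch on synthetic 2-D points. The four centres and the noise are made up for illustration, and the `nstart = 10` setting is my own choice rather than something used in this analysis.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Toy 2-D points scattered around four made-up centres
centres <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
pts <- centres[rep(1:4, each = 25), ] + matrix(rnorm(200), ncol = 2)

# nstart = 10 tries several random starts and keeps the best fit
km <- kmeans(pts, centers = 4, nstart = 10)

print(table(km$cluster))  # number of points assigned to each cluster
print(km$centers)         # fitted cluster centres
```

The cluster numbers themselves are arbitrary labels, which is why the seed matters if you want cluster 1 to mean the same group from run to run.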
# first, need to redo PCA without Legendary status
# PCA isn't happy with the LegendaryID column having all False values
features <- c('Total', 'HP', 'Attack', 'Defense', 'Sp..Atk',
              'Sp..Def', 'Speed')
pokemonPCA <- prcomp(pokemon[, features], scale. = TRUE)
predictPCA <- predict(pokemonPCA, pokemon[, features])
pokemonPCAcols <- cbind(pokemon, as.data.frame(predictPCA)[, 1:2])
kClusters <- 4
set.seed(234)
kmeansTSNE <- kmeans(pokemonTSNEcols[, c('V1', 'V2')], kClusters)
set.seed(345)
kmeansPCA <- kmeans(pokemonPCAcols[, c('PC1', 'PC2')], kClusters)
# Clusters on the PCA projection
ggplot(pokemonPCAcols, aes(x = PC1, y = PC2)) +
  geom_point(aes(fill = factor(kmeansPCA$cluster)),
             colour = 'black',
             pch = 21,
             size = 2) +
  scale_fill_manual(values = c("black", "red", 'green', 'blue')) +
  guides(fill = guide_legend(title = "Cluster")) +
  theme_bw()
# Clusters on the t-SNE projection
ggplot(pokemonTSNEcols, aes(x = V1, y = V2)) +
  geom_point(aes(fill = factor(kmeansTSNE$cluster)),
             colour = 'black',
             pch = 21,
             size = 2) +
  scale_fill_manual(values = c("black", "red", 'green', 'blue')) +
  guides(fill = guide_legend(title = "Cluster")) +
  theme_bw()
So both models predict reasonably clustered Pokémon sets. Which attributes do these different clusters have in them? We’ll check the clusters using the t-SNE model.
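One simple way to compare clusters is to average an attribute within each one. The sketch below does this with base R’s `aggregate()` on a made-up table; the values and cluster labels are illustrative only.

```r
# Toy data (made-up values): a Total score and a cluster label per row
toy <- data.frame(
  Total   = c(300, 320, 500, 520, 610, 590),
  cluster = c(1, 1, 2, 2, 3, 3)
)

# Mean Total within each cluster
clusterMeans <- aggregate(Total ~ cluster, data = toy, FUN = mean)
print(clusterMeans)
```

The formula interface returns one row per cluster with sensibly named columns, so no renaming is needed afterwards.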
# Keep only the rows that were clustered (matched by row name)
pokemonCluster <- pokemonCluster[names(kmeansTSNE$cluster), ]
# Mean Total per cluster, as an HTML table
pokemonMeanKable <- kable(
  setNames(
    aggregate(pokemonCluster$Total, by = list(cluster = kmeansTSNE$cluster), mean),
    c('Cluster', 'Total')),
  'html')
pokemonCluster <- pokemonCluster %>% select(Name:Type.2, HP:Speed)
# Box plots of the six attributes, one plot per cluster
TSNEcluster1 <- pokemonCluster[which(kmeansTSNE$cluster == 1), ]
ggplot(data = melt(TSNEcluster1), aes(x = variable, y = value)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(aes(fill = variable)) +
  coord_cartesian(ylim = c(0, 175))
TSNEcluster2 <- pokemonCluster[which(kmeansTSNE$cluster == 2), ]
ggplot(data = melt(TSNEcluster2), aes(x = variable, y = value)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(aes(fill = variable)) +
  coord_cartesian(ylim = c(0, 175))
TSNEcluster3 <- pokemonCluster[which(kmeansTSNE$cluster == 3), ]
ggplot(data = melt(TSNEcluster3), aes(x = variable, y = value)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(aes(fill = variable)) +
  coord_cartesian(ylim = c(0, 175))
TSNEcluster4 <- pokemonCluster[which(kmeansTSNE$cluster == 4), ]
ggplot(data = melt(TSNEcluster4), aes(x = variable, y = value)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(aes(fill = variable)) +
  coord_cartesian(ylim = c(0, 175))
A summary of the 4 clusters is as follows, referring also to the table displaying the mean sum of attributes (Total) for each cluster:
So there are some neat little visualisations and techniques learned here that can be used in other projects. Hopefully they are useful later on! If you’ve stopped by just to take a look at what I’ve done, thanks for reading!