In the world of Pokémon academia, one name towers above any other – Professor Samuel Oak. While his colleague Professor Elm specializes in Pokémon evolution, Oak has dedicated his career to understanding the relationship between Pokémon and their human trainers. A former trainer himself, the professor has first-hand experience of how obstinate Pokémon can be – particularly when they hold legendary status.
For his latest research project, Professor Oak has decided to investigate the defining characteristics of legendary Pokémon to improve our understanding of their temperament. Hearing of our expertise in classification problems, he has enlisted us as the lead researchers.
Our journey begins at the professor’s research lab in Pallet Town, Kanto. The first step is to open up the Pokédex, an encyclopaedic guide to 801 Pokémon from all seven generations.
# Load the tidyverse
library(tidyverse)
# Import the dataset and convert variables
library(readxl)
pokedex <- read_excel("C:/Users/Luis/Desktop/R-Projects/pokedex.xlsx")
pokedex <- pokedex %>% mutate(name=as.factor(name),
height_m=as.numeric(height_m),
percentage_male=as.numeric(percentage_male),
type=as.factor(type),
weight_kg=as.numeric(weight_kg),
is_legendary = as.factor(is_legendary))
# Look at the first six rows
head(pokedex)
## # A tibble: 6 x 14
## pokedex_number name attack defense height_m hp percentage_male sp_attack
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Bulb~ 49 49 0.7 45 88.1 65
## 2 2 Ivys~ 62 63 1 60 88.1 80
## 3 3 Venu~ 100 123 2 80 88.1 122
## 4 4 Char~ 52 43 0.6 39 88.1 60
## 5 5 Char~ 64 58 1.1 58 88.1 80
## 6 6 Char~ 104 78 1.7 78 88.1 159
## # ... with 6 more variables: sp_defense <dbl>, speed <dbl>, type <fct>,
## # weight_kg <dbl>, generation <dbl>, is_legendary <fct>
# Examine the structure
pokedex <- pokedex %>% mutate(name=as.factor(name),
height_m=as.numeric(height_m),
percentage_male=as.numeric(percentage_male),
type=as.factor(type),
weight_kg=as.numeric(weight_kg))
After browsing the Pokédex, we can see several variables that could feasibly explain what makes a Pokémon legendary. We have a series of numerical fighter stats – attack, defense, speed and so on – as well as a categorization of Pokemon type (bug, dark, dragon, etc.). is_legendary is the binary classification variable we will eventually be predicting, tagged 1 if a Pokémon is legendary and 0 if it is not.
Before we explore these variables in any depth, let’s find out how many Pokémon are legendary out of the 801 total, using the handy count() function from the dplyr package.
# Prepare the data
legendary_pokemon <- pokedex %>%
count(is_legendary) %>%
mutate(prop = n / sum(n))
# Print the data frame
print(legendary_pokemon)
## # A tibble: 2 x 3
## is_legendary n prop
## <fct> <int> <dbl>
## 1 0 731 0.913
## 2 1 70 0.0874
We now know that there are 70 legendary Pokémon – a sizable minority at 9% of the population! Let’s start to explore some of their distinguishing characteristics.
First of all, we’ll plot the relationship between height_m and weight_kg for all 801 Pokémon, highlighting those that are classified as legendary. We’ll also add conditional labels to the plot, which will only print a Pokémon’s name if it is taller than 7.5m or heavier than 600kg.
library(ggplot2)
# Prepare the plot
legend_by_heightweight_plot <- pokedex %>%
ggplot(aes(x = height_m, y = weight_kg)) +
geom_point(aes(color = is_legendary), size = 2) +
geom_text(aes(label = ifelse(height_m > 7.5|weight_kg > 600, as.character(name), '')),
vjust = 0, hjust = 0) +
geom_smooth(method = "lm", se = FALSE, col = "black", linetype = "dashed") +
expand_limits(x = 16) +
labs(title = "Legendary Pokemon by height and weight",
x = "Height (m)",
y = "Weight (kg)") +
guides(color = guide_legend(title = "Pokémon status")) +
scale_color_manual(labels = c("Non-Legendary", "Legendary"),
values = c("#F8766D", "#00BFC4"))
# Print the plot
legend_by_heightweight_plot
It seems that legendary Pokémon are generally heavier and taller, but with many exceptions. For example, Onix (Gen 1), Steelix (Gen 2) and Wailord (Gen 3) are all extremely tall, but none of them have legendary status. There must be other factors at play.
We will now look at the effect of a Pokémon’s type on its legendary/non-legendary classification. There are 18 possible types, ranging from the common (Grass / Normal / Water) to the rare (Fairy / Flying / Ice). We will calculate the proportion of legendary Pokémon within each category, and then plot these proportions using a simple bar chart.
# Prepare the data
legend_by_type <- pokedex %>%
group_by(type) %>%
mutate(is_legendary = as.numeric(is_legendary) - 1) %>%
summarise(prop_legendary = mean(is_legendary)) %>%
ungroup() %>%
mutate(type = fct_reorder(type, prop_legendary))
# Prepare the plot
legend_by_type_plot <- legend_by_type %>%
ggplot(aes(x = type, y = prop_legendary, fill = prop_legendary)) +
geom_col() +
labs(title = "Legendary Pokemon by type") +
coord_flip() +
guides(fill = FALSE)
# Print the plot
legend_by_type_plot
There are clear differences between Pokémon types in their relation to legendary status. While more than 30% of flying and psychic Pokémon are legendary, there is no such thing as a legendary poison or fighting Pokémon!
Before fitting the model, we will consider the influence of a Pokémon’s fighter stats (attack, defense, etc.) on its status. Rather than considering each stat in isolation, we will produce a boxplot for all of them simultaneously using the facet_wrap() function.
# Prepare the data
legend_by_stats <- pokedex %>%
select(is_legendary, attack, sp_attack, defense, sp_defense, hp, speed) %>%
gather(key = fght_stats, value = value, -is_legendary)
# Prepare the plot
legend_by_stats_plot <- legend_by_stats %>%
ggplot(aes(x = is_legendary, y = value, fill = is_legendary)) +
geom_boxplot(varwidth = TRUE) +
facet_wrap(~fght_stats) +
labs(title = "Pokemon fight statistics",
x = "Legendary status") + scale_fill_discrete(name="Legendary",
breaks=c("0", "1"),
labels=c("No Legendary", "Legendary"))
# Print the plot
legend_by_stats_plot
As we might expect, legendary Pokémon outshine their ordinary counterparts in all fighter stats. Although we haven’t formally tested a difference in means, the boxplots suggest a significant difference with respect to all six variables. Nonetheless, there are a number of outliers in each case, meaning that some legendary Pokémon are anomalously weak.
We have now explored all of the predictor variables we will use to explain what makes a Pokémon legendary. Before fitting our model, we will split the pokedex into a training set (pokedex_train) and a test set (pokedex_test). This will allow us to test the model on unseen data.
# Set seed for reproducibility
set.seed(1234)
# Save number of rows in dataset
n = nrow(pokedex)
# Generate 60% sample of rows
sample_rows <- sample(n, n*0.6)
# Create training set
pokedex_train <- pokedex %>%
filter(row_number() %in% sample_rows)
# Create test set
pokedex_test <- pokedex %>%
filter(!row_number() %in% sample_rows)
Now we have our training and test sets, we can go about building our classifier. But before we fit a random forest, we will fit a simple classification decision tree. This will give us a baseline fit against which to compare the results of the random forest, as well as an informative graphical representation of the model.
Here, and also in the random forest, we will omit incomplete observations by setting the na.action argument to na.omit. This will remove a few Pokémon with missing values for height_m and weight_kg from the training set. Remember the warning messages when we made our height/weight plot in Task 3? These are the Pokémon to blame!
# Load packages and set seed
library(rpart)
library(rpart.plot)
set.seed(1234)
# Fit decision tree
model_tree <- rpart(is_legendary ~ attack + defense + height_m +
hp + sp_attack + sp_defense + speed + type + weight_kg,
data = pokedex_train,
method = "class",
na.action = na.omit)
# Plot decision tree
rpart.plot(model_tree)
Each node of the tree shows the predicted class, the probability of being legendary, and the percentage of Pokémon in that node. The bottom-left node, for example – for those with sp_attack < 118 and weight_kg < 169 – represents 84% of Pokémon in the training set, predicting that each only has a 3% chance of being legendary.
Decision trees place the most important variables at the top and exclude any they don’t find to be useful. In this case, sp_attack occupies node 1 while attack, defense, sp_defense and height_m are all excluded.
However, decision trees are unstable and sensitive to small variations in the data. It therefore makes sense to fit a random forest – an ensemble method that averages over several decision trees all at once. This should give us a more robust model that classifies Pokémon with greater accuracy.
# Load package and set seed
library(randomForest)
set.seed(1234)
# Fit random forest
model_forest <- randomForest(is_legendary ~ attack + defense + height_m +
hp + sp_attack + sp_defense + speed + type + weight_kg,
data = pokedex_train,
importance = TRUE,
na.action = na.omit)
# Print model output
print(model_forest)
##
## Call:
## randomForest(formula = is_legendary ~ attack + defense + height_m + hp + sp_attack + sp_defense + speed + type + weight_kg, data = pokedex_train, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 6.25%
## Confusion matrix:
## 0 1 class.error
## 0 429 5 0.01152074
## 1 25 21 0.54347826
Looking at the model output, we can see that the random forest has an out-of-bag (OOB) error of 7.48%, which isn’t bad by most accounts. However, since there are 24 true positives and 24 false negatives, the model only has a recall of 50%, which means that it struggles to successfully retrieve every legendary Pokémon in the dataset.
In order to allow direct comparison with the decision tree, we will plot the ROC curves for both models using the ROCR package, which will visualize their true positive rate (TPR) and false positive rate (FPR) respectively. The closer the curve is to the top left of the plot, the higher the area under the curve (AUC) and the better the model.
# Load the ROCR package
library(ROCR)
# Create prediction and performance objects for the decision tree
probs_tree <- predict(model_tree, pokedex_test, type = "prob")
pred_tree <- prediction(probs_tree[,2], pokedex_test$is_legendary)
perf_tree <- performance(pred_tree, "tpr", "fpr")
# Create prediction and performance objects for the random forest
probs_forest <- predict(model_forest, pokedex_test, type = "prob")
pred_forest <- prediction(probs_forest[,2], pokedex_test$is_legendary)
perf_forest <- performance(pred_forest, "tpr", "fpr")
# Plot the ROC curves: first for the decision tree, then for the random forest
plot(perf_tree, col = "red", main = "ROC curves")
plot(perf_forest, add = TRUE, col = "blue")
legend(x = "bottomright", legend = c("Decision Tree", "Random Forest"), fill = c("red", "blue"))
It’s clear from the ROC curves that the random forest is a substantially better model, boasting an AUC (not calculated above) of 91% versus the decision tree’s 78%. When calculating variable importance, it makes sense to do so with the best model available, so we’ll use the random forest for the final part of our analysis.
Note that a random forest returns two measures of variable importance:
MeanDecreaseAccuracy – how much the model accuracy suffers if you leave out a particular variable MeanDecreaseGini – the degree to which a variable improves the probability of an observation being classified one way or another (i.e. ‘node purity’). Together, these two measures will allow us to answer our original research question – what makes a Pokémon legendary?
# Print variable importance measures
importance_forest <- importance(model_forest)
importance_forest
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## attack 5.814993 10.153144 10.735059 5.433762
## defense 3.734499 7.665729 7.225728 4.567162
## height_m 5.681012 12.729027 11.844716 11.441686
## hp 4.679572 17.386117 14.533107 8.004268
## sp_attack 4.042541 21.039516 14.604964 12.493568
## sp_defense 5.780318 12.127445 11.308429 9.356214
## speed 4.883203 21.136570 15.996784 10.573576
## type 3.957582 4.360565 5.628097 9.911056
## weight_kg 9.807129 8.595080 12.623416 10.193048
# Create a dotchart of variable importance
varImpPlot_forest <- varImpPlot(model_forest)
varImpPlot_forest
## MeanDecreaseAccuracy MeanDecreaseGini
## attack 10.735059 5.433762
## defense 7.225728 4.567162
## height_m 11.844716 11.441686
## hp 14.533107 8.004268
## sp_attack 14.604964 12.493568
## sp_defense 11.308429 9.356214
## speed 15.996784 10.573576
## type 5.628097 9.911056
## weight_kg 12.623416 10.193048
According to the variable importance plot, sp_attack is the most important factor in determining whether or not a Pokémon is legendary, followed by speed. The plot doesn’t tell us whether the variables have a positive or a negative effect, but we know from our exploratory analysis that the relationship is generally positive. We therefore conclude that legendary Pokémon are characterized primarily by the power of their special attacks and secondarily by their speediness, while also exhibiting higher fighting abilities across the board.
Congratulations on completing your research into legendary Pokémon – Professor Oak is excited to share the findings! To finish, we’ll answer a few of his questions about the variable importance results.
# According to MeanDecreaseAccuracy:
# Q1. Is attack or defense more important?
answer1 <- "attack"
# Q2. Is weight_kg or height_m more important?
answer2 <- "weight_kg"
# According to MeanDecreaseGini:
# Q3. Is attack or defense more important?
answer3 <- "defense"
# Q4. Is weight_kg or height_m more important?
answer4 <- "weight_kg"