This little project is one that I borrowed from DataCamp, where they’ve challenged users to build a random forest model that can accurately predict whether a Pokemon is legendary or not. A decision tree is a machine learning algorithm that takes in a set of inputs (in this case the Pokemon and their attributes) and produces a tree-like model of decisions and possible outcomes to predict a result (in this case, whether the Pokemon is legendary or not). A random forest combines hundreds or even thousands of trees, which usually gives a more accurate prediction. The traits and characteristics of Pokemon, legendary and non-legendary alike, have changed a lot over the years, and I was rather interested to see if either of these models could accurately predict legendary status.

The data used in this particular project comes from a Kaggle data set called The Complete Pokemon Dataset. This data set contains the first 801 Pokemon, beginning with Bulbasaur and ending with the Steel/Fairy type Pokemon Magearna. Each row in the data set is a different Pokemon and the data set contains a total of 41 variables, with the “is_legendary” variable being the outcome we’re trying to predict. Many of the columns in this data set correspond to the individual Pokemon’s weaknesses and resistances against different types. For this study, I’ll be creating a data frame that leaves out these variables as well as a few others. The variable “is_legendary” is read in as an integer (1 if the Pokemon is legendary and 0 if it is not), so in order to make things easier to read I recoded 1 as “Legendary” and 0 as “Non-Legendary”.

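library(readr)         # read_csv()
library(tree)          # tree(), cv.tree(), prune.misclass()
library(randomForest)  # randomForest(), varImpPlot()

pokemon_csv <- read_csv("D:/Data Analytics/my data setz/pokemon.csv")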


pkmn_df <- pokemon_csv[c('name', 'pokedex_number', 'type1', 'type2', 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'weight_kg',
                         'height_m', 'is_legendary')]

class(pkmn_df$is_legendary)
## [1] "integer"
pkmn_df$is_legendary <- as.factor(pkmn_df$is_legendary)
levels(pkmn_df$is_legendary) <- c("Non-Legendary", "Legendary")
class(pkmn_df$is_legendary)
## [1] "factor"
legendaries <- subset(pkmn_df, is_legendary == "Legendary")

str(pkmn_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    801 obs. of  13 variables:
##  $ name          : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
##  $ pokedex_number: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ type1         : chr  "grass" "grass" "grass" "fire" ...
##  $ type2         : chr  "poison" "poison" "poison" NA ...
##  $ hp            : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ attack        : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ defense       : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ sp_attack     : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defense    : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed         : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ weight_kg     : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ height_m      : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ is_legendary  : Factor w/ 2 levels "Non-Legendary",..: 1 1 1 1 1 1 1 1 1 1 ...
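
The recoding above works because as.factor() sorts the levels as 0, 1 before they are renamed. A slightly more defensive variant, sketched here on a made-up five-element vector rather than the real column, spells out the mapping explicitly so that it does not depend on level order:

```r
# Toy stand-in for pkmn_df$is_legendary (hypothetical values, not the real data)
is_legendary <- c(0L, 0L, 1L, 0L, 1L)

# Map 0 -> "Non-Legendary" and 1 -> "Legendary" explicitly
status <- factor(is_legendary,
                 levels = c(0, 1),
                 labels = c("Non-Legendary", "Legendary"))

table(status)
```
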

What kind of Pokemon are you?

To get an idea of what I was dealing with, I created some initial statistics and visualizations. In this data set there are 731 non-legendary and 70 legendary Pokemon. When we plot the legendary Pokemon by their primary type (type1), we see that the two most common types are Psychic (24%) and Dragon (10%).

## 
## Non-Legendary     Legendary 
##           731            70

As the name implies, legendary Pokemon tend to be in a class of their own, outperforming non-legendary Pokemon with their superior speed and power. My gamer and analyst instincts made me wonder whether there was any correlation among the six stats and whether I could identify any patterns among the legendaries in relation to their stats. Using the ggpairs function from the GGally package, I was able to create the below scatter plot matrix that reveals the various correlations among the six stats, with legendaries colored in blue and non-legendaries colored in red. I was a little disappointed that none of the stats showed any particularly strong correlations. I was hoping Defense and Speed or Defense and Special Attack would be inversely correlated; instead, Special Defense and Defense had the highest correlation (.53). The below scatter plots do show, however, that legendary Pokemon, on average, have higher stats than non-legendaries, as they tend to gravitate towards the upper right portion of the scatter plots. What’s even more interesting is the way the legendaries seem to cluster together on certain plots (specifically attack vs special attack, defense vs speed, and pretty much all the combinations of HP). It would be interesting to see if a clustering method like K-Means, or a classifier such as an SVM, produces good models.
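
The correlation coefficients quoted in this section can also be pulled out as plain numbers with base R's cor(). The sketch below is only an illustration: it uses the first ten Pokemon from the str() output shown earlier, so its values will not match the full-data figures such as .53, and the names stats, cor.mat, and strongest are my own.

```r
# First ten rows of the six stats, copied from the str() output above
stats <- data.frame(
  hp         = c(45, 60, 80, 39, 58, 78, 44, 59, 79, 45),
  attack     = c(49, 62, 100, 52, 64, 104, 48, 63, 103, 30),
  defense    = c(49, 63, 123, 43, 58, 78, 65, 80, 120, 35),
  sp_attack  = c(65, 80, 122, 60, 80, 159, 50, 65, 135, 20),
  sp_defense = c(65, 80, 120, 50, 65, 115, 64, 80, 115, 20),
  speed      = c(45, 60, 80, 65, 80, 100, 43, 58, 78, 45)
)

# Pairwise Pearson correlations, rounded for readability
cor.mat <- round(cor(stats), 2)
print(cor.mat)

# Find the most strongly correlated pair, ignoring the diagonal
off.diag <- cor.mat
diag(off.diag) <- NA
idx <- which(off.diag == max(off.diag, na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest <- c(rownames(cor.mat)[idx["row"]], colnames(cor.mat)[idx["col"]])
print(strongest)
```

On the full data frame the same call would be cor(pkmn_df[, c("hp", "attack", "defense", "sp_attack", "sp_defense", "speed")]).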

I decided to take a closer look at the combinations of stats that had the highest correlations. When comparing defense and special defense (Corr: .526), many non-legendaries outclass the legendaries in terms of defense. It appears that legendary Pokemon may be somewhat weak against physical attacks: although still above average, their defense tends to be lower than some of their other stats.

Special Attack and Special Defense had a correlation coefficient of 0.511. Many legendary Pokemon boast high special attack with Mewtwo being the most powerful special attacker, and Kyogre, the legendary sea Pokemon, having the most balanced combination of Special Attack and Special Defense.


You may have noticed the name Shuckle appear as an outlier in the previous two plots. Shuckle is a non-legendary Bug/Rock type Pokemon with some of the very highest Defense and Special Defense values in the game; its other stats are meager, however, which could be why Shuckle is considered low-tier in competitive play.


I also wanted to take a look at the legendaries’ physical characteristics (weight and height) to see if I could find any relationships there as well. Most legendary Pokemon were under 450 kg, with spikes at 50 kg (110.23 lbs) and 200 kg (440.92 lbs). Most were under 5 meters in height, and they seemed to be largely concentrated around 2.5 meters (8.2 ft) or less. In the scatter plot below, you can clearly see the concentration of legendary Pokemon in these regions. It would appear that many legendary Pokemon tend to be short in stature and lightweight.


Decision Tree

Finally, I decided to try my hand at some actual modeling! I created the training set by randomly selecting two-thirds of the observations (534) from the entire data frame, and the test set consisted of the remaining one-third (267). I then used the tree() function from the tree package to create the decision tree. The purpose of this project was to create a random forest, but I wanted to start out small and see how well a simple decision tree would perform. The resulting tree diagram would also give me an indication of what variables were most important in predicting the outcome. Using only the six stats in addition to weight_kg and height_m as predictors, the decision tree achieved a training misclassification rate of 3.63%, misclassifying only 19 of the 523 training set observations it used. (By default the tree() function removes any observations with missing values. Eleven Pokemon in the training set had a height and/or weight of NA and were thus excluded from the model.)

set.seed(2)
train <- sample(1:nrow(pkmn_df), (2/3)*nrow(pkmn_df))
test.data <- pkmn_df[-train,]
legendary.test <- pkmn_df$is_legendary[-train]

tree.pkmn <- tree(is_legendary ~ hp + attack + defense + sp_attack + sp_defense + speed + weight_kg + height_m, pkmn_df, subset = train)
summary(tree.pkmn) #misclass of training set is 0.03
## 
## Classification tree:
## tree(formula = is_legendary ~ hp + attack + defense + sp_attack + 
##     sp_defense + speed + weight_kg + height_m, data = pkmn_df, 
##     subset = train)
## Variables actually used in tree construction:
## [1] "weight_kg" "sp_attack" "attack"    "speed"     "defense"   "hp"       
## Number of terminal nodes:  15 
## Residual mean deviance:  0.1332 = 67.68 / 508 
## Misclassification error rate: 0.03633 = 19 / 523
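
The denominator of 523 (rather than 534) comes from rows dropped for missing values. A minimal sketch of how one might count such rows, using a made-up five-row data frame called toy instead of the real training set:

```r
# Hypothetical mini data set with some missing weights and heights
toy <- data.frame(
  speed     = c(45, 60, 80, 65, 100),
  weight_kg = c(6.9, NA, 100, 8.5, NA),
  height_m  = c(0.7, 1.0, NA, 0.6, NA)
)

# Rows with a missing weight and/or height
n.missing <- sum(!complete.cases(toy[, c("weight_kg", "height_m")]))

# Rows a model that drops incomplete observations would actually use
n.used <- nrow(na.omit(toy))

c(missing = n.missing, used = n.used)
```

Running the same two lines on pkmn_df[train, ] should, if the explanation above is right, give 11 missing and 523 used.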

When the decision tree was applied to the test set it achieved an 89% success rate, misclassifying only 29 of the 267 Pokemon in this set.

tree.pred <- predict(tree.pkmn, test.data, type = "class")
table(tree.pred, legendary.test)
##                legendary.test
## tree.pred       Non-Legendary Legendary
##   Non-Legendary           227         7
##   Legendary                22        11
(227+11)/267  #89% success rate
## [1] 0.8913858
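
Because only a small share of Pokemon are legendary, overall accuracy can hide how well the legendaries themselves are being caught. The sketch below re-enters the confusion matrix printed above by hand and derives a few extra numbers from it; the variable names are mine, and the "baseline" is simply what you would score by predicting Non-Legendary for every test Pokemon.

```r
# Confusion matrix copied from the table above (rows = predicted, cols = actual)
conf <- matrix(c(227, 22, 7, 11), nrow = 2,
               dimnames = list(predicted = c("Non-Legendary", "Legendary"),
                               actual    = c("Non-Legendary", "Legendary")))

accuracy    <- sum(diag(conf)) / sum(conf)                             # 238 / 267
sensitivity <- conf["Legendary", "Legendary"] / sum(conf[, "Legendary"])  # legendaries caught
precision   <- conf["Legendary", "Legendary"] / sum(conf["Legendary", ])  # "Legendary" calls that were right
baseline    <- sum(conf[, "Non-Legendary"]) / sum(conf)                # always guess Non-Legendary

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, baseline = baseline), 3)
```

Read this way, the tree finds 11 of the 18 legendaries in the test set, while its overall accuracy sits a little below the always-Non-Legendary baseline of 249/267.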

The initial tree produced the below dendrogram. Weight seems to be the most significant predictor, followed by special attack and speed.

These fully grown trees tend to be overly complex and can lead to poor test set performance, so I used the cv.tree() function to decide how far to “prune” the tree (pruning results in a smaller tree, but it will likely perform better on the test set). Looking at the plot below, we can see that the cross-validated error is lowest for a tree with 7 terminal nodes.
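
The cross-validation code itself isn't shown in this post, so here is a minimal sketch of how cv.tree() is commonly used to pick a tree size. It runs on R's built-in iris data so that it is self-contained; the seed and the names fit, cv.fit, best.size, and pruned are my own choices, not the original analysis.

```r
library(tree)

set.seed(3)  # cross-validation folds are random

# Grow a full classification tree on a built-in data set
fit <- tree(Species ~ ., data = iris)

# Cross-validate the pruning sequence, scoring by misclassification
cv.fit <- cv.tree(fit, FUN = prune.misclass)

# size = number of terminal nodes, dev = cross-validated misclassifications
plot(cv.fit$size, cv.fit$dev, type = "b",
     xlab = "Terminal nodes", ylab = "CV misclassifications")

# Pick the size with the lowest cross-validated error and prune to it
best.size <- cv.fit$size[which.min(cv.fit$dev)]
pruned    <- prune.misclass(fit, best = best.size)
```

For the Pokemon model the equivalent call would be cv.tree(tree.pkmn, FUN = prune.misclass), and the chosen size is what gets passed as best to prune.misclass().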

I then applied this optimized tree to the test data set and it indeed performed better than the fully grown decision tree. This model achieved a 90% success rate, only misclassifying 27 of the observations in the test set. Natu bad!

prune.tree <- prune.misclass(tree.pkmn, best = 7)
prune.tree.pred <- predict(prune.tree, test.data, type = "class")
table(prune.tree.pred, legendary.test)
##                legendary.test
## prune.tree.pred Non-Legendary Legendary
##   Non-Legendary           230         8
##   Legendary                19        10
240/267   #90% success rate with pruned tree
## [1] 0.8988764

Looking at the dendrogram for the pruned tree, we see that again, weight, special attack, and speed are the three strongest predictors of legendary status. I find the left side of the tree very interesting; all of the nodes (leaves/ends) are Non-Legendary with the exception of one. It’s basically saying, “If you’re under 171.5 kg, and have high special attack, speed, and defense, then you’re probably a legendary”. The right side of the tree is a little more straightforward, but interesting nonetheless. If you’re a heavier Pokemon and you have high speed (the legendary dogs Suicune, Raikou, and Entei come to mind), then you’re a legendary. Similarly, if you’re a heavier Pokemon and also have low speed and low attack (looking at you Regirock, Regice, and Registeel) then you’re also probably a legendary.


Decision trees are cool because while they aren’t the most powerful predictive model, they’re easy to understand: the output produces this little diagram called a dendrogram that, much like our friend Sudowoodo here, mimics a tree. Sudowoodo (classified as the imitation Pokemon) is a humanoid tree-like Pokemon originally from the Johto region. Its name is actually a play on words: pseudo (a prefix meaning false) and wood. :)


The Random Forest

As mentioned before, a random forest grows hundreds or even thousands of decision trees and combines their predictions (for classification, by majority vote) to predict the outcome. It would be Farfetch’d to say that random forests are less accurate than decision trees.

set.seed(2)
rf.pkmn <- randomForest(is_legendary ~ hp + attack + defense + sp_attack + sp_defense + speed + weight_kg + height_m, 
                         data = pkmn_df, subset = train, importance = TRUE, na.action = na.omit, ntree = 500)

When applying the random forest model to the test data set, the model correctly classified 245 of the 258 test Pokemon it could score (about 95%), misclassifying only thirteen! The remaining nine test Pokemon do not appear in the table below, most likely because their height and/or weight is NA and the forest cannot produce a prediction for them. The random forest I stuck with only used 500 trees. I tried using 1,000, 1,500, and 2,000 trees; however, none of those models significantly reduced the error rate.

rf.pkmn  #OOB error 6.69%
## 
## Call:
##  randomForest(formula = is_legendary ~ hp + attack + defense +      sp_attack + sp_defense + speed + weight_kg + height_m, data = pkmn_df,      importance = TRUE, ntree = 500, subset = train, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 6.69%
## Confusion matrix:
##               Non-Legendary Legendary class.error
## Non-Legendary           467         5  0.01059322
## Legendary                30        21  0.58823529
rf.pred <- predict(rf.pkmn, newdata = test.data)
table(rf.pred, legendary.test)
##                legendary.test
## rf.pred         Non-Legendary Legendary
##   Non-Legendary           237        10
##   Legendary                 3         8
(237 + 8) / (237 + 10 + 3 + 8)  #accuracy on the 258 scored test Pokemon
## [1] 0.9496124
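
One thing worth noticing is that the random forest table above sums to 258, not 267. table() silently drops missing predictions, and my assumption is that the nine absent Pokemon are those with an NA height or weight. The toy vectors below (hypothetical, five observations) show how to count unscored rows and how the accuracy changes depending on whether they are left out or counted as misses:

```r
lv <- c("Non-Legendary", "Legendary")

# Hypothetical truth and predictions; the third prediction is missing
truth <- factor(c("Non-Legendary", "Legendary", "Non-Legendary",
                  "Legendary", "Non-Legendary"), levels = lv)
pred  <- factor(c("Non-Legendary", "Legendary", NA,
                  "Non-Legendary", "Non-Legendary"), levels = lv)

tab <- table(pred, truth)          # the NA prediction is not counted
n.unscored <- sum(is.na(pred))     # how many rows got no prediction

acc.scored <- mean(pred == truth, na.rm = TRUE)               # scored rows only
acc.all    <- sum(pred == truth, na.rm = TRUE) / length(truth) # NA counted as a miss

c(in.table = sum(tab), unscored = n.unscored,
  acc.scored = acc.scored, acc.all = acc.all)
```
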

While random forests are more powerful than their smaller counterparts, things can obviously get all tangela and clef-hairy when averaging together hundreds of trees, and the downside is that we lose the dendrogram. However, since the random forest collected information from so many trees, it can still identify which variables are the most important predictors. Below, the plot on the left indicates that speed, weight, and HP were the best predictors: randomly shuffling the values of any one of them causes the largest drop in the forest’s out-of-bag accuracy. Similarly, the plot on the right shows which variables produced the largest total decrease in the Gini index, a metric that measures node impurity. Just because a node displays “Legendary” doesn’t mean that every observation that landed in it is a Legendary, but most of them were; the lower a node’s Gini index, the more it is dominated by a single class. So, the plot on the right is saying that splits on weight, speed, height, special attack, and defense produced the purest nodes. This lines up nicely with the logic we saw earlier on the left side of the decision tree!

varImpPlot(rf.pkmn)
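
To make the Gini plot a bit more concrete, here is a small base R sketch of the Gini index of a node and the decrease produced by one split. The parent counts (472 non-legendary, 51 legendary) are the training-set class totals from the OOB confusion matrix above; the left/right child counts are invented purely for illustration, and gini() is my own helper, not a randomForest function.

```r
# Gini index of a node from its class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(non.legendary = 472, legendary = 51)  # from the OOB confusion matrix
left   <- c(non.legendary = 450, legendary = 20)  # hypothetical child node
right  <- c(non.legendary = 22,  legendary = 31)  # hypothetical child node

# Decrease in impurity = parent impurity minus size-weighted child impurity
n <- sum(parent)
decrease <- gini(parent) -
  (sum(left) / n) * gini(left) -
  (sum(right) / n) * gini(right)

c(parent = gini(parent), left = gini(left),
  right = gini(right), decrease = decrease)
```

The MeanDecreaseGini value for a variable is built from decreases like this one, accumulated over every split that uses the variable and averaged across the trees in the forest.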

So what is the stuff of legends made of?

Based on the results of the optimized decision tree and random forest, a Pokemon’s physical characteristics (specifically weight) seem to be the biggest indicators of legendary status. Both the decision tree and gini index plot showed that weight was the most important predictor, and from here I think we can divide the legendaries into three distinct groups based on their characteristics.

Legendary Titans

These legendary Pokemon are heavy, slow, and have low attack power. They don’t seem to have any stats that stand out; however, their defense and special defense tend to be higher than average when compared to the other legendary Pokemon, which further emphasizes their bulkiness. Not surprisingly, the Regi family, Heatran, and Celesteela fall into this category.

Fairy Legendary Pokemon

If you are low-weight, fast, and have high special attack and defense then you’re probably a legendary. This makes intuitive sense because if you’re small and have unnaturally high stats then you’re certainly a force to be reckoned with. I mean, Celebi and Poliwag are both 0.6 meters tall, but the former can travel through time while the latter is just kind of a tadpole. Pokemon like Mew, Celebi, Jirachi, Victini, and Manaphy fall into this category.

Continent Breakers and World Enders

These behemoths are heavy, fast, and boast high attack and special attack power. Not only are they offensive powerhouses, but most of them have considerable defensive capabilities as well. This echelon of creatures is truly well-rounded and proves why legendary Pokemon are in a class above the rest. Thinking back to the correlation plot from earlier, speed was most highly correlated with special attack, which in turn was highly correlated with attack. Additionally, defense and special defense had the highest correlation out of all the possible pairs of stats. Given these relationships, as well as their inherent similarities, it makes sense that many of these Pokemon come in pairs or trios: Suicune/Raikou/Entei, Lugia/Ho-Oh, Kyogre/Groudon/Rayquaza, Dialga/Palkia (I could keep going but you get the idea). Pokemon lore states that clashes between some of these Pokemon shaped the continents while others control time and space.

The Misclassifieds

I also wanted to look at the Pokemon that were misclassified by the optimized decision tree. Most of the incorrect predictions (19 of 27) were non-legendary Pokemon that the tree classified as legendary. Pokemon like Houndoom, Kingdra, Rotom, and Togekiss are lightweight, fast, and have high special attack. These little guys pack a decent punch and could easily be mistaken for being in the Fairy Legendary class. Onix, Lapras, Metang, Bronzong, and Wailord have heavy builds and were likely mistaken for legendary titans. However, the question that irked me the most is why was MEWTWO misclassified?! Turns out Mewtwo has low weight, absurdly high special attack, and ridiculous speed, but his defense is lackluster; looks like we found the chink in your armor, buddy. What if Ash had actually gotten a punch in on this guy?

##         name  hp attack defense sp_attack sp_defense speed weight_kg
## 1   Venusaur  80    100     123       122        120    80     100.0
## 2       Onix  35     45     160        30         45    70     210.0
## 3     Seadra  55     65      95        95         45    85      25.0
## 4     Lapras 130     85      80        85         95    60     220.0
## 5     Zapdos  90     90      85       125         90   100      52.6
## 6     Mewtwo 106    150      70       194        120   140     122.0
## 7   Houndoom  75     90      90       140         90   115      35.0
## 8    Kingdra  75     95      95        95         95    85     152.0
## 9    Slaking 150    160     100        95         65   100     130.5
## 10   Wailord 170     90      45        90         45    60     398.0
## 11 Salamence  95    145     130       120         90   120     102.6
## 12    Metang  60     75     100        55         80    50     202.5
## 13  Bronzong  67     89     116        79        116    33     187.0
## 14  Garchomp 108    170     115       120         95    92      95.0
## 15  Togekiss  85     50      95       120        115    80      38.0
## 16 Probopass  60     55     145        75        150    40     340.0
## 17     Rotom  50     65     107       105        107    86       0.3
## 18     Azelf  75    125      70       125         70   115       0.3
## 19 Cresselia 120     70     120        75        130    85      85.6
## 20   Aurorus 123     77      72        99         92    58     225.0
## 21 Volcanion  80    110     120       130         90    70     195.0
## 22 Palossand  85     75     110       100         75    35     250.0
## 23    Drampa  78     60      85       135         91    36     185.0
## 24   Kommo-o  75    110     125       100        105    85      78.2
## 25  Nihilego 109     53      47       127        131   103      55.5
## 26  Necrozma  97    107     101       127         89    79     230.0
## 27  Magearna  80     95     115       130        115    65      80.5
##    height_m  is_legendary prune.tree.pred
## 1       2.0 Non-Legendary       Legendary
## 2       8.8 Non-Legendary       Legendary
## 3       1.2 Non-Legendary       Legendary
## 4       2.5 Non-Legendary       Legendary
## 5       1.6     Legendary   Non-Legendary
## 6       2.0     Legendary   Non-Legendary
## 7       1.4 Non-Legendary       Legendary
## 8       1.8 Non-Legendary       Legendary
## 9       2.0 Non-Legendary       Legendary
## 10     14.5 Non-Legendary       Legendary
## 11      1.5 Non-Legendary       Legendary
## 12      1.2 Non-Legendary       Legendary
## 13      1.3 Non-Legendary       Legendary
## 14      1.9 Non-Legendary       Legendary
## 15      1.5 Non-Legendary       Legendary
## 16      1.4 Non-Legendary       Legendary
## 17      0.3 Non-Legendary       Legendary
## 18      0.3     Legendary   Non-Legendary
## 19      1.5     Legendary   Non-Legendary
## 20      2.7 Non-Legendary       Legendary
## 21      1.7     Legendary   Non-Legendary
## 22      1.3 Non-Legendary       Legendary
## 23      3.0 Non-Legendary       Legendary
## 24      1.6 Non-Legendary       Legendary
## 25      1.2     Legendary   Non-Legendary
## 26      2.4     Legendary   Non-Legendary
## 27      1.0     Legendary   Non-Legendary

Well, that wraps up my analysis on legendary Pokemon. It was a fun one, and I’m excited to see where things go from here. This won’t be the last time I do an analytics project on Pokemon. The newest games, Pokemon Sword and Shield, are set to release in November of 2019. While there isn’t an official count on the number of new Pokemon, the developers have already stated that with this new iteration there will now be over one thousand Pokemon including two new legendary Pokemon, Zacian and Zamazenta. It’ll be interesting to see how the random forest model performs on these new additions.

Please check out my Tableau Story that outlines this project as well.