This article analyses all 801 Pokemon from 7 generations and answers the following questions:

  1. How does height and weight of a Pokemon correlate with its various base stats?
  2. Which type is the strongest overall? Which is the weakest?
  3. Which type is the most likely to be a legendary Pokemon?
  4. Can you build a Pokemon dream team? A team of 6 Pokemon that inflicts the most damage while remaining relatively impervious to any other team of 6 Pokemon.
  5. Is it possible to build a classifier to identify legendary Pokemon?

Dataset taken from Kaggle (https://www.kaggle.com/rounakbanik/pokemon)

1. How does height and weight of a Pokemon correlate with its various base stats?

Let us first read in the dataset and explore the data a little.

library(tidyverse)
pkmn <- read_csv("D:/wiley/pokemon.csv")
names(pkmn)
##  [1] "abilities"         "against_bug"       "against_dark"     
##  [4] "against_dragon"    "against_electric"  "against_fairy"    
##  [7] "against_fight"     "against_fire"      "against_flying"   
## [10] "against_ghost"     "against_grass"     "against_ground"   
## [13] "against_ice"       "against_normal"    "against_poison"   
## [16] "against_psychic"   "against_rock"      "against_steel"    
## [19] "against_water"     "attack"            "base_egg_steps"   
## [22] "base_happiness"    "base_total"        "capture_rate"     
## [25] "classfication"     "defense"           "experience_growth"
## [28] "height_m"          "hp"                "japanese_name"    
## [31] "name"              "percentage_male"   "pokedex_number"   
## [34] "sp_attack"         "sp_defense"        "speed"            
## [37] "type1"             "type2"             "weight_kg"        
## [40] "generation"        "is_legendary"

We can see that the dataset contains all 801 pokemon and 41 columns containing attributes for each pokemon. There are also some missing values.

sapply(pkmn, function(x) sum(is.na(x)))
##         abilities       against_bug      against_dark    against_dragon 
##                 0                 0                 0                 0 
##  against_electric     against_fairy     against_fight      against_fire 
##                 0                 0                 0                 0 
##    against_flying     against_ghost     against_grass    against_ground 
##                 0                 0                 0                 0 
##       against_ice    against_normal    against_poison   against_psychic 
##                 0                 0                 0                 0 
##      against_rock     against_steel     against_water            attack 
##                 0                 0                 0                 0 
##    base_egg_steps    base_happiness        base_total      capture_rate 
##                 0                 0                 0                 0 
##     classfication           defense experience_growth          height_m 
##                 0                 0                 0                20 
##                hp     japanese_name              name   percentage_male 
##                 0                 0                 0                98 
##    pokedex_number         sp_attack        sp_defense             speed 
##                 0                 0                 0                 0 
##             type1             type2         weight_kg        generation 
##                 0               384                20                 0 
##      is_legendary 
##                 0
pkmn[pkmn$name=="Rattata",]$height_m = 0.3
pkmn[pkmn$name=="Rattata",]$weight_kg = 3.5
pkmn[20, "height_m"] = 0.7
pkmn[20, "weight_kg"] = 18.5
pkmn[26, "height_m"] = 0.8
pkmn[26, "weight_kg"] = 30
pkmn[27, "height_m"] = 0.6
pkmn[27, "weight_kg"] = 12
pkmn[28, "height_m"] = 1
pkmn[28, "weight_kg"] = 29.5
pkmn[37, "height_m"] = 0.6
pkmn[37, "weight_kg"] = 9.9
pkmn[38, "height_m"] = 1.1
pkmn[38, "weight_kg"] = 19.9
pkmn[50, "height_m"] = 0.2
pkmn[50, "weight_kg"] = 0.8
pkmn[51, "height_m"] = 0.7
pkmn[51, "weight_kg"] = 33.3
pkmn[52, "height_m"] = 0.4
pkmn[52, "weight_kg"] = 4.2
pkmn[53, "height_m"] = 1
pkmn[53, "weight_kg"] = 32
pkmn[74, "height_m"] = 0.4
pkmn[74, "weight_kg"] = 20
pkmn[75, "height_m"] = 1
pkmn[75, "weight_kg"] = 105
pkmn[76, "height_m"] = 1.4
pkmn[76, "weight_kg"] = 300
pkmn[88, "height_m"] = 0.9
pkmn[88, "weight_kg"] = 30
pkmn[89, "height_m"] = 1.2
pkmn[89, "weight_kg"] = 30
pkmn[103, "height_m"] = 2
pkmn[103, "weight_kg"] = 120 
pkmn[105, "height_m"] = 1
pkmn[105, "weight_kg"] = 45
pkmn[720, "height_m"] = 0.5
pkmn[720, "weight_kg"] = 9
pkmn[745, "height_m"] = 0.8
pkmn[745, "weight_kg"] = 25
pkmn[774, "capture_rate"] = "30"

There are missing values in the height_m, weight_kg, percentage_male, and type2 columns. For percentage_male, the dataset description tells us that these pokemon do not have a gender, hence there is no need to fix this column. We also know that some pokemon only have 1 primary type, hence these would have NA as their second type. It is interesting to note that about half of all pokemon are single types. As for the missing heights and weights, their data can be easily found from the source of data, serebii.net. We will go ahead and manually search and impute these missing values.

Let us now investigate the relationships between height, weight and stats of the pokemon. We need the library “GGally” to help us do a pair plot. This may take several seconds to run.

library(GGally)
ggpairs(data=pkmn, columns = c("height_m", "weight_kg", "hp", "attack", "defense", "sp_attack", "sp_defense", "speed", "base_total"))

From the pair plots we observe that there isn’t a strong correlation between any stat in particular. However height and weight seems to be correlated, and this make sense as larger pokemon tend to be heavier/taller. All stats have a significant correlation with base_total, this is nothing surprising as base_total is simply the sum of all six stats. Interestingly, the distribution of base_total shows two peaks. Let’s investigate this further.

ggpairs(data=pkmn, columns = c("height_m", "weight_kg", "base_total"), mapping = aes(color=as.factor(is_legendary)))

As this analysis focuses largely on legendary pokemon, this plot shows us that the heights and weights cannot really differentiate whether a pokemon is legendary, but the base_total stats show a clear distinction - legendary pokemon have a much higher base_total than non-legendary ones. This will be useful in building our classifier later on.

2. Which type is the strongest overall? Which is the weakest?

In the world of pokemon there isn’t a standard definition of strong pokemon but there are some measures that give us a clue as to whether the game considers a pokemon strong:

pkmn %>% 
  group_by(type1) %>%
  summarise(
    avg_hp = mean(hp),
    avg_att = mean(attack),
    avg_def = mean(defense),
    avg_satt = mean(sp_attack),
    avg_sdef = mean(sp_defense),
    avg_spd = mean(speed)
  ) %>%
  gather(`avg_hp`,`avg_att`,`avg_def`,`avg_satt`,`avg_sdef`,`avg_spd`, key="attribute", value="value") %>%
  ggplot() +
      geom_col(aes(x=type1, fill=attribute, y=value), position='dodge', width=0.5)

pkmn %>% 
  group_by(type1) %>%
  summarise(
    avg_base = mean(base_total),
  ) %>%
    ggplot() +
      geom_col(aes(x=type1, y=avg_base), width=0.7)

The first plot compares the 6 different attributes amongst the primary types. We see that Dragon types have the highest offensive power, Flying types have the fastest speed while Steel types have extremely high defense. Looking at the second plot and considering all stats, we notice that Dragon and Steel types have the highest average base stats while Bug types have the lowest.

pkmn$capture_rate = as.numeric(pkmn$capture_rate)

pkmn %>% 
  group_by(type1) %>%
  summarise(
    avg_capture = mean(capture_rate)
  ) %>%
  ggplot() +
  geom_col(aes(x=type1, y=avg_capture), width=0.7) 

In this dataset, capture_rate is coded as a character variable and we should first convert it to a numeric variable. From the plot, we see that Dragon and Steel types have the lowest capture rates and this is an indication that they are stronger in the game. On the other hand, Bug and Poison types have the highest capture rates.

A simple way to aggregate resistance for each pokemon is to add up the 18 columns of “against_X”. Since there are a total of 18 different types of attacks, a pokemon which takes neutral damage from all will have a total resistance of 18. Hence, the lower the total resistance, the more defensive a pokemon is, since it will take lesser damage from attacks in general.

pkmn = pkmn %>% 
  mutate(tot_resist = against_bug+against_dark+against_dragon+against_electric+against_fairy+against_fight+against_fire+against_flying+against_ghost+against_grass+against_ground+against_ice+against_normal+against_poison+against_psychic+against_rock+against_steel+against_water)

Analysing resistance is a little more complex as we are unable to aggregate by type, since pokemon typing can synergise with each other to confer different resistances to a pokemon. We can however perform some other analyses to have a rough sense.

boxplot(pkmn$tot_resist)

pkmn %>%
  arrange(min_rank(tot_resist)) %>% 
  dplyr::select(name, tot_resist, type1, type2) %>%
  print(n=20)
## # A tibble: 801 x 4
##    name       tot_resist type1    type2   
##    <chr>           <dbl> <chr>    <chr>   
##  1 Mawile           13.2 steel    fairy   
##  2 Klefki           13.2 steel    fairy   
##  3 Magearna         13.2 steel    fairy   
##  4 Skarmory         13.5 steel    flying  
##  5 Celesteela       13.5 steel    flying  
##  6 Dialga           14.2 steel    dragon  
##  7 Honedge          14.2 steel    ghost   
##  8 Doublade         14.2 steel    ghost   
##  9 Aegislash        14.2 steel    ghost   
## 10 Empoleon         14.5 water    steel   
## 11 Registeel        15   steel    <NA>    
## 12 Klink            15   steel    <NA>    
## 13 Klang            15   steel    <NA>    
## 14 Klinklang        15   steel    <NA>    
## 15 Sableye          15.5 dark     ghost   
## 16 Spiritomb        15.5 ghost    dark    
## 17 Lucario          15.5 fighting steel   
## 18 Cobalion         15.5 steel    fighting
## 19 Forretress       15.8 bug      steel   
## 20 Steelix          15.8 steel    ground  
## # ... with 781 more rows
pkmn %>%
  arrange(min_rank(desc(tot_resist))) %>% 
  dplyr::select(name, tot_resist, type1, type2) %>%
  print(n=20)
## # A tibble: 801 x 4
##    name      tot_resist type1  type2  
##    <chr>          <dbl> <chr>  <chr>  
##  1 Amaura          26   rock   ice    
##  2 Aurorus         26   rock   ice    
##  3 Paras           25   bug    grass  
##  4 Parasect        25   bug    grass  
##  5 Wormadam        25   bug    grass  
##  6 Snover          25   grass  ice    
##  7 Abomasnow       25   grass  ice    
##  8 Sewaddle        25   bug    grass  
##  9 Swadloon        25   bug    grass  
## 10 Leavanny        25   bug    grass  
## 11 Geodude         24.2 rock   ground 
## 12 Graveler        24.2 rock   ground 
## 13 Golem           24.2 rock   ground 
## 14 Onix            24.2 rock   ground 
## 15 Rhyhorn         24.2 ground rock   
## 16 Rhydon          24.2 ground rock   
## 17 Larvitar        24.2 rock   ground 
## 18 Pupitar         24.2 rock   ground 
## 19 Rhyperior       24.2 ground rock   
## 20 Exeggcute       24   grass  psychic
## # ... with 781 more rows

The boxplot shows that most pokemon have total resistances of 18 to 20. Ordering the pokemon by their total resistances reveals that the top 20 pokemon with the highest resistance (lowest tot_resist value) overwhelmingly have Steel as one of their types. On the other hand, pokemon with either Rock, Ground, Bug, Ice or Grass often take increased damage and are thus weaker.

In conclusion, from the three indicators we can tell that Dragon and Steel types are generally stronger while Bug types are weaker.

3. Which type is the most likely to be a legendary Pokemon?

Another way to look at the question will be - which typing has the highest proportion of legendary pokemon? To answer this we will need to consider both the primary and secondary typing of the pokemon.

table_1 = table(pkmn$is_legendary, pkmn$type1)
table_1
##    
##     bug dark dragon electric fairy fighting fire flying ghost grass ground
##   0  69   26     20       34    17       28   47      2    26    74     30
##   1   3    3      7        5     1        0    5      1     1     4      2
##    
##     ice normal poison psychic rock steel water
##   0  21    102     32      36   41    18   108
##   1   2      3      0      17    4     6     6
table_2 = table(pkmn$is_legendary, pkmn$type2)
table_2
##    
##     bug dark dragon electric fairy fighting fire flying ghost grass ground
##   0   5   21     13        8    23       19   11     85    12    18     33
##   1   0    0      4        1     6        6    2     10     2     2      1
##    
##     ice normal poison psychic rock steel water
##   0  14      4     33      25   14    18    16
##   1   1      0      1       4    0     4     1
table_3 = table_1 + table_2
table_3
##    
##     bug dark dragon electric fairy fighting fire flying ghost grass ground
##   0  74   47     33       42    40       47   58     87    38    92     63
##   1   3    3     11        6     7        6    7     11     3     6      3
##    
##     ice normal poison psychic rock steel water
##   0  35    106     65      61   55    36   124
##   1   3      3      1      21    4    10     7

To interpret table_3, we can say that a Bug type pokemon has 3/77 chance to be legendary. Let’s do this visually:

table_3 = as.data.frame(table_1 + table_2)
colnames(table_3) = c("is_legendary", "type", "freq")

table_3 = table_3%>%
  mutate(is_legendary = ifelse(is_legendary==0, "no", "yes")) %>%
  spread(key=is_legendary, value=freq) %>%
  mutate(proportion=yes/no)
table_3
ggplot(table_3) + 
  geom_col(aes(type, proportion))

The diagram makes it clear that Psychic and Dragon types are the mostly likely to be legendary. In other words, if a pokemon has Psychic as one of its types, it has a 34% chance to be legendary.

4. Can you build a Pokemon dream team? A team of 6 Pokemon that inflicts the most damage while remaining relatively impervious to damage.

In the world of Pokemon, having a “dream team” is more than just assembling 6 “star” pokemon. There are many other factors that affect the performance of a team. For example, abilities and moveset can drastically alter a pokemon’s effectiveness. Some pokemon have specific roles in a team (eg. being a damage dealer, a support or a shield) while others synergise together such that their combined effect is greater than their individual prowess.

However, for the sake of analysis, we shall only focus on each pokemon’s individual attributes; we aim to find pokemon that are all-rounders. Looking at the columns we have, we can analyse the base_total as well as the tot_resist columns. Doing a scatter plot yields the results:

ggplot(pkmn) +
  geom_point(aes(base_total, tot_resist, color=is_legendary))

We notice that the points roughly form a circular shape and we should be looking for those which reside around the lower-right corner, where base_total is high and tot_resist is low. Interestingly, we note that while most legendary pokemon have high base_total stats, their resistances toward attack types are largely varied. From our previous typing analysis, we know that Steel types are overwhelmingly resistant against attacks.

pkmn %>%
  filter(base_total>=600, tot_resist<17)%>%
  arrange(desc(base_total))%>%
  dplyr::select(name, base_total, type1, type2, tot_resist)

Going by our lower-right selection criteria, we can visually identify the 6 pokemon: Metagross, Dialga, Solgaleo, Magearna, Lucario, Steelix/Genesect. Once again, we see that all 6 have Steel as one of their types.

This approach of selection ignores team synergy and while these pokemon are individually strong, any experienced pokemon player will immediately point out that such a team will be decimated easily. This is because all of them have a Steel typing and are thus weak to Fire, Ground and Fighting attacks. A good team should ideally consists of pokemon which are able to cover for one another’s weaknesses. Hence, for every possible attack type the team should have at least one pokemon which is able to resist it. We will use a trial and error approach here:

  1. Select a random team of 6 pokemon
  2. Calculate the team’s average base_total (the higher the better)
  3. Check that for each against_type we have at least one pokemon that resists it
  4. Initiate a counter that increases by one for every against_type that the team is unable to resist (we assume that taking normal damage is acceptable)
  5. Sum the counter for the team (the lower the better. A score of 0 indicates that all attacking types are weak against at least one pokemon in the team)
  6. Store the 6 pokemon together with the team’s average base_total and counter in a dataframe
  7. Repeat steps 1 - 6 to find the best team
teams = data.frame(
  p1 = 0,
  p2 = 0,
  p3 = 0,
  p4 = 0,
  p5 = 0,
  p6 = 0,
  resist = 0,
  stat = 0)

set.seed(565)
for (i in 1:10000) {
  team = pkmn %>%
    dplyr::select(33, 2:19, 23) %>%
    sample_n(6)
  
  team_resist = team %>% 
    dplyr::select(2:19) %>%
    t() %>%
    as.data.frame() %>%
    mutate(V7 = ifelse(
      V1 < 1, 0, ifelse(
        V2 < 1, 0, ifelse(
          V3 < 1, 0, ifelse(
            V4 < 1, 0, ifelse(
              V5 < 1, 0, ifelse(
                V6 < 1, 0, ifelse(
                  V1 == 1 & V2 == 1 & V3 == 1 & V4 == 1 & V5 == 1 & V6 == 1, 0, 1
                )
              )
            )
          )
        )
      )
    )
    )
  team_resist = sum(team_resist$V7)
  team_stat = mean(team$base_total)
  
  new = c(team$pokedex_number[1],
    team$pokedex_number[2],
    team$pokedex_number[3],
    team$pokedex_number[4],
    team$pokedex_number[5],
    team$pokedex_number[6],
    team_resist,
    team_stat
  )
  
  teams = rbind(teams, new)
}

plot(teams$resist, teams$stat)

Selecting the top 10 teams by base_total shows us the following results:

teams %>%
  filter(resist == 0, stat > 0) %>%
  arrange(desc(stat)) %>%
  top_n(10)
pkmn %>%
  slice(612, 635, 681, 254, 485, 382) %>%
  dplyr::select(name, type1, type2, base_total, tot_resist, is_legendary)

This team is clearly more all-rounded and is able resist all types of attacks. We have 2 legendary pokemons in the mix and quite a diversity in terms of typing, stats and resistances. It is also important to note that this is a very rudimentary method of selection and ignores many other forms of synergy such as items, abilities, natures and movesets which can vastly affect the effectiveness of a team. The sample space of the search is also extremely small compared to all the combinations out there. We can take a look at one of the top teams in the Masters Division of the 2017 World Pokemon Championships.

pkmn %>%
  slice(373,445,787,186,785,376) %>%
  dplyr::select(name, type1, type2, base_total, tot_resist, is_legendary)

A pretty solid team indeed! While the described analysis will definitely not outperform a professional team, it serves as a good starting point for a novice looking to try out different combinations.

5. Is it possible to build a classifier to identify legendary Pokemon?

We have a whole bunch of features to use but clearly not all of them are useful. We have the following features:

names(pkmn)
##  [1] "abilities"         "against_bug"       "against_dark"     
##  [4] "against_dragon"    "against_electric"  "against_fairy"    
##  [7] "against_fight"     "against_fire"      "against_flying"   
## [10] "against_ghost"     "against_grass"     "against_ground"   
## [13] "against_ice"       "against_normal"    "against_poison"   
## [16] "against_psychic"   "against_rock"      "against_steel"    
## [19] "against_water"     "attack"            "base_egg_steps"   
## [22] "base_happiness"    "base_total"        "capture_rate"     
## [25] "classfication"     "defense"           "experience_growth"
## [28] "height_m"          "hp"                "japanese_name"    
## [31] "name"              "percentage_male"   "pokedex_number"   
## [34] "sp_attack"         "sp_defense"        "speed"            
## [37] "type1"             "type2"             "weight_kg"        
## [40] "generation"        "is_legendary"      "tot_resist"

We can condense the against_type columns into tot_resist and drop the individual stat columns as these will be highly correlated with base_total. As height and weight are also correlated, we will use only one of them. Features like classfication, name and pokedex_number are unique to each pokemon and do not offer extra information. generation is also not useful as every generation has a certain number of legendary pokemon. Pokemon abilities and typing may possibly be a predictor of legendary status but we will first have to manipulate them in order to use it in our model later on.

To make things simple, we will first identify the abilities of legendary pokemon and determine the most common words in the abilities.

library(textclean)
pkmn$abilities = strip(pkmn$abilities, char.keep=NULL, apostrophe.remove = T)

x = pkmn %>%
  filter(is_legendary=="1") %>%
  dplyr::select(abilities)
x = x$abilities
x = str_c(x, collapse=" ")
x = unlist(str_split(x, " "))
sort(table(x), decreasing=T)
## x
##    pressure        body       beast       boost    levitate   telepathy 
##          14           8           7           7           7           7 
##       clear   justified       surge        aura       focus       grace 
##           4           4           4           3           3           3 
##       inner      serene      absorb        cure     defiant       flame 
##           3           3           2           2           2           2 
##       force       metal     natural   prankster regenerator      sturdy 
##           2           2           2           2           2           2 
##    teravolt  turboblaze         air       armor         bad       break 
##           2           2           1           1           1           1 
##       cloak   construct        dark    download      dreams     drizzle 
##           1           1           1           1           1           1 
##     drought    electric       fairy        fire       flash        full 
##           1           1           1           1           1           1 
##      grassy   hydration         ice  intimidate       light        lock 
##           1           1           1           1           1           1 
##    magician       misty  multiscale   multitype       power       prism 
##           1           1           1           1           1           1 
##     psychic        sand      shadow       sheer      shield        slow 
##           1           1           1           1           1           1 
##        snow   soulheart        star       start      static synchronize 
##           1           1           1           1           1           1 
##     unaware     unnerve     victory        volt       water 
##           1           1           1           1           1
x = c("pressure", "body", "beast", "boost", "levitate", "telepathy")

We will then determine, for each pokemon, if its ability has a word that is one of these 6 words.

pkmn = pkmn %>%
  mutate(leg_abil = ifelse(word(abilities,1)%in%x|word(abilities,2)%in%x|word(abilities,3)%in%x|word(abilities,4)%in%x|word(abilities,5)%in%x|word(abilities,6)%in%x|word(abilities,7)%in%x|word(abilities,8)%in%x, "1", "0"))
pkmn$leg_abil[1:100]
##   [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
##  [18] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
##  [35] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
##  [52] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
##  [69] "0" "0" "0" "1" "1" "0" "0" "0" "1" "1" "0" "0" "0" "0" "0" "0" "0"
##  [86] "1" "1" "0" "0" "0" "0" "1" "1" "1" "0" "0" "0" "0" "0" "0"

For pokemon typing, we have previously established that Psychic and Dragon types have the highest proportion of legendary pokemon. We can create a column called type3 that indicates whether a pokemon has Psychic or Dragon as one of its typings.

pkmn[is.na(pkmn$type2), "type2"] = "None"
pkmn = pkmn %>%
  mutate(type3 = ifelse(type1=="psychic"|type2=="psychic"|type1=="dragon"|type2=="dragon", 1, 0))
pkmn$type3[1:100]
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0

We shall now remove drop columns which are not useful and convert the column types appropriately. We can also explore each feature and how it relates to whether a pokemon is legendary. This will give us an idea of whether a feature will be useful.

pkmn2 = pkmn[, c(21,22,23,24,27,39,41:44)]
pkmn2$capture_rate = as.numeric(pkmn2$capture_rate)
pkmn2$is_legendary = as.factor(pkmn2$is_legendary)
pkmn2$leg_abil = as.factor(pkmn2$leg_abil)
pkmn2$type3 = as.factor(pkmn2$type3)
str(pkmn2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    801 obs. of  10 variables:
##  $ base_egg_steps   : num  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
##  $ base_happiness   : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ base_total       : num  318 405 625 309 405 634 314 405 630 195 ...
##  $ capture_rate     : num  45 45 45 45 45 45 45 45 45 255 ...
##  $ experience_growth: num  1059860 1059860 1059860 1059860 1059860 ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ is_legendary     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tot_resist       : num  19.2 19.2 19.2 18 18 ...
##  $ leg_abil         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ type3            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, base_egg_steps))

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, base_happiness))

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, base_total))

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, capture_rate))

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, experience_growth)) 

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, weight_kg))

ggplot(pkmn2) +
  geom_boxplot(aes(is_legendary, tot_resist)) 

ggplot(pkmn2) +
  geom_bar(aes(is_legendary, fill=leg_abil)) 

ggplot(pkmn2) +
  geom_bar(aes(is_legendary, fill=type3))

From the boxplots we can see that base_egg_steps, base_total and capture_rate have a distinct separation between legendary and non-legendary pokemon and are likely to be useful in the model. Conversely, weight_kg and tot_resist are likely to be poor predictors as there is no distinction.

The next step before initialising our model is to visualise the correlation between our variables. This is important in identifying multicollinearity in our model.

#For continuous variables
library(heatmaply)
library(grDevices)
heatmaply(abs(cor(pkmn2[, c(1:6,8)])), dendrogram='none', show_grid=T, colors=c("black","yellow"))
chisq.test(pkmn$is_legendary, pkmn$leg_abil)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  pkmn$is_legendary and pkmn$leg_abil
## X-squared = 59.951, df = 1, p-value = 9.724e-15
chisq.test(pkmn$is_legendary, pkmn$type3)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  pkmn$is_legendary and pkmn$type3
## X-squared = 41.674, df = 1, p-value = 1.079e-10
chisq.test(pkmn$type3, pkmn$leg_abil)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  pkmn$type3 and pkmn$leg_abil
## X-squared = 31.301, df = 1, p-value = 2.21e-08

The heatmap shows good results where most of the variables are unrelated, except for capture_rate and base_total. Hence one of these variables will likely be dropped during variable selection. As for the categorical variables, all p-values are extremely small, indicating a strong correlation between them.

We are now ready to build our classifier. Since this is a binary problem, we will use logistic regression as our model. In our dataset, we have only 70 legendary pokemon out of 801, therefore we should use stratified splitting into training and test sets.

set.seed(1234)
library(caret)
index = createDataPartition(pkmn2$is_legendary, p = 0.7, list=F)
train = pkmn2[index,]
test = pkmn2[-index,]

model0 = glm(is_legendary ~ 1, data=train, family=binomial(link='logit'))
model1 = glm(is_legendary ~ ., data=train, family = binomial(link='logit'))
summary(model1)
## 
## Call:
## glm(formula = is_legendary ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49426  -0.06723  -0.02697  -0.00382   2.55068  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.456e+01  5.907e+00  -2.464 0.013733 *  
## base_egg_steps     3.370e-04  9.069e-05   3.717 0.000202 ***
## base_happiness     3.637e-02  2.715e-02   1.340 0.180256    
## base_total         9.070e-03  4.210e-03   2.154 0.031205 *  
## capture_rate      -2.888e-02  9.033e-03  -3.197 0.001387 ** 
## experience_growth  5.104e-06  2.744e-06   1.860 0.062921 .  
## weight_kg         -3.979e-03  3.040e-03  -1.309 0.190652    
## tot_resist        -2.110e-01  2.189e-01  -0.964 0.335219    
## leg_abil1          2.666e+00  9.461e-01   2.817 0.004840 ** 
## type31             3.087e-01  9.954e-01   0.310 0.756463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.504  on 560  degrees of freedom
## Residual deviance:  52.458  on 551  degrees of freedom
## AIC: 72.458
## 
## Number of Fisher Scoring iterations: 9

We observe that some good predictors are base_egg_steps, base_total, capture_rate and leg_abil1. Next, we perform variable selection using Akaike’s Information Criterion (AIC).

step_model1 = step(model0, list(lower = formula(model0),
                                upper = formula(model1),
                                direction="forward",
                                trace=1)) 
## Start:  AIC=334.5
## is_legendary ~ 1
## 
##                     Df Deviance    AIC
## + base_egg_steps     1   104.07 108.07
## + capture_rate       1   148.09 152.09
## + base_total         1   177.60 181.60
## + experience_growth  1   271.51 275.51
## + leg_abil           1   296.84 300.84
## + base_happiness     1   297.02 301.02
## + weight_kg          1   303.57 307.57
## + type3              1   312.44 316.44
## + tot_resist         1   329.32 333.32
## <none>                   332.50 334.50
## 
## Step:  AIC=108.07
## is_legendary ~ base_egg_steps
## 
##                     Df Deviance    AIC
## + capture_rate       1    83.12  89.12
## + leg_abil           1    84.73  90.73
## + base_total         1    86.24  92.24
## + experience_growth  1    96.36 102.36
## + base_happiness     1   101.15 107.15
## <none>                   104.07 108.07
## + weight_kg          1   102.81 108.81
## + tot_resist         1   103.03 109.03
## + type3              1   104.06 110.06
## - base_egg_steps     1   332.50 334.50
## 
## Step:  AIC=89.12
## is_legendary ~ base_egg_steps + capture_rate
## 
##                     Df Deviance     AIC
## + leg_abil           1   65.359  73.359
## + experience_growth  1   75.372  83.372
## + base_total         1   75.470  83.470
## <none>                   83.123  89.123
## + tot_resist         1   81.446  89.446
## + base_happiness     1   81.553  89.553
## + type3              1   83.051  91.051
## + weight_kg          1   83.123  91.123
## - capture_rate       1  104.073 108.073
## - base_egg_steps     1  148.090 152.090
## 
## Step:  AIC=73.36
## is_legendary ~ base_egg_steps + capture_rate + leg_abil
## 
##                     Df Deviance     AIC
## + base_total         1   60.590  70.590
## + experience_growth  1   61.943  71.943
## + base_happiness     1   62.398  72.398
## <none>                   65.359  73.359
## + type3              1   64.687  74.687
## + tot_resist         1   64.932  74.932
## + weight_kg          1   65.158  75.158
## - leg_abil           1   83.123  89.123
## - capture_rate       1   84.725  90.725
## - base_egg_steps     1  142.967 148.967
## 
## Step:  AIC=70.59
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total
## 
##                     Df Deviance     AIC
## + experience_growth  1   57.467  69.467
## + base_happiness     1   57.982  69.982
## <none>                   60.590  70.590
## + weight_kg          1   58.935  70.935
## + tot_resist         1   60.173  72.173
## + type3              1   60.407  72.407
## - base_total         1   65.359  73.359
## - capture_rate       1   71.171  79.171
## - leg_abil           1   75.470  83.470
## - base_egg_steps     1  119.603 127.603
## 
## Step:  AIC=69.47
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total + 
##     experience_growth
## 
##                     Df Deviance     AIC
## + base_happiness     1   54.440  68.440
## <none>                   57.467  69.467
## + weight_kg          1   55.627  69.627
## - experience_growth  1   60.590  70.590
## + tot_resist         1   57.121  71.121
## + type3              1   57.313  71.313
## - base_total         1   61.943  71.943
## - capture_rate       1   67.646  77.646
## - leg_abil           1   69.086  79.086
## - base_egg_steps     1  109.570 119.570
## 
## Step:  AIC=68.44
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total + 
##     experience_growth + base_happiness
## 
##                     Df Deviance     AIC
## <none>                   54.440  68.440
## + weight_kg          1   53.437  69.437
## - base_happiness     1   57.467  69.467
## - experience_growth  1   57.982  69.982
## + tot_resist         1   53.999  69.999
## - base_total         1   58.293  70.293
## + type3              1   54.436  70.436
## - capture_rate       1   64.193  76.193
## - leg_abil           1   65.880  77.880
## - base_egg_steps     1  108.722 120.722
step_model2 = step(model1, list(lower = formula(model0),
                                upper = formula(model1),
                                direction="backward",
                                trace=0))
## Start:  AIC=72.46
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + tot_resist + 
##     leg_abil + type3
## 
##                     Df Deviance     AIC
## - type3              1   52.553  70.553
## - tot_resist         1   53.431  71.431
## - weight_kg          1   53.994  71.994
## <none>                   52.458  72.458
## - base_happiness     1   54.725  72.725
## - experience_growth  1   56.113  74.113
## - base_total         1   57.697  75.697
## - leg_abil           1   62.505  80.505
## - capture_rate       1   63.273  81.273
## - base_egg_steps     1  106.984 124.984
## 
## Step:  AIC=70.55
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + tot_resist + 
##     leg_abil
## 
##                     Df Deviance     AIC
## - tot_resist         1   53.437  69.437
## - weight_kg          1   53.999  69.999
## <none>                   52.553  70.553
## - base_happiness     1   54.734  70.734
## - experience_growth  1   56.169  72.169
## + type3              1   52.458  72.458
## - base_total         1   57.731  73.731
## - leg_abil           1   62.513  78.513
## - capture_rate       1   63.307  79.307
## - base_egg_steps     1  107.453 123.453
## 
## Step:  AIC=69.44
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + leg_abil
## 
##                     Df Deviance     AIC
## - weight_kg          1   54.440  68.440
## <none>                   53.437  69.437
## - base_happiness     1   55.627  69.627
## + tot_resist         1   52.553  70.553
## - experience_growth  1   56.913  70.913
## + type3              1   53.431  71.431
## - base_total         1   58.262  72.262
## - capture_rate       1   63.988  77.988
## - leg_abil           1   65.533  79.533
## - base_egg_steps     1  108.123 122.123
## 
## Step:  AIC=68.44
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + leg_abil
## 
##                     Df Deviance     AIC
## <none>                   54.440  68.440
## + weight_kg          1   53.437  69.437
## - base_happiness     1   57.467  69.467
## - experience_growth  1   57.982  69.982
## + tot_resist         1   53.999  69.999
## - base_total         1   58.293  70.293
## + type3              1   54.436  70.436
## - capture_rate       1   64.193  76.193
## - leg_abil           1   65.880  77.880
## - base_egg_steps     1  108.722 120.722
step_model3 = step(model1, list(lower = formula(model0),
                                upper = formula(model1),
                                direction="both",
                                trace=0))
## Start:  AIC=72.46
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + tot_resist + 
##     leg_abil + type3
## 
##                     Df Deviance     AIC
## - type3              1   52.553  70.553
## - tot_resist         1   53.431  71.431
## - weight_kg          1   53.994  71.994
## <none>                   52.458  72.458
## - base_happiness     1   54.725  72.725
## - experience_growth  1   56.113  74.113
## - base_total         1   57.697  75.697
## - leg_abil           1   62.505  80.505
## - capture_rate       1   63.273  81.273
## - base_egg_steps     1  106.984 124.984
## 
## Step:  AIC=70.55
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + tot_resist + 
##     leg_abil
## 
##                     Df Deviance     AIC
## - tot_resist         1   53.437  69.437
## - weight_kg          1   53.999  69.999
## <none>                   52.553  70.553
## - base_happiness     1   54.734  70.734
## - experience_growth  1   56.169  72.169
## + type3              1   52.458  72.458
## - base_total         1   57.731  73.731
## - leg_abil           1   62.513  78.513
## - capture_rate       1   63.307  79.307
## - base_egg_steps     1  107.453 123.453
## 
## Step:  AIC=69.44
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + weight_kg + leg_abil
## 
##                     Df Deviance     AIC
## - weight_kg          1   54.440  68.440
## <none>                   53.437  69.437
## - base_happiness     1   55.627  69.627
## + tot_resist         1   52.553  70.553
## - experience_growth  1   56.913  70.913
## + type3              1   53.431  71.431
## - base_total         1   58.262  72.262
## - capture_rate       1   63.988  77.988
## - leg_abil           1   65.533  79.533
## - base_egg_steps     1  108.123 122.123
## 
## Step:  AIC=68.44
## is_legendary ~ base_egg_steps + base_happiness + base_total + 
##     capture_rate + experience_growth + leg_abil
## 
##                     Df Deviance     AIC
## <none>                   54.440  68.440
## + weight_kg          1   53.437  69.437
## - base_happiness     1   57.467  69.467
## - experience_growth  1   57.982  69.982
## + tot_resist         1   53.999  69.999
## - base_total         1   58.293  70.293
## + type3              1   54.436  70.436
## - capture_rate       1   64.193  76.193
## - leg_abil           1   65.880  77.880
## - base_egg_steps     1  108.722 120.722

All 3 models yield the same 6 variables - base_egg_steps, base_happiness, base_total, capture_rate, experience_growth and leg_abil. We can now rebuild the model using these 6 features.

model2 = glm(is_legendary ~ base_egg_steps+base_happiness+base_total+capture_rate+experience_growth+leg_abil, data=train, family=binomial(link='logit'))

summary(model2)
## 
## Call:
## glm(formula = is_legendary ~ base_egg_steps + base_happiness + 
##     base_total + capture_rate + experience_growth + leg_abil, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.30903  -0.07574  -0.03740  -0.00537   2.79209  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.741e+01  4.510e+00  -3.860 0.000114 ***
## base_egg_steps     3.265e-04  8.291e-05   3.938 8.23e-05 ***
## base_happiness     3.583e-02  2.437e-02   1.470 0.141564    
## base_total         7.133e-03  3.935e-03   1.813 0.069862 .  
## capture_rate      -2.568e-02  8.712e-03  -2.948 0.003195 ** 
## experience_growth  4.787e-06  2.591e-06   1.848 0.064671 .  
## leg_abil1          2.725e+00  9.178e-01   2.969 0.002989 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.50  on 560  degrees of freedom
## Residual deviance:  54.44  on 554  degrees of freedom
## AIC: 68.44
## 
## Number of Fisher Scoring iterations: 9
library(HH)
vif(model2)
##    base_egg_steps    base_happiness        base_total      capture_rate 
##          1.591137          1.239335          2.452600          2.151801 
## experience_growth         leg_abil1 
##          1.199614          1.065231

Variance Inflation Factor(VIF) is one way to detect multicollinearity - large values indicate imprecise coefficients estimates and are thus bad for prediction power of the model. Generally, VIF values of 5 and below are considered acceptable.

Prediction and scoring:

library(InformationValue)
library(MLmetrics)
library(ROCR)
pred.train = predict(model2, train[,-7], type='response')
(optC = optimalCutoff(train$is_legendary, pred.train)) #0.210)
## [1] 0.2099229
pred.test = predict(model2, test[,-7], type='response')
pred.test2 = ifelse(pred.test<optC, 0, 1)
table(test$is_legendary, pred.test2, dnn=c("actual", "predicted"))
##       predicted
## actual   0   1
##      0 217   2
##      1   0  21
(score = mean(test$is_legendary==pred.test2)) #0.992
## [1] 0.9916667
(F1 = F1_Score(test$is_legendary, pred.test2)) #0.995
## [1] 0.9954128
pr <- prediction(pred.test, test$is_legendary)
perf <- performance(pr, measure = "tpr", x.measure = "fpr")
(auc <- performance(pr, measure = "auc")) #0.991
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.9991302
## 
## 
## Slot "alpha.values":
## list()
plot(perf)

The OptimalCutoff function is a very useful tool to help us identify the threshold for separation. While the standard cutoff is 0.5, this sometimes may not be accurate. The function optimises the threshold by minimising the misclassification error. In this case, 0.21 is our threshold. Applying this threshold in predicting the test dataset yields excellent result. The model managed to accurately predict all 21 legendary pokemon while only misclassifying 2 non-legendary ones.

The score function tells us what percentage of the data did the model accurately predict, but this can be misleading if the classes are unbalanced. For example, a model which always predicts “0” will naturally yield a fantastic 90% score! However this model is clearly useless as it is unable to predict the “1” cases. Therefore, we use the F1 scoring function which takes into account the proportion of each class that was correctly identified. In this case, the model which always predicts “0” would score poorly on the F1 metric.

Hence, our logistic regression model has been very robust and the features provided in the dataset allowed us to classify legendary pokemon with pinpoint accuracy. A disadvantage of this model is that it does not tell us clearly which feature is important in prediction. We can build a decision tree model to help us elucidate better.

library(rpart)
library(rpart.plot)
library(factoextra)
set.seed(1234)
tree = rpart(is_legendary ~ base_egg_steps+base_happiness+base_total+capture_rate+experience_growth+leg_abil, data=pkmn2, method='class')
rpart.plot(tree)

ggplot(pkmn2) +
  geom_point(aes(base_egg_steps, capture_rate, color=is_legendary))

The decision tree model tells us that base_egg_steps and capture_rate can very well classify legendary pokemon. This is evident from the scatter plot where non-legendary and legendary pokemon are neatly separated just by two features. This separation can even be cleaner with extra features.