This article analyses all 801 Pokemon from 7 generations and answers the following questions:
Dataset taken from Kaggle (https://www.kaggle.com/rounakbanik/pokemon)
Let us first read in the dataset and explore the data a little.
library(tidyverse)
pkmn <- read_csv("D:/wiley/pokemon.csv")
names(pkmn)
## [1] "abilities" "against_bug" "against_dark"
## [4] "against_dragon" "against_electric" "against_fairy"
## [7] "against_fight" "against_fire" "against_flying"
## [10] "against_ghost" "against_grass" "against_ground"
## [13] "against_ice" "against_normal" "against_poison"
## [16] "against_psychic" "against_rock" "against_steel"
## [19] "against_water" "attack" "base_egg_steps"
## [22] "base_happiness" "base_total" "capture_rate"
## [25] "classfication" "defense" "experience_growth"
## [28] "height_m" "hp" "japanese_name"
## [31] "name" "percentage_male" "pokedex_number"
## [34] "sp_attack" "sp_defense" "speed"
## [37] "type1" "type2" "weight_kg"
## [40] "generation" "is_legendary"
We can see that the dataset contains all 801 pokemon and 41 columns containing attributes for each pokemon. There are also some missing values.
sapply(pkmn, function(x) sum(is.na(x)))
## abilities against_bug against_dark against_dragon
## 0 0 0 0
## against_electric against_fairy against_fight against_fire
## 0 0 0 0
## against_flying against_ghost against_grass against_ground
## 0 0 0 0
## against_ice against_normal against_poison against_psychic
## 0 0 0 0
## against_rock against_steel against_water attack
## 0 0 0 0
## base_egg_steps base_happiness base_total capture_rate
## 0 0 0 0
## classfication defense experience_growth height_m
## 0 0 0 20
## hp japanese_name name percentage_male
## 0 0 0 98
## pokedex_number sp_attack sp_defense speed
## 0 0 0 0
## type1 type2 weight_kg generation
## 0 384 20 0
## is_legendary
## 0
pkmn[pkmn$name=="Rattata",]$height_m = 0.3
pkmn[pkmn$name=="Rattata",]$weight_kg = 3.5
pkmn[20, "height_m"] = 0.7
pkmn[20, "weight_kg"] = 18.5
pkmn[26, "height_m"] = 0.8
pkmn[26, "weight_kg"] = 30
pkmn[27, "height_m"] = 0.6
pkmn[27, "weight_kg"] = 12
pkmn[28, "height_m"] = 1
pkmn[28, "weight_kg"] = 29.5
pkmn[37, "height_m"] = 0.6
pkmn[37, "weight_kg"] = 9.9
pkmn[38, "height_m"] = 1.1
pkmn[38, "weight_kg"] = 19.9
pkmn[50, "height_m"] = 0.2
pkmn[50, "weight_kg"] = 0.8
pkmn[51, "height_m"] = 0.7
pkmn[51, "weight_kg"] = 33.3
pkmn[52, "height_m"] = 0.4
pkmn[52, "weight_kg"] = 4.2
pkmn[53, "height_m"] = 1
pkmn[53, "weight_kg"] = 32
pkmn[74, "height_m"] = 0.4
pkmn[74, "weight_kg"] = 20
pkmn[75, "height_m"] = 1
pkmn[75, "weight_kg"] = 105
pkmn[76, "height_m"] = 1.4
pkmn[76, "weight_kg"] = 300
pkmn[88, "height_m"] = 0.9
pkmn[88, "weight_kg"] = 30
pkmn[89, "height_m"] = 1.2
pkmn[89, "weight_kg"] = 30
pkmn[103, "height_m"] = 2
pkmn[103, "weight_kg"] = 120
pkmn[105, "height_m"] = 1
pkmn[105, "weight_kg"] = 45
pkmn[720, "height_m"] = 0.5
pkmn[720, "weight_kg"] = 9
pkmn[745, "height_m"] = 0.8
pkmn[745, "weight_kg"] = 25
pkmn[774, "capture_rate"] = "30"
There are missing values in the height_m, weight_kg, percentage_male, and type2 columns. For percentage_male, the dataset description tells us that these pokemon do not have a gender, hence there is no need to fix this column. We also know that some pokemon only have 1 primary type, hence these would have NA as their second type. It is interesting to note that about half of all pokemon are single types. As for the missing heights and weights, their data can be easily found from the source of data, serebii.net. We will go ahead and manually search and impute these missing values.
Let us now investigate the relationships between height, weight and stats of the pokemon. We need the library “GGally” to help us do a pair plot. This may take several seconds to run.
library(GGally)
ggpairs(data=pkmn, columns = c("height_m", "weight_kg", "hp", "attack", "defense", "sp_attack", "sp_defense", "speed", "base_total"))
From the pair plots we observe that there isn’t a strong correlation between any stat in particular. However height and weight seems to be correlated, and this make sense as larger pokemon tend to be heavier/taller. All stats have a significant correlation with base_total, this is nothing surprising as base_total is simply the sum of all six stats. Interestingly, the distribution of base_total shows two peaks. Let’s investigate this further.
ggpairs(data=pkmn, columns = c("height_m", "weight_kg", "base_total"), mapping = aes(color=as.factor(is_legendary)))
As this analysis focuses largely on legendary pokemon, this plot shows us that the heights and weights cannot really differentiate whether a pokemon is legendary, but the base_total stats show a clear distinction - legendary pokemon have a much higher base_total than non-legendary ones. This will be useful in building our classifier later on.
In the world of pokemon there isn’t a standard definition of strong pokemon but there are some measures that give us a clue as to whether the game considers a pokemon strong:
pkmn %>%
group_by(type1) %>%
summarise(
avg_hp = mean(hp),
avg_att = mean(attack),
avg_def = mean(defense),
avg_satt = mean(sp_attack),
avg_sdef = mean(sp_defense),
avg_spd = mean(speed)
) %>%
gather(`avg_hp`,`avg_att`,`avg_def`,`avg_satt`,`avg_sdef`,`avg_spd`, key="attribute", value="value") %>%
ggplot() +
geom_col(aes(x=type1, fill=attribute, y=value), position='dodge', width=0.5)
pkmn %>%
group_by(type1) %>%
summarise(
avg_base = mean(base_total),
) %>%
ggplot() +
geom_col(aes(x=type1, y=avg_base), width=0.7)
The first plot compares the 6 different attributes amongst the primary types. We see that Dragon types have the highest offensive power, Flying types have the fastest speed while Steel types have extremely high defense. Looking at the second plot and considering all stats, we notice that Dragon and Steel types have the highest average base stats while Bug types have the lowest.
pkmn$capture_rate = as.numeric(pkmn$capture_rate)
pkmn %>%
group_by(type1) %>%
summarise(
avg_capture = mean(capture_rate)
) %>%
ggplot() +
geom_col(aes(x=type1, y=avg_capture), width=0.7)
In this dataset, capture_rate is coded as a character variable and we should first convert it to a numeric variable. From the plot, we see that Dragon and Steel types have the lowest capture rates and this is an indication that they are stronger in the game. On the other hand, Bug and Poison types have the highest capture rates.
A simple way to aggregate resistance for each pokemon is to add up the 18 columns of “against_X”. Since there are a total of 18 different types of attacks, a pokemon which takes neutral damage from all will have a total resistance of 18. Hence, the lower the total resistance, the more defensive a pokemon is, since it will take lesser damage from attacks in general.
pkmn = pkmn %>%
mutate(tot_resist = against_bug+against_dark+against_dragon+against_electric+against_fairy+against_fight+against_fire+against_flying+against_ghost+against_grass+against_ground+against_ice+against_normal+against_poison+against_psychic+against_rock+against_steel+against_water)
Analysing resistance is a little more complex as we are unable to aggregate by type, since pokemon typing can synergise with each other to confer different resistances to a pokemon. We can however perform some other analyses to have a rough sense.
boxplot(pkmn$tot_resist)
pkmn %>%
arrange(min_rank(tot_resist)) %>%
dplyr::select(name, tot_resist, type1, type2) %>%
print(n=20)
## # A tibble: 801 x 4
## name tot_resist type1 type2
## <chr> <dbl> <chr> <chr>
## 1 Mawile 13.2 steel fairy
## 2 Klefki 13.2 steel fairy
## 3 Magearna 13.2 steel fairy
## 4 Skarmory 13.5 steel flying
## 5 Celesteela 13.5 steel flying
## 6 Dialga 14.2 steel dragon
## 7 Honedge 14.2 steel ghost
## 8 Doublade 14.2 steel ghost
## 9 Aegislash 14.2 steel ghost
## 10 Empoleon 14.5 water steel
## 11 Registeel 15 steel <NA>
## 12 Klink 15 steel <NA>
## 13 Klang 15 steel <NA>
## 14 Klinklang 15 steel <NA>
## 15 Sableye 15.5 dark ghost
## 16 Spiritomb 15.5 ghost dark
## 17 Lucario 15.5 fighting steel
## 18 Cobalion 15.5 steel fighting
## 19 Forretress 15.8 bug steel
## 20 Steelix 15.8 steel ground
## # ... with 781 more rows
pkmn %>%
arrange(min_rank(desc(tot_resist))) %>%
dplyr::select(name, tot_resist, type1, type2) %>%
print(n=20)
## # A tibble: 801 x 4
## name tot_resist type1 type2
## <chr> <dbl> <chr> <chr>
## 1 Amaura 26 rock ice
## 2 Aurorus 26 rock ice
## 3 Paras 25 bug grass
## 4 Parasect 25 bug grass
## 5 Wormadam 25 bug grass
## 6 Snover 25 grass ice
## 7 Abomasnow 25 grass ice
## 8 Sewaddle 25 bug grass
## 9 Swadloon 25 bug grass
## 10 Leavanny 25 bug grass
## 11 Geodude 24.2 rock ground
## 12 Graveler 24.2 rock ground
## 13 Golem 24.2 rock ground
## 14 Onix 24.2 rock ground
## 15 Rhyhorn 24.2 ground rock
## 16 Rhydon 24.2 ground rock
## 17 Larvitar 24.2 rock ground
## 18 Pupitar 24.2 rock ground
## 19 Rhyperior 24.2 ground rock
## 20 Exeggcute 24 grass psychic
## # ... with 781 more rows
The boxplot shows that most pokemon have total resistances of 18 to 20. Ordering the pokemon by their total resistances reveals that the top 20 pokemon with the highest resistance (lowest tot_resist value) overwhelmingly have Steel as one of their types. On the other hand, pokemon with either Rock, Ground, Bug, Ice or Grass often take increased damage and are thus weaker.
In conclusion, from the three indicators we can tell that Dragon and Steel types are generally stronger while Bug types are weaker.
Another way to look at the question will be - which typing has the highest proportion of legendary pokemon? To answer this we will need to consider both the primary and secondary typing of the pokemon.
table_1 = table(pkmn$is_legendary, pkmn$type1)
table_1
##
## bug dark dragon electric fairy fighting fire flying ghost grass ground
## 0 69 26 20 34 17 28 47 2 26 74 30
## 1 3 3 7 5 1 0 5 1 1 4 2
##
## ice normal poison psychic rock steel water
## 0 21 102 32 36 41 18 108
## 1 2 3 0 17 4 6 6
table_2 = table(pkmn$is_legendary, pkmn$type2)
table_2
##
## bug dark dragon electric fairy fighting fire flying ghost grass ground
## 0 5 21 13 8 23 19 11 85 12 18 33
## 1 0 0 4 1 6 6 2 10 2 2 1
##
## ice normal poison psychic rock steel water
## 0 14 4 33 25 14 18 16
## 1 1 0 1 4 0 4 1
table_3 = table_1 + table_2
table_3
##
## bug dark dragon electric fairy fighting fire flying ghost grass ground
## 0 74 47 33 42 40 47 58 87 38 92 63
## 1 3 3 11 6 7 6 7 11 3 6 3
##
## ice normal poison psychic rock steel water
## 0 35 106 65 61 55 36 124
## 1 3 3 1 21 4 10 7
To interpret table_3, we can say that a Bug type pokemon has 3/77 chance to be legendary. Let’s do this visually:
table_3 = as.data.frame(table_1 + table_2)
colnames(table_3) = c("is_legendary", "type", "freq")
table_3 = table_3%>%
mutate(is_legendary = ifelse(is_legendary==0, "no", "yes")) %>%
spread(key=is_legendary, value=freq) %>%
mutate(proportion=yes/no)
table_3
ggplot(table_3) +
geom_col(aes(type, proportion))
The diagram makes it clear that Psychic and Dragon types are the mostly likely to be legendary. In other words, if a pokemon has Psychic as one of its types, it has a 34% chance to be legendary.
In the world of Pokemon, having a “dream team” is more than just assembling 6 “star” pokemon. There are many other factors that affect the performance of a team. For example, abilities and moveset can drastically alter a pokemon’s effectiveness. Some pokemon have specific roles in a team (eg. being a damage dealer, a support or a shield) while others synergise together such that their combined effect is greater than their individual prowess.
However, for the sake of analysis, we shall only focus on each pokemon’s individual attributes; we aim to find pokemon that are all-rounders. Looking at the columns we have, we can analyse the base_total as well as the tot_resist columns. Doing a scatter plot yields the results:
ggplot(pkmn) +
geom_point(aes(base_total, tot_resist, color=is_legendary))
We notice that the points roughly form a circular shape and we should be looking for those which reside around the lower-right corner, where base_total is high and tot_resist is low. Interestingly, we note that while most legendary pokemon have high base_total stats, their resistances toward attack types are largely varied. From our previous typing analysis, we know that Steel types are overwhelmingly resistant against attacks.
pkmn %>%
filter(base_total>=600, tot_resist<17)%>%
arrange(desc(base_total))%>%
dplyr::select(name, base_total, type1, type2, tot_resist)
Going by our lower-right selection criteria, we can visually identify the 6 pokemon: Metagross, Dialga, Solgaleo, Magearna, Lucario, Steelix/Genesect. Once again, we see that all 6 have Steel as one of their types.
This approach of selection ignores team synergy and while these pokemon are individually strong, any experienced pokemon player will immediately point out that such a team will be decimated easily. This is because all of them have a Steel typing and are thus weak to Fire, Ground and Fighting attacks. A good team should ideally consists of pokemon which are able to cover for one another’s weaknesses. Hence, for every possible attack type the team should have at least one pokemon which is able to resist it. We will use a trial and error approach here:
base_total (the higher the better)against_type we have at least one pokemon that resists itagainst_type that the team is unable to resist (we assume that taking normal damage is acceptable)base_total and counter in a dataframeteams = data.frame(
p1 = 0,
p2 = 0,
p3 = 0,
p4 = 0,
p5 = 0,
p6 = 0,
resist = 0,
stat = 0)
set.seed(565)
for (i in 1:10000) {
team = pkmn %>%
dplyr::select(33, 2:19, 23) %>%
sample_n(6)
team_resist = team %>%
dplyr::select(2:19) %>%
t() %>%
as.data.frame() %>%
mutate(V7 = ifelse(
V1 < 1, 0, ifelse(
V2 < 1, 0, ifelse(
V3 < 1, 0, ifelse(
V4 < 1, 0, ifelse(
V5 < 1, 0, ifelse(
V6 < 1, 0, ifelse(
V1 == 1 & V2 == 1 & V3 == 1 & V4 == 1 & V5 == 1 & V6 == 1, 0, 1
)
)
)
)
)
)
)
)
team_resist = sum(team_resist$V7)
team_stat = mean(team$base_total)
new = c(team$pokedex_number[1],
team$pokedex_number[2],
team$pokedex_number[3],
team$pokedex_number[4],
team$pokedex_number[5],
team$pokedex_number[6],
team_resist,
team_stat
)
teams = rbind(teams, new)
}
plot(teams$resist, teams$stat)
Selecting the top 10 teams by base_total shows us the following results:
teams %>%
filter(resist == 0, stat > 0) %>%
arrange(desc(stat)) %>%
top_n(10)
pkmn %>%
slice(612, 635, 681, 254, 485, 382) %>%
dplyr::select(name, type1, type2, base_total, tot_resist, is_legendary)
This team is clearly more all-rounded and is able resist all types of attacks. We have 2 legendary pokemons in the mix and quite a diversity in terms of typing, stats and resistances. It is also important to note that this is a very rudimentary method of selection and ignores many other forms of synergy such as items, abilities, natures and movesets which can vastly affect the effectiveness of a team. The sample space of the search is also extremely small compared to all the combinations out there. We can take a look at one of the top teams in the Masters Division of the 2017 World Pokemon Championships.
pkmn %>%
slice(373,445,787,186,785,376) %>%
dplyr::select(name, type1, type2, base_total, tot_resist, is_legendary)
A pretty solid team indeed! While the described analysis will definitely not outperform a professional team, it serves as a good starting point for a novice looking to try out different combinations.
We have a whole bunch of features to use but clearly not all of them are useful. We have the following features:
names(pkmn)
## [1] "abilities" "against_bug" "against_dark"
## [4] "against_dragon" "against_electric" "against_fairy"
## [7] "against_fight" "against_fire" "against_flying"
## [10] "against_ghost" "against_grass" "against_ground"
## [13] "against_ice" "against_normal" "against_poison"
## [16] "against_psychic" "against_rock" "against_steel"
## [19] "against_water" "attack" "base_egg_steps"
## [22] "base_happiness" "base_total" "capture_rate"
## [25] "classfication" "defense" "experience_growth"
## [28] "height_m" "hp" "japanese_name"
## [31] "name" "percentage_male" "pokedex_number"
## [34] "sp_attack" "sp_defense" "speed"
## [37] "type1" "type2" "weight_kg"
## [40] "generation" "is_legendary" "tot_resist"
We can condense the against_type columns into tot_resist and drop the individual stat columns as these will be highly correlated with base_total. As height and weight are also correlated, we will use only one of them. Features like classfication, name and pokedex_number are unique to each pokemon and do not offer extra information. generation is also not useful as every generation has a certain number of legendary pokemon. Pokemon abilities and typing may possibly be a predictor of legendary status but we will first have to manipulate them in order to use it in our model later on.
To make things simple, we will first identify the abilities of legendary pokemon and determine the most common words in the abilities.
library(textclean)
pkmn$abilities = strip(pkmn$abilities, char.keep=NULL, apostrophe.remove = T)
x = pkmn %>%
filter(is_legendary=="1") %>%
dplyr::select(abilities)
x = x$abilities
x = str_c(x, collapse=" ")
x = unlist(str_split(x, " "))
sort(table(x), decreasing=T)
## x
## pressure body beast boost levitate telepathy
## 14 8 7 7 7 7
## clear justified surge aura focus grace
## 4 4 4 3 3 3
## inner serene absorb cure defiant flame
## 3 3 2 2 2 2
## force metal natural prankster regenerator sturdy
## 2 2 2 2 2 2
## teravolt turboblaze air armor bad break
## 2 2 1 1 1 1
## cloak construct dark download dreams drizzle
## 1 1 1 1 1 1
## drought electric fairy fire flash full
## 1 1 1 1 1 1
## grassy hydration ice intimidate light lock
## 1 1 1 1 1 1
## magician misty multiscale multitype power prism
## 1 1 1 1 1 1
## psychic sand shadow sheer shield slow
## 1 1 1 1 1 1
## snow soulheart star start static synchronize
## 1 1 1 1 1 1
## unaware unnerve victory volt water
## 1 1 1 1 1
x = c("pressure", "body", "beast", "boost", "levitate", "telepathy")
We will then determine, for each pokemon, if its ability has a word that is one of these 6 words.
pkmn = pkmn %>%
mutate(leg_abil = ifelse(word(abilities,1)%in%x|word(abilities,2)%in%x|word(abilities,3)%in%x|word(abilities,4)%in%x|word(abilities,5)%in%x|word(abilities,6)%in%x|word(abilities,7)%in%x|word(abilities,8)%in%x, "1", "0"))
pkmn$leg_abil[1:100]
## [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [18] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [35] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [52] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [69] "0" "0" "0" "1" "1" "0" "0" "0" "1" "1" "0" "0" "0" "0" "0" "0" "0"
## [86] "1" "1" "0" "0" "0" "0" "1" "1" "1" "0" "0" "0" "0" "0" "0"
For pokemon typing, we have previously established that Psychic and Dragon types have the highest proportion of legendary pokemon. We can create a column called type3 that indicates whether a pokemon has Psychic or Dragon as one of its typings.
pkmn[is.na(pkmn$type2), "type2"] = "None"
pkmn = pkmn %>%
mutate(type3 = ifelse(type1=="psychic"|type2=="psychic"|type1=="dragon"|type2=="dragon", 1, 0))
pkmn$type3[1:100]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
We shall now remove drop columns which are not useful and convert the column types appropriately. We can also explore each feature and how it relates to whether a pokemon is legendary. This will give us an idea of whether a feature will be useful.
pkmn2 = pkmn[, c(21,22,23,24,27,39,41:44)]
pkmn2$capture_rate = as.numeric(pkmn2$capture_rate)
pkmn2$is_legendary = as.factor(pkmn2$is_legendary)
pkmn2$leg_abil = as.factor(pkmn2$leg_abil)
pkmn2$type3 = as.factor(pkmn2$type3)
str(pkmn2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 801 obs. of 10 variables:
## $ base_egg_steps : num 5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
## $ base_happiness : num 70 70 70 70 70 70 70 70 70 70 ...
## $ base_total : num 318 405 625 309 405 634 314 405 630 195 ...
## $ capture_rate : num 45 45 45 45 45 45 45 45 45 255 ...
## $ experience_growth: num 1059860 1059860 1059860 1059860 1059860 ...
## $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
## $ is_legendary : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ tot_resist : num 19.2 19.2 19.2 18 18 ...
## $ leg_abil : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ type3 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, base_egg_steps))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, base_happiness))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, base_total))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, capture_rate))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, experience_growth))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, weight_kg))
ggplot(pkmn2) +
geom_boxplot(aes(is_legendary, tot_resist))
ggplot(pkmn2) +
geom_bar(aes(is_legendary, fill=leg_abil))
ggplot(pkmn2) +
geom_bar(aes(is_legendary, fill=type3))
From the boxplots we can see that base_egg_steps, base_total and capture_rate have a distinct separation between legendary and non-legendary pokemon and are likely to be useful in the model. Conversely, weight_kg and tot_resist are likely to be poor predictors as there is no distinction.
The next step before initialising our model is to visualise the correlation between our variables. This is important in identifying multicollinearity in our model.
#For continuous variables
library(heatmaply)
library(grDevices)
heatmaply(abs(cor(pkmn2[, c(1:6,8)])), dendrogram='none', show_grid=T, colors=c("black","yellow"))
chisq.test(pkmn$is_legendary, pkmn$leg_abil)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pkmn$is_legendary and pkmn$leg_abil
## X-squared = 59.951, df = 1, p-value = 9.724e-15
chisq.test(pkmn$is_legendary, pkmn$type3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pkmn$is_legendary and pkmn$type3
## X-squared = 41.674, df = 1, p-value = 1.079e-10
chisq.test(pkmn$type3, pkmn$leg_abil)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pkmn$type3 and pkmn$leg_abil
## X-squared = 31.301, df = 1, p-value = 2.21e-08
The heatmap shows good results where most of the variables are unrelated, except for capture_rate and base_total. Hence one of these variables will likely be dropped during variable selection. As for the categorical variables, all p-values are extremely small, indicating a strong correlation between them.
We are now ready to build our classifier. Since this is a binary problem, we will use logistic regression as our model. In our dataset, we have only 70 legendary pokemon out of 801, therefore we should use stratified splitting into training and test sets.
set.seed(1234)
library(caret)
index = createDataPartition(pkmn2$is_legendary, p = 0.7, list=F)
train = pkmn2[index,]
test = pkmn2[-index,]
model0 = glm(is_legendary ~ 1, data=train, family=binomial(link='logit'))
model1 = glm(is_legendary ~ ., data=train, family = binomial(link='logit'))
summary(model1)
##
## Call:
## glm(formula = is_legendary ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.49426 -0.06723 -0.02697 -0.00382 2.55068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.456e+01 5.907e+00 -2.464 0.013733 *
## base_egg_steps 3.370e-04 9.069e-05 3.717 0.000202 ***
## base_happiness 3.637e-02 2.715e-02 1.340 0.180256
## base_total 9.070e-03 4.210e-03 2.154 0.031205 *
## capture_rate -2.888e-02 9.033e-03 -3.197 0.001387 **
## experience_growth 5.104e-06 2.744e-06 1.860 0.062921 .
## weight_kg -3.979e-03 3.040e-03 -1.309 0.190652
## tot_resist -2.110e-01 2.189e-01 -0.964 0.335219
## leg_abil1 2.666e+00 9.461e-01 2.817 0.004840 **
## type31 3.087e-01 9.954e-01 0.310 0.756463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 332.504 on 560 degrees of freedom
## Residual deviance: 52.458 on 551 degrees of freedom
## AIC: 72.458
##
## Number of Fisher Scoring iterations: 9
We observe that some good predictors are base_egg_steps, base_total, capture_rate and leg_abil1. Next, we perform variable selection using Akaike’s Information Criterion (AIC).
step_model1 = step(model0, list(lower = formula(model0),
upper = formula(model1),
direction="forward",
trace=1))
## Start: AIC=334.5
## is_legendary ~ 1
##
## Df Deviance AIC
## + base_egg_steps 1 104.07 108.07
## + capture_rate 1 148.09 152.09
## + base_total 1 177.60 181.60
## + experience_growth 1 271.51 275.51
## + leg_abil 1 296.84 300.84
## + base_happiness 1 297.02 301.02
## + weight_kg 1 303.57 307.57
## + type3 1 312.44 316.44
## + tot_resist 1 329.32 333.32
## <none> 332.50 334.50
##
## Step: AIC=108.07
## is_legendary ~ base_egg_steps
##
## Df Deviance AIC
## + capture_rate 1 83.12 89.12
## + leg_abil 1 84.73 90.73
## + base_total 1 86.24 92.24
## + experience_growth 1 96.36 102.36
## + base_happiness 1 101.15 107.15
## <none> 104.07 108.07
## + weight_kg 1 102.81 108.81
## + tot_resist 1 103.03 109.03
## + type3 1 104.06 110.06
## - base_egg_steps 1 332.50 334.50
##
## Step: AIC=89.12
## is_legendary ~ base_egg_steps + capture_rate
##
## Df Deviance AIC
## + leg_abil 1 65.359 73.359
## + experience_growth 1 75.372 83.372
## + base_total 1 75.470 83.470
## <none> 83.123 89.123
## + tot_resist 1 81.446 89.446
## + base_happiness 1 81.553 89.553
## + type3 1 83.051 91.051
## + weight_kg 1 83.123 91.123
## - capture_rate 1 104.073 108.073
## - base_egg_steps 1 148.090 152.090
##
## Step: AIC=73.36
## is_legendary ~ base_egg_steps + capture_rate + leg_abil
##
## Df Deviance AIC
## + base_total 1 60.590 70.590
## + experience_growth 1 61.943 71.943
## + base_happiness 1 62.398 72.398
## <none> 65.359 73.359
## + type3 1 64.687 74.687
## + tot_resist 1 64.932 74.932
## + weight_kg 1 65.158 75.158
## - leg_abil 1 83.123 89.123
## - capture_rate 1 84.725 90.725
## - base_egg_steps 1 142.967 148.967
##
## Step: AIC=70.59
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total
##
## Df Deviance AIC
## + experience_growth 1 57.467 69.467
## + base_happiness 1 57.982 69.982
## <none> 60.590 70.590
## + weight_kg 1 58.935 70.935
## + tot_resist 1 60.173 72.173
## + type3 1 60.407 72.407
## - base_total 1 65.359 73.359
## - capture_rate 1 71.171 79.171
## - leg_abil 1 75.470 83.470
## - base_egg_steps 1 119.603 127.603
##
## Step: AIC=69.47
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total +
## experience_growth
##
## Df Deviance AIC
## + base_happiness 1 54.440 68.440
## <none> 57.467 69.467
## + weight_kg 1 55.627 69.627
## - experience_growth 1 60.590 70.590
## + tot_resist 1 57.121 71.121
## + type3 1 57.313 71.313
## - base_total 1 61.943 71.943
## - capture_rate 1 67.646 77.646
## - leg_abil 1 69.086 79.086
## - base_egg_steps 1 109.570 119.570
##
## Step: AIC=68.44
## is_legendary ~ base_egg_steps + capture_rate + leg_abil + base_total +
## experience_growth + base_happiness
##
## Df Deviance AIC
## <none> 54.440 68.440
## + weight_kg 1 53.437 69.437
## - base_happiness 1 57.467 69.467
## - experience_growth 1 57.982 69.982
## + tot_resist 1 53.999 69.999
## - base_total 1 58.293 70.293
## + type3 1 54.436 70.436
## - capture_rate 1 64.193 76.193
## - leg_abil 1 65.880 77.880
## - base_egg_steps 1 108.722 120.722
step_model2 = step(model1, list(lower = formula(model0),
upper = formula(model1),
direction="backward",
trace=0))
## Start: AIC=72.46
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + tot_resist +
## leg_abil + type3
##
## Df Deviance AIC
## - type3 1 52.553 70.553
## - tot_resist 1 53.431 71.431
## - weight_kg 1 53.994 71.994
## <none> 52.458 72.458
## - base_happiness 1 54.725 72.725
## - experience_growth 1 56.113 74.113
## - base_total 1 57.697 75.697
## - leg_abil 1 62.505 80.505
## - capture_rate 1 63.273 81.273
## - base_egg_steps 1 106.984 124.984
##
## Step: AIC=70.55
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + tot_resist +
## leg_abil
##
## Df Deviance AIC
## - tot_resist 1 53.437 69.437
## - weight_kg 1 53.999 69.999
## <none> 52.553 70.553
## - base_happiness 1 54.734 70.734
## - experience_growth 1 56.169 72.169
## + type3 1 52.458 72.458
## - base_total 1 57.731 73.731
## - leg_abil 1 62.513 78.513
## - capture_rate 1 63.307 79.307
## - base_egg_steps 1 107.453 123.453
##
## Step: AIC=69.44
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + leg_abil
##
## Df Deviance AIC
## - weight_kg 1 54.440 68.440
## <none> 53.437 69.437
## - base_happiness 1 55.627 69.627
## + tot_resist 1 52.553 70.553
## - experience_growth 1 56.913 70.913
## + type3 1 53.431 71.431
## - base_total 1 58.262 72.262
## - capture_rate 1 63.988 77.988
## - leg_abil 1 65.533 79.533
## - base_egg_steps 1 108.123 122.123
##
## Step: AIC=68.44
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + leg_abil
##
## Df Deviance AIC
## <none> 54.440 68.440
## + weight_kg 1 53.437 69.437
## - base_happiness 1 57.467 69.467
## - experience_growth 1 57.982 69.982
## + tot_resist 1 53.999 69.999
## - base_total 1 58.293 70.293
## + type3 1 54.436 70.436
## - capture_rate 1 64.193 76.193
## - leg_abil 1 65.880 77.880
## - base_egg_steps 1 108.722 120.722
step_model3 = step(model1, list(lower = formula(model0),
upper = formula(model1),
direction="both",
trace=0))
## Start: AIC=72.46
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + tot_resist +
## leg_abil + type3
##
## Df Deviance AIC
## - type3 1 52.553 70.553
## - tot_resist 1 53.431 71.431
## - weight_kg 1 53.994 71.994
## <none> 52.458 72.458
## - base_happiness 1 54.725 72.725
## - experience_growth 1 56.113 74.113
## - base_total 1 57.697 75.697
## - leg_abil 1 62.505 80.505
## - capture_rate 1 63.273 81.273
## - base_egg_steps 1 106.984 124.984
##
## Step: AIC=70.55
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + tot_resist +
## leg_abil
##
## Df Deviance AIC
## - tot_resist 1 53.437 69.437
## - weight_kg 1 53.999 69.999
## <none> 52.553 70.553
## - base_happiness 1 54.734 70.734
## - experience_growth 1 56.169 72.169
## + type3 1 52.458 72.458
## - base_total 1 57.731 73.731
## - leg_abil 1 62.513 78.513
## - capture_rate 1 63.307 79.307
## - base_egg_steps 1 107.453 123.453
##
## Step: AIC=69.44
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + weight_kg + leg_abil
##
## Df Deviance AIC
## - weight_kg 1 54.440 68.440
## <none> 53.437 69.437
## - base_happiness 1 55.627 69.627
## + tot_resist 1 52.553 70.553
## - experience_growth 1 56.913 70.913
## + type3 1 53.431 71.431
## - base_total 1 58.262 72.262
## - capture_rate 1 63.988 77.988
## - leg_abil 1 65.533 79.533
## - base_egg_steps 1 108.123 122.123
##
## Step: AIC=68.44
## is_legendary ~ base_egg_steps + base_happiness + base_total +
## capture_rate + experience_growth + leg_abil
##
## Df Deviance AIC
## <none> 54.440 68.440
## + weight_kg 1 53.437 69.437
## - base_happiness 1 57.467 69.467
## - experience_growth 1 57.982 69.982
## + tot_resist 1 53.999 69.999
## - base_total 1 58.293 70.293
## + type3 1 54.436 70.436
## - capture_rate 1 64.193 76.193
## - leg_abil 1 65.880 77.880
## - base_egg_steps 1 108.722 120.722
All 3 models yield the same 6 variables - base_egg_steps, base_happiness, base_total, capture_rate, experience_growth and leg_abil. We can now rebuild the model using these 6 features.
model2 = glm(is_legendary ~ base_egg_steps+base_happiness+base_total+capture_rate+experience_growth+leg_abil, data=train, family=binomial(link='logit'))
summary(model2)
##
## Call:
## glm(formula = is_legendary ~ base_egg_steps + base_happiness +
## base_total + capture_rate + experience_growth + leg_abil,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.30903 -0.07574 -0.03740 -0.00537 2.79209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.741e+01 4.510e+00 -3.860 0.000114 ***
## base_egg_steps 3.265e-04 8.291e-05 3.938 8.23e-05 ***
## base_happiness 3.583e-02 2.437e-02 1.470 0.141564
## base_total 7.133e-03 3.935e-03 1.813 0.069862 .
## capture_rate -2.568e-02 8.712e-03 -2.948 0.003195 **
## experience_growth 4.787e-06 2.591e-06 1.848 0.064671 .
## leg_abil1 2.725e+00 9.178e-01 2.969 0.002989 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 332.50 on 560 degrees of freedom
## Residual deviance: 54.44 on 554 degrees of freedom
## AIC: 68.44
##
## Number of Fisher Scoring iterations: 9
library(HH)
vif(model2)
## base_egg_steps base_happiness base_total capture_rate
## 1.591137 1.239335 2.452600 2.151801
## experience_growth leg_abil1
## 1.199614 1.065231
Variance Inflation Factor(VIF) is one way to detect multicollinearity - large values indicate imprecise coefficients estimates and are thus bad for prediction power of the model. Generally, VIF values of 5 and below are considered acceptable.
Prediction and scoring:
library(InformationValue)
library(MLmetrics)
library(ROCR)
pred.train = predict(model2, train[,-7], type='response')
(optC = optimalCutoff(train$is_legendary, pred.train)) #0.210)
## [1] 0.2099229
pred.test = predict(model2, test[,-7], type='response')
pred.test2 = ifelse(pred.test<optC, 0, 1)
table(test$is_legendary, pred.test2, dnn=c("actual", "predicted"))
## predicted
## actual 0 1
## 0 217 2
## 1 0 21
(score = mean(test$is_legendary==pred.test2)) #0.992
## [1] 0.9916667
(F1 = F1_Score(test$is_legendary, pred.test2)) #0.995
## [1] 0.9954128
pr <- prediction(pred.test, test$is_legendary)
perf <- performance(pr, measure = "tpr", x.measure = "fpr")
(auc <- performance(pr, measure = "auc")) #0.991
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.9991302
##
##
## Slot "alpha.values":
## list()
plot(perf)
The OptimalCutoff function is a very useful tool to help us identify the threshold for separation. While the standard cutoff is 0.5, this sometimes may not be accurate. The function optimises the threshold by minimising the misclassification error. In this case, 0.21 is our threshold. Applying this threshold in predicting the test dataset yields excellent result. The model managed to accurately predict all 21 legendary pokemon while only misclassifying 2 non-legendary ones.
The score function tells us what percentage of the data did the model accurately predict, but this can be misleading if the classes are unbalanced. For example, a model which always predicts “0” will naturally yield a fantastic 90% score! However this model is clearly useless as it is unable to predict the “1” cases. Therefore, we use the F1 scoring function which takes into account the proportion of each class that was correctly identified. In this case, the model which always predicts “0” would score poorly on the F1 metric.
Hence, our logistic regression model has been very robust and the features provided in the dataset allowed us to classify legendary pokemon with pinpoint accuracy. A disadvantage of this model is that it does not tell us clearly which feature is important in prediction. We can build a decision tree model to help us elucidate better.
library(rpart)
library(rpart.plot)
library(factoextra)
set.seed(1234)
tree = rpart(is_legendary ~ base_egg_steps+base_happiness+base_total+capture_rate+experience_growth+leg_abil, data=pkmn2, method='class')
rpart.plot(tree)
ggplot(pkmn2) +
geom_point(aes(base_egg_steps, capture_rate, color=is_legendary))
The decision tree model tells us that base_egg_steps and capture_rate can very well classify legendary pokemon. This is evident from the scatter plot where non-legendary and legendary pokemon are neatly separated just by two features. This separation can even be cleaner with extra features.