The Serebii website provides a list, called the National Pokedex, of all the Pokemon in all the games. The table breaks down each Pokemon with a unique ID number, name, type, abilities and base stats such as attack and defense.
Data Sources: Pokemon Stats
Lets determine a frequency chart of which Pokemon fall into which types they are.
library(tidyverse)
library(rvest)
library(xml2)
library(janitor)
At first, when the table was web scraped, it brought in the values
into a Data Frame, however, I quickly realized that the abilities listed
multiple ones and the rvest() package cleaned it up by adding a
space between them. This causes issues as some abilities are multiple
words so to string manipulate them, I needed to account for the built-in
HTML line breaks (
). To do this, I used XML to add an intentional
new line character (“\n”), so that I could be able to break them apart
later.
url <- "https://www.serebii.net/pokemon/nationalpokedex.shtml"
web_table <- read_html(url)
# use XML to account for <br> with abilities and add '\n'
xml_find_all(web_table, ".//br") |>
xml_add_sibling("p", "\n")
xml_find_all(web_table, ".//br") |>
xml_remove()
web_table <-
web_table |>
html_element('.dextable') |>
html_table()
pokemon_stats <- as.data.frame(web_table)
knitr::kable(head(pokemon_stats))
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 |
---|---|---|---|---|---|---|---|---|---|---|---|
No. | Pic | Name | Type | Abilities | Base Stats | Base Stats | Base Stats | Base Stats | Base Stats | Base Stats | NA |
No. | Pic | Name | Type | Abilities | HP | Att | Def | S.Att | S.Def | Spd | NA |
#00001 | Bulbasaur | Overgrow | |||||||||
Chlorophyll | 45 | 49 | 49 | 65 | 65 | 45 | |||||
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
#00002 | Ivysaur | Overgrow | |||||||||
Chlorophyll | 60 | 62 | 63 | 80 | 80 | 60 | |||||
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Every other row that was brought into the Data Frame had N/A values. I removed these rows where if any Pokemon didn’t have a name, it was dropped. The columns between Name and the Base Stats also were shifted to the right by one column. Using a vector, I moved the columns to the appropriate alignment and followed it by using the janitor package to clean up column names. To account for multiple abilities, the data frame was then unpivoted into a longer format for each Pokemon.
Note: Pokemon types are missing and the column was dropped. It will be explained in the following section.
# drop null values if Pokemon name is N/A
stats_df <-
pokemon_stats |>
drop_na(4)
# drop first row (duplicate header) and second column (pic)
stats_df <- stats_df[-1,-2]
# set column headers from first row and clean names
stats_df <-
stats_df |>
row_to_names(row_number = 1) |>
clean_names()
# shift pokemon names, etc to left by 1 column
stats_df[c(2:10)] = stats_df[, c(3:11)]
# drop 'na' column
stats_df <-
stats_df |>
select(!c(na, type))
# split multiple abilities into long format based on created '\n'
stats_df <-
stats_df |>
separate_longer_delim(abilities, delim = "\n")
# change to pokemon number
stats_df$no <-
parse_number(stats_df$no)
knitr::kable(head(stats_df))
no | name | abilities | hp | att | def | s_att | s_def | spd |
---|---|---|---|---|---|---|---|---|
1 | Bulbasaur | Overgrow | 45 | 49 | 49 | 65 | 65 | 45 |
1 | Bulbasaur | Chlorophyll | 45 | 49 | 49 | 65 | 65 | 45 |
2 | Ivysaur | Overgrow | 60 | 62 | 63 | 80 | 80 | 60 |
2 | Ivysaur | Chlorophyll | 60 | 62 | 63 | 80 | 80 | 60 |
3 | Venusaur | Overgrow | 80 | 82 | 83 | 100 | 100 | 80 |
3 | Venusaur | Chlorophyll | 80 | 82 | 83 | 100 | 100 | 80 |
While working on the Data Frame, I realized the Pokemon types were missing. Looking into the HTML code, the types were not text, but images with no conventional tag name for each one. Because of this, I used another website that provides that information and web scraped similarly into another Data Frame.
url <- "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"
web_table <- read_html(url)
# use XML to account for <br> and replace with '\n'
xml_find_all(web_table, ".//br") |>
xml_add_sibling("p", "\n")
xml_find_all(web_table, ".//br") |>
xml_remove()
web_table <-
web_table |>
html_element('body') |>
html_table()
pokemon_types <- as.data.frame(web_table)
knitr::kable(head(pokemon_types))
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 | X32 | X33 | X34 | X35 | X36 | X37 | X38 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Shortcuts |
Ndex Olddex Natdex |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |Ndex |MS |Pokémon |Type |Type |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0001 | |Bulbasaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0002 | |Ivysaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0003 | |Venusaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0004 | |Charmander |Fire |Fire |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |
Since each Pokemon has a unique identifier, I created the data frame to relate this ID to its types.
# drop null values if Pokemon name is N/A
types_df <-
pokemon_types |>
drop_na(2)
# drop unnecessary columns
types_df <-
types_df[, 1:5]
# set column headers from first row and clean names
types_df <-
types_df |>
row_to_names(row_number = 1) |>
clean_names()
# change to pokemon number
types_df$ndex <-
parse_number(types_df$ndex)
# drop N/A or zero (0) while keeping only distinct pokemon numbers
types_df <-
types_df |>
drop_na() |>
filter(ndex != 0) |>
distinct(ndex, .keep_all=TRUE)
# within same pokemon number, replace repeated types with N/A
types_df <-
types_df |>
mutate(type_2 = if_else(type_2 != type, type_2, NA)) |>
select(-c(2)) |>
rename(no = ndex)
# melt both type columns into one column
temp1 <-
types_df |>
select(1:3)
temp2 <-
types_df |>
select(1,2,4) |>
rename(type = type_2)
types_df <-
temp1 |>
full_join(temp2) |>
drop_na() |>
select(!pokemon) |>
arrange(no)
knitr::kable(head(types_df))
no | type |
---|---|
1 | Grass |
1 | Poison |
2 | Grass |
2 | Poison |
3 | Grass |
3 | Poison |
Using a similar way of SQL joins, I merged the Pokemon to its types and stats.
stats_types_df <-
stats_df |>
inner_join(types_df) |>
relocate(type, .after = name)
knitr::kable(head(stats_types_df))
no | name | type | abilities | hp | att | def | s_att | s_def | spd |
---|---|---|---|---|---|---|---|---|---|
1 | Bulbasaur | Grass | Overgrow | 45 | 49 | 49 | 65 | 65 | 45 |
1 | Bulbasaur | Poison | Overgrow | 45 | 49 | 49 | 65 | 65 | 45 |
1 | Bulbasaur | Grass | Chlorophyll | 45 | 49 | 49 | 65 | 65 | 45 |
1 | Bulbasaur | Poison | Chlorophyll | 45 | 49 | 49 | 65 | 65 | 45 |
2 | Ivysaur | Grass | Overgrow | 60 | 62 | 63 | 80 | 80 | 60 |
2 | Ivysaur | Poison | Overgrow | 60 | 62 | 63 | 80 | 80 | 60 |
The frequency chart shows us that Water, Normal and Grass types are the most common among all the Pokemon. One caveat to point out, is that a Pokemon could have up to two different types. This means that a Pokemon can appear twice in different categories as a result in here.
grouped_type <-
stats_types_df |>
distinct(name, type) |>
group_by(type) |>
summarise(count = n()) |>
arrange(desc(count))
grouped_type |>
ggplot(aes(x = count, y = reorder(type, count))) +
geom_bar(stat = 'identity')
knitr::kable(grouped_type)
type | count |
---|---|
Water | 154 |
Normal | 130 |
Grass | 122 |
Flying | 109 |
Psychic | 99 |
Bug | 92 |
Fire | 80 |
Poison | 79 |
Ground | 75 |
Rock | 73 |
Fighting | 72 |
Dark | 69 |
Electric | 68 |
Dragon | 65 |
Fairy | 63 |
Steel | 63 |
Ghost | 62 |
Ice | 48 |