Introduction

The Serebii website provides a list, called the National Pokedex, of all the Pokemon in all the games. The table breaks down each Pokemon with a unique ID number, name, type, abilities and base stats such as attack and defense.

Data Sources: Pokemon Stats

Lets determine a frequency chart of which Pokemon fall into which types they are.

Required Libraries

library(tidyverse)
library(rvest)
library(xml2)
library(janitor)

Web Scrape Table into Data Frame

At first, when the table was web scraped, it brought in the values into a Data Frame, however, I quickly realized that the abilities listed multiple ones and the rvest() package cleaned it up by adding a space between them. This causes issues as some abilities are multiple words so to string manipulate them, I needed to account for the built-in HTML line breaks (
). To do this, I used XML to add an intentional new line character (“\n”), so that I could be able to break them apart later.

Using XML

url <- "https://www.serebii.net/pokemon/nationalpokedex.shtml"

web_table <- read_html(url) 

# use XML to account for <br> with abilities and add '\n'
xml_find_all(web_table, ".//br") |> 
  xml_add_sibling("p", "\n")
xml_find_all(web_table, ".//br") |> 
  xml_remove() 

web_table <- 
  web_table |> 
  html_element('.dextable') |> 
  html_table()

pokemon_stats <- as.data.frame(web_table)

knitr::kable(head(pokemon_stats))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
No. Pic Name Type Abilities Base Stats Base Stats Base Stats Base Stats Base Stats Base Stats NA
No. Pic Name Type Abilities HP Att Def S.Att S.Def Spd NA
#00001 Bulbasaur Overgrow
Chlorophyll 45 49 49 65 65 45
NA NA NA NA NA NA NA NA NA NA NA
#00002 Ivysaur Overgrow
Chlorophyll 60 62 63 80 80 60
NA NA NA NA NA NA NA NA NA NA NA

Tidy Up Data

Every other row that was brought into the Data Frame had N/A values. I removed these rows where if any Pokemon didn’t have a name, it was dropped. The columns between Name and the Base Stats also were shifted to the right by one column. Using a vector, I moved the columns to the appropriate alignment and followed it by using the janitor package to clean up column names. To account for multiple abilities, the data frame was then unpivoted into a longer format for each Pokemon.

Note: Pokemon types are missing and the column was dropped. It will be explained in the following section.

# drop null values if Pokemon name is N/A
stats_df <- 
  pokemon_stats |> 
  drop_na(4)

# drop first row (duplicate header) and second column (pic)
stats_df <- stats_df[-1,-2]

# set column headers from first row and clean names
stats_df <- 
  stats_df |> 
  row_to_names(row_number = 1) |> 
  clean_names()

# shift pokemon names, etc to left by 1 column
stats_df[c(2:10)] = stats_df[, c(3:11)]

# drop 'na' column
stats_df <- 
  stats_df |> 
  select(!c(na, type))

# split multiple abilities into long format based on created '\n'
stats_df <- 
  stats_df |> 
  separate_longer_delim(abilities, delim = "\n")

# change to pokemon number
stats_df$no <-
  parse_number(stats_df$no)

knitr::kable(head(stats_df))
no name abilities hp att def s_att s_def spd
1 Bulbasaur Overgrow 45 49 49 65 65 45
1 Bulbasaur Chlorophyll 45 49 49 65 65 45
2 Ivysaur Overgrow 60 62 63 80 80 60
2 Ivysaur Chlorophyll 60 62 63 80 80 60
3 Venusaur Overgrow 80 82 83 100 100 80
3 Venusaur Chlorophyll 80 82 83 100 100 80

Pokemon Types Missing

While working on the Data Frame, I realized the Pokemon types were missing. Looking into the HTML code, the types were not text, but images with no conventional tag name for each one. Because of this, I used another website that provides that information and web scraped similarly into another Data Frame.

url <- "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"

web_table <- read_html(url) 

# use XML to account for <br> and replace with '\n'
xml_find_all(web_table, ".//br") |> 
  xml_add_sibling("p", "\n")
xml_find_all(web_table, ".//br") |> 
  xml_remove() 

web_table <- 
  web_table |> 
  html_element('body') |> 
  html_table()

pokemon_types <- as.data.frame(web_table)

knitr::kable(head(pokemon_types))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38
Shortcuts

Ndex Olddex Natdex |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |Ndex |MS |Pokémon |Type |Type |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0001 | |Bulbasaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0002 | |Ivysaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0003 | |Venusaur |Grass |Poison |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA | |#0004 | |Charmander |Fire |Fire |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |NA |

Since each Pokemon has a unique identifier, I created the data frame to relate this ID to its types.

# drop null values if Pokemon name is N/A
types_df <- 
  pokemon_types |> 
  drop_na(2)

# drop unnecessary columns
types_df <- 
  types_df[, 1:5]

# set column headers from first row and clean names
types_df <- 
  types_df |> 
  row_to_names(row_number = 1) |> 
  clean_names()

# change to pokemon number
types_df$ndex <-
  parse_number(types_df$ndex)

# drop N/A or zero (0) while keeping only distinct pokemon numbers
types_df <- 
  types_df |> 
  drop_na() |> 
  filter(ndex != 0) |> 
  distinct(ndex, .keep_all=TRUE)

# within same pokemon number, replace repeated types with N/A
types_df <- 
  types_df |> 
  mutate(type_2 = if_else(type_2 != type, type_2, NA)) |> 
  select(-c(2)) |> 
  rename(no = ndex)

# melt both type columns into one column
temp1 <- 
  types_df |> 
  select(1:3)

temp2 <- 
  types_df |> 
  select(1,2,4) |> 
  rename(type = type_2)

types_df <- 
  temp1 |> 
  full_join(temp2) |>
  drop_na() |> 
  select(!pokemon) |> 
  arrange(no)

knitr::kable(head(types_df))
no type
1 Grass
1 Poison
2 Grass
2 Poison
3 Grass
3 Poison

Merge Pokemon Stats and Types Tables

Using a similar way of SQL joins, I merged the Pokemon to its types and stats.

stats_types_df <-  
  stats_df |> 
  inner_join(types_df) |> 
  relocate(type, .after = name)

knitr::kable(head(stats_types_df))
no name type abilities hp att def s_att s_def spd
1 Bulbasaur Grass Overgrow 45 49 49 65 65 45
1 Bulbasaur Poison Overgrow 45 49 49 65 65 45
1 Bulbasaur Grass Chlorophyll 45 49 49 65 65 45
1 Bulbasaur Poison Chlorophyll 45 49 49 65 65 45
2 Ivysaur Grass Overgrow 60 62 63 80 80 60
2 Ivysaur Poison Overgrow 60 62 63 80 80 60

Pokemon Type Frequency

The frequency chart shows us that Water, Normal and Grass types are the most common among all the Pokemon. One caveat to point out, is that a Pokemon could have up to two different types. This means that a Pokemon can appear twice in different categories as a result in here.

grouped_type <-  
  stats_types_df |> 
  distinct(name, type) |> 
  group_by(type) |> 
  summarise(count = n()) |> 
    arrange(desc(count))

grouped_type |> 
  ggplot(aes(x = count, y = reorder(type, count))) +
  geom_bar(stat = 'identity')

knitr::kable(grouped_type)
type count
Water 154
Normal 130
Grass 122
Flying 109
Psychic 99
Bug 92
Fire 80
Poison 79
Ground 75
Rock 73
Fighting 72
Dark 69
Electric 68
Dragon 65
Fairy 63
Steel 63
Ghost 62
Ice 48