Below, the packages required for data analysis and visualization are loaded.
library(tidyverse)
library(magrittr)
library(zoo)
library(DT)
Serebii.net keeps a list of all the Pokémon in the various games in the franchise. The data on this site is in a wide format where each Pokémon gets its own unique row in the table. Here’s a snapshot:
Below, we load the HTML data from the site and extract the table of interest into a data frame. Then we export that data frame to .csv and reload it from a copy hosted on GitHub.
my_url1 <- "https://www.serebii.net/pokemon/nationalpokedex.shtml"
dat <- try(xml2::read_html(my_url1))
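if (inherits(dat, "try-error")){
# break is only valid inside a loop, so signal the failure with stop() instead
stop("Could not read the page at ", my_url1)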
}else{
pokemon <- xml2::xml_find_all(dat,
"//table[contains(@class, 'dextable')]")
pokemon <- rvest::html_table(pokemon)[[1]]
}
write.csv(pokemon, "pokemon.csv", row.names = FALSE)
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data607_project2/main/pokemon.csv"
pokemon_new <- read.csv(my_url2)
as_tibble(pokemon_new)
## # A tibble: 2,022 × 12
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 "No." "Pic" "Name" Type "Abi… Base… Base… Base… Base… Base… Base… NA
## 2 "No." "Pic" "Name" Type "Abi… HP Att Def S.Att S.Def Spd NA
## 3 "#00001" "" "" Bulbas… "" Over… 45 49 49 65 65 45
## 4 "" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA
## 5 "#00002" "" "" Ivysaur "" Over… 60 62 63 80 80 60
## 6 "" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA
## 7 "#00003" "" "" Venusa… "" Over… 80 82 83 100 100 80
## 8 "" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA
## 9 "#00004" "" "" Charma… "" Blaz… 39 52 43 60 50 65
## 10 "" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA
## # … with 2,012 more rows
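The fetch step above wraps the page read in `try`. As a minimal sketch of the same idea with `tryCatch`, the hypothetical helper `fetch_or_null` below (not part of the original analysis) returns `NULL` when the reader function fails, so the caller can decide what to do next. The `reader` argument and the two stand-in readers are assumptions made so the sketch runs without a network connection; in this document the reader would be `xml2::read_html`.

```r
# Hypothetical helper: call reader(url) and return NULL instead of erroring.
fetch_or_null <- function(url, reader) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in readers so the sketch runs offline (assumptions, not real scrapers).
failing_reader <- function(url) stop("connection refused")
working_reader <- function(url) paste("contents of", url)

bad <- fetch_or_null("https://example.com/missing", failing_reader)
good <- fetch_or_null("https://example.com/page", working_reader)
```

A caller can then test `is.null(bad)` and stop or skip the remaining steps explicitly.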
Because of how merged cells are handled by the function used to extract the table data, some of the info we’re interested in doesn’t come through correctly. We will reprocess some of the data from the site differently later to fix that. First, we clean up the data we do have by removing several columns and rows and shifting things around a bit. We also rename the columns we intend to keep.
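# Drop the first of the two header rows.
pokemon_new <- pokemon_new[-1, ]
# Drop columns X2 and X3, which are empty in the data rows, then the empty X5.
pokemon_new <- pokemon_new[, -2]
pokemon_new <- pokemon_new[, -2]
pokemon_new <- pokemon_new[, -3]
cols <- c("Number", "Name", "Abilities", "Base_Hit_Points", "Base_Attack", "Base_Defense", "Base_Special_Attack", "Base_Special_Defense", "Base_Speed")
colnames(pokemon_new) <- cols
# Drop the second header row and the blank spacer rows between Pokémon.
pokemon_new <- pokemon_new[-1, ]
pokemon_new %<>%
filter(!is.na(Name))
# Abilities did not come through correctly, so drop that column for now.
pokemon_new <- pokemon_new[, -3]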
datatable(pokemon_new[1:10, ], rownames = FALSE,
options = list(scrollX = TRUE))
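The cleanup above can be hard to follow because the columns are removed by position. The following is a minimal base-R sketch of the same idea on a toy data frame: drop the two header rows, rename the columns, and drop the blank spacer rows. The data frame `raw` is an invented three-column stand-in for the scraped table, not the real data.

```r
# Toy stand-in for the scraped table: two header rows, then data rows
# separated by blank spacer rows (all values here are illustrative).
raw <- data.frame(
  X1 = c("No.", "No.", "#00001", "", "#00002", ""),
  X4 = c("Type", "Type", "Bulbasaur", NA, "Ivysaur", NA),
  X7 = c("Base Stats", "HP", "45", NA, "60", NA)
)

clean <- raw[-(1:2), ]                          # drop both header rows
names(clean) <- c("Number", "Name", "Base_Hit_Points")
clean <- clean[!is.na(clean$Name), ]            # drop the blank spacer rows
rownames(clean) <- NULL                         # reset the row labels
```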
Pokémon Type is the first column of data we’re interested in that did not come through correctly, and Abilities is the second. Both are stored as links, so we can retrieve them from the HTML data separately and store them in another data frame. Each Pokémon’s Name is also stored as a link, and it precedes the Type and Ability links for that Pokémon. Later, we’ll combine the two data frames based on the Names.
links <- xml2::xml_find_all(dat, "//td[@class='fooinfo']/a")
attrs <- t(as.data.frame(xml2::xml_attrs(links)))
rownames(attrs) <- NULL
attrs <- cbind(attrs, as.data.frame(matrix(nrow = nrow(attrs), ncol = 2)))
cols <- c("Name", "Types", "Abilities")
colnames(attrs) <- cols
as_tibble(attrs)
## # A tibble: 4,916 × 3
## Name Types Abilities
## <chr> <lgl> <lgl>
## 1 /pokemon/bulbasaur NA NA
## 2 /pokemon/type/grass NA NA
## 3 /pokemon/type/poison NA NA
## 4 /abilitydex/overgrow.shtml NA NA
## 5 /abilitydex/chlorophyll.shtml NA NA
## 6 /pokemon/ivysaur NA NA
## 7 /pokemon/type/grass NA NA
## 8 /pokemon/type/poison NA NA
## 9 /abilitydex/overgrow.shtml NA NA
## 10 /abilitydex/chlorophyll.shtml NA NA
## # … with 4,906 more rows
We loop through the link text we’ve retrieved and move the type-related link text to the Types column and the ability-related link text to the Abilities column. We leave the name-related link text in the Name column. We put NA in the cells we removed link text from so that we can easily clean up cells without data later.
for (i in seq_len(nrow(attrs))){
if (any(str_detect(attrs[i, 1], "/type/"))){
attrs[i, 2] <- attrs[i, 1]
attrs[i, 1] <- NA
}else if (any(str_detect(attrs[i, 1], "/abilitydex/"))){
attrs[i, 3] <- attrs[i, 1]
attrs[i, 1] <- NA
}
}
as_tibble(attrs)
## # A tibble: 4,916 × 3
## Name Types Abilities
## <chr> <chr> <chr>
## 1 /pokemon/bulbasaur <NA> <NA>
## 2 <NA> /pokemon/type/grass <NA>
## 3 <NA> /pokemon/type/poison <NA>
## 4 <NA> <NA> /abilitydex/overgrow.shtml
## 5 <NA> <NA> /abilitydex/chlorophyll.shtml
## 6 /pokemon/ivysaur <NA> <NA>
## 7 <NA> /pokemon/type/grass <NA>
## 8 <NA> /pokemon/type/poison <NA>
## 9 <NA> <NA> /abilitydex/overgrow.shtml
## 10 <NA> <NA> /abilitydex/chlorophyll.shtml
## # … with 4,906 more rows
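The loop above visits one row at a time. As a hedged alternative, the same routing can be sketched with logical vectors in base R. The toy data frame below is an assumption that mimics the first few rows of `attrs`; it is not the scraped data.

```r
# Toy version of attrs: every link starts out in the Name column.
toy <- data.frame(
  Name = c("/pokemon/bulbasaur", "/pokemon/type/grass",
           "/pokemon/type/poison", "/abilitydex/overgrow.shtml"),
  Types = NA_character_,
  Abilities = NA_character_
)

# Flag each kind of link once for the whole column.
is_type <- grepl("/type/", toy$Name, fixed = TRUE)
is_ability <- grepl("/abilitydex/", toy$Name, fixed = TRUE)

# Move the flagged links to their own columns, then blank them out of Name.
toy$Types[is_type] <- toy$Name[is_type]
toy$Abilities[is_ability] <- toy$Name[is_ability]
toy$Name[is_type | is_ability] <- NA
```

On a table with thousands of rows this avoids the per-row assignments, though the loop is perfectly adequate at this scale.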
We remove unnecessary link text from the Name, Types, and Abilities columns so only the character values we’re interested in are left.
attrs$Name <- str_replace_all(attrs$Name, "/pokemon/", "")
attrs$Types <- str_replace_all(attrs$Types, "/pokemon/type/", "")
attrs$Abilities <- str_replace_all(attrs$Abilities, "/abilitydex/", "")
attrs$Abilities <- str_replace_all(attrs$Abilities, "\\.shtml", "")
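The four replacements above each strip one known prefix or suffix. A more general sketch, assuming every value we want is the last path segment of its link, is shown below. The helper name `strip_link` is hypothetical and the input vector is illustrative.

```r
# Hypothetical helper: keep the last path segment and drop a trailing ".shtml".
strip_link <- function(x) {
  last_segment <- sub("^.*/", "", x)      # remove everything up to the last slash
  sub("\\.shtml$", "", last_segment)      # remove the file extension, if present
}

links <- c("/pokemon/bulbasaur", "/pokemon/type/grass",
           "/abilitydex/overgrow.shtml", "/pokemon/mr.mime", NA)
cleaned <- strip_link(links)
```

Because `sub` passes `NA` through unchanged, the helper can be applied to columns that already contain missing values.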
Each Pokémon can have multiple Types and Abilities, and each Pokémon’s Name precedes its Types and Abilities. So we want to carry the last Name observation forward for every NA value in the Name column.
attrs$Name <- na.locf(attrs$Name)
as_tibble(attrs)
## # A tibble: 4,916 × 3
## Name Types Abilities
## <chr> <chr> <chr>
## 1 bulbasaur <NA> <NA>
## 2 bulbasaur grass <NA>
## 3 bulbasaur poison <NA>
## 4 bulbasaur <NA> overgrow
## 5 bulbasaur <NA> chlorophyll
## 6 ivysaur <NA> <NA>
## 7 ivysaur grass <NA>
## 8 ivysaur poison <NA>
## 9 ivysaur <NA> overgrow
## 10 ivysaur <NA> chlorophyll
## # … with 4,906 more rows
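To make the "last observation carried forward" step concrete, here is a small base-R sketch of what forward filling does. `fill_forward` is a hypothetical helper written for illustration; it is not how `na.locf` is implemented, and unlike the default call above it keeps leading NA values rather than dropping them.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value.
fill_forward <- function(x) {
  # Position of the latest non-NA value seen so far (0 if none yet).
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA   # leading NAs have nothing to carry forward
  x[idx]
}

names_col <- c("bulbasaur", NA, NA, "ivysaur", NA)
filled <- fill_forward(names_col)

leading_na <- fill_forward(c(NA, "a", NA))
```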
Rows with NA in both the Types and Abilities columns are unnecessary, so we sum the NA values per row and filter out rows with 2 NAs.
rS <- rowSums(is.na(attrs))
attrs <- cbind(attrs, rS)
attrs %<>%
filter(rS < 2)
attrs <- attrs[, -4]
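The same filter can be written without the temporary `rS` column by testing the two columns directly. This is a minimal sketch on invented toy rows, not the real `attrs` data.

```r
# Toy rows: the first and last have NA in both Types and Abilities.
toy <- data.frame(
  Name = c("bulbasaur", "bulbasaur", "bulbasaur", "ivysaur"),
  Types = c(NA, "grass", NA, NA),
  Abilities = c(NA, NA, "overgrow", NA)
)

# Keep a row unless both Types and Abilities are missing.
keep <- !(is.na(toy$Types) & is.na(toy$Abilities))
toy_kept <- toy[keep, ]
```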
Now we pivot the Types and Abilities data into a longer format where Type and Ability are values of a new Attribute column, and we filter out rows with any NAs.
attrs %<>%
pivot_longer(cols = !starts_with("N"),
names_to = "Attribute",
values_to = "Value")
attrs$Attribute <- str_replace_all(attrs$Attribute, "Types", "Type")
attrs$Attribute <- str_replace_all(attrs$Attribute, "Abilities", "Ability")
rS <- rowSums(is.na(attrs))
attrs <- cbind(attrs, rS)
attrs %<>%
filter(rS < 1)
attrs <- attrs[, -4]
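As a hedged alternative to counting NAs after the pivot, `pivot_longer` can drop them during the pivot through its `values_drop_na` argument. The sketch below uses an invented two-row data frame to show the effect.

```r
library(tidyr)

# Toy rows: each has a value in exactly one of Types or Abilities.
toy <- data.frame(
  Name = c("bulbasaur", "bulbasaur"),
  Types = c("grass", NA),
  Abilities = c(NA, "overgrow")
)

long <- pivot_longer(
  toy,
  cols = c(Types, Abilities),
  names_to = "Attribute",
  values_to = "Value",
  values_drop_na = TRUE   # drop rows whose Value would be NA
)
```

Each input row would normally produce two output rows; with `values_drop_na = TRUE` only the non-missing one is kept.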
We want the Names in the attrs data frame to match the Names in the pokemon_new data frame. So we replace the values in the former with the values from the latter. Most of these involve simple case changes; we handle a few that are more complicated separately.
# Names in pokemon_new with no lowercase match in attrs are collected here.
replacements <- c()
for (i in seq_len(nrow(pokemon_new))){
proper <- as.character(pokemon_new[i, 2])
improper <- tolower(proper)
if (improper %in% attrs$Name){
attrs$Name <- str_replace_all(attrs$Name,
paste("^", improper, "$", sep = ""),
proper)
}else{
replacements <- append(replacements, proper)
}
}
not_replaced <- c("nidoranf", "nidoranm", "mr.mime", "mimejr.", "flabebe",
"type:null", "tapukoko", "tapulele", "tapubulu",
"tapufini", "mr.rime", "greattusk", "screamtail",
"brutebonnet", "fluttermane", "slitherwing", "sandyshocks",
"irontreads", "ironbundle", "ironhands", "ironjugulis",
"ironmoth", "ironthorns", "roaringmoon", "ironvaliant",
"walkingwake", "ironleaves")
# not_replaced must list the attrs spellings in the same order as replacements.
for (i in seq_along(not_replaced)){
proper <- replacements[i]
improper <- not_replaced[i]
attrs$Name <- str_replace_all(attrs$Name,
paste("^", improper, "$", sep = ""),
proper)
}
datatable(attrs[1:10, ], rownames = FALSE)
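The second loop above pairs `not_replaced` with `replacements` by position, which only works while both vectors stay in the same order. A less order-sensitive sketch uses a named lookup vector. It assumes that many mismatches differ only by case and spaces; the helper `key` and the toy vectors are hypothetical, and names that differ in other ways would still need manual entries in the lookup.

```r
# Toy inputs (illustrative): link-style names and their display spellings.
attr_names <- c("bulbasaur", "mr.mime", "type:null", "nidoranf")
proper_names <- c("Bulbasaur", "Mr. Mime", "Type: Null")

# Hypothetical normalisation: lowercase and remove spaces.
key <- function(x) gsub(" ", "", tolower(x), fixed = TRUE)

# Named vector mapping normalised spelling -> display spelling.
lookup <- setNames(proper_names, key(proper_names))

# Look every name up; fall back to the original when there is no match.
matched <- unname(lookup[attr_names])
fixed_names <- ifelse(is.na(matched), attr_names, matched)
```

Anything left unchanged, such as `nidoranf` here, is easy to list afterwards with `attr_names[is.na(matched)]`.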
We now pivot the pokemon_new data frame into a longer format as well, storing each Base_Stat as a value of Attribute to match the format of the attrs data frame.
pokemon_new %<>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = !starts_with("N"),
names_to = "Attribute",
values_to = "Value")
datatable(pokemon_new[1:10, ], rownames = FALSE)
Before we join the data frames, we can already observe the frequency of Pokémon by Type for all generations of games.
attrs_analysis <- attrs
attrs_analysis %<>%
filter(Attribute == "Type") %>%
group_by(Value) %>%
summarize(Count = n())
types_colors <- c("#6390F0", "#A8A77A", "#7AC74C",
"#A98FF3", "#F95587", "#A6B91A", "#EE8130", "#A33EA1", "#E2BF65",
"#B6A136", "#C22E28", "#705746", "#F7D02C", "#6F35FC", "#B7B7CE",
"#D685AD", "#735797", "#96D9D6")
names(types_colors) <- c("water", "normal", "grass", "flying",
"psychic", "bug", "fire", "poison",
"ground", "rock", "fighting", "dark",
"electric", "dragon", "steel", "fairy",
"ghost", "ice")
ggplot(attrs_analysis, aes(x = reorder(Value, Count), y = Count,
fill = Value)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_manual(values = types_colors) +
ggtitle("Number of Pokemon by Type: All Generations") +
xlab("Type") +
ylab("Count")
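The counting step before the plot can also be sketched with `dplyr::count`, which combines the grouping and the summary in one call. The toy data frame below is invented for illustration and is not the real `attrs` data.

```r
library(dplyr)

# Toy long-format rows in the same shape as attrs.
toy <- data.frame(
  Name = c("bulbasaur", "bulbasaur", "bulbasaur", "ivysaur", "squirtle"),
  Attribute = c("Type", "Type", "Ability", "Type", "Type"),
  Value = c("grass", "poison", "overgrow", "grass", "water")
)

# Keep the Type rows and count each Type, most frequent first.
type_counts <- toy %>%
  filter(Attribute == "Type") %>%
  count(Value, name = "Count", sort = TRUE)
```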
There are more Water-type Pokémon than Pokémon of any other Type. Water has been present since the first generation of games, unlike, say, the Fairy Type, which did not exist until the sixth generation. However, the Types of some Pokémon from the earlier generations have been updated to include the newer Types as the games have evolved.
The first generation of games contained only 151 Pokémon, which we analyze separately after we do a full join on the data frames.
pokemon_new_analysis <- pokemon_new
pokemon_new_analysis$Number <- as.integer(str_replace_all(
pokemon_new_analysis$Number, "#", ""))
pokemon_new_analysis %<>%
full_join(attrs, by = c("Name", "Attribute", "Value")) %>%
arrange(Name)
pokemon_new_analysis$Number <- na.locf(pokemon_new_analysis$Number)
pokemon_new_analysis %<>%
arrange(Number)
datatable(pokemon_new_analysis[1:20, ], rownames = FALSE,
options = list(pageLength = 10))
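Because the Attribute values in the two data frames never overlap, the full join above effectively stacks them, leaving Number missing for the rows that came from `attrs`. The sketch below reproduces that on toy data and fills Number within each Name using `tidyr::fill`, which is a swapped-in alternative to the sort-then-`na.locf` approach. The data frames `stats` and `types` are illustrative stand-ins.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the long-format stats and the Type attributes.
stats <- data.frame(
  Number = c(1L, 1L, 2L, 2L),
  Name = c("Bulbasaur", "Bulbasaur", "Ivysaur", "Ivysaur"),
  Attribute = c("Base_Hit_Points", "Base_Attack",
                "Base_Hit_Points", "Base_Attack"),
  Value = c("45", "49", "60", "62")
)
types <- data.frame(
  Name = c("Bulbasaur", "Ivysaur"),
  Attribute = c("Type", "Type"),
  Value = c("grass", "grass")
)

joined <- stats %>%
  full_join(types, by = c("Name", "Attribute", "Value")) %>%
  group_by(Name) %>%
  fill(Number, .direction = "downup") %>%   # copy Number within each Name
  ungroup() %>%
  arrange(Number)
```

Grouping by Name before filling means the result does not depend on the row order produced by the join.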
pokemon_new_analysis_gen1 <- pokemon_new_analysis
pokemon_new_analysis_gen1 %<>%
filter(Number < 152) %>%
filter(Attribute == "Type") %>%
group_by(Value) %>%
summarize(Count = n())
ggplot(pokemon_new_analysis_gen1, aes(x = reorder(Value, Count), y = Count,
fill = Value)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_manual(values = types_colors) +
ggtitle("Number of Pokemon by Type: Generation 1") +
xlab("Type") +
ylab("Count")
There were technically only 15 Types in Generation 1, but recall that the Types of first-generation Pokémon have since been updated. We see here that Poison is the most frequent Type among Generation 1 Pokémon, while Water, the most frequent Type across all generations, is a close contender as the second most frequent.
The interesting thing about there being so many Water Pokémon in Generation 1 is that it certainly doesn’t feel that way when you’re playing the games. Many Water Pokémon need to be caught with a fishing rod that you only get later in the game, and the range of Pokémon you can catch with that rod is very limited even mid-game. Hence, a strategy developed among many early Pokémon players to choose the Water Pokémon Squirtle as their starter Pokémon at the beginning of the game. They did this based on three considerations:
So the large number of Water Pokémon in Generation 1 is not, on its own, enough data to make an informed decision about choosing a starter Pokémon. Like many game decisions, there’s a little bit of art to it.