This week we’ll be doing a deeper investigation into our documentation and the importance of referencing said documentation for the data we’re using.
For reference, here is the site our data set comes from: Pokemon Data Set | Kaggle.com
Firstly, let’s identify a few columns (or values) in our data that may be unclear until reading the documentation.
## [1] "abilities" "against_bug" "against_dark"
## [4] "against_dragon" "against_electric" "against_fairy"
## [7] "against_fight" "against_fire" "against_flying"
## [10] "against_ghost" "against_grass" "against_ground"
## [13] "against_ice" "against_normal" "against_poison"
## [16] "against_psychic" "against_rock" "against_steel"
## [19] "against_water" "attack" "base_egg_steps"
## [22] "base_happiness" "base_total" "capture_rate"
## [25] "classfication" "defense" "experience_growth"
## [28] "height_m" "hp" "japanese_name"
## [31] "name" "percentage_male" "pokedex_number"
## [34] "sp_attack" "sp_defense" "speed"
## [37] "type1" "type2" "weight_kg"
## [40] "generation" "is_legendary"
Even assuming a reader had some knowledge of the Pokemon franchise, some confusion may be expected looking at the following column names:
The “against_x” columns are confusing even when looking at the full table, because it is hard to tell what their values denote. Even if you’re knowledgeable about type strengths and weaknesses, it’s difficult to know whether these values convey ‘strong against’ or ‘weak against’, and how you would read them. The author more than likely encoded these columns in this way to mimic the type matchups on the official Pokemon website: Pokemon Type Matchups | Pokemon.net. The difference is that the official site color-codes the different values and provides a key that explains them.
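As a minimal sketch of one way to read these columns, the snippet below assumes (as the Kaggle documentation appears to describe) that each against_x value is the multiplier applied to damage the Pokemon takes from an attack of type x. The three rows are a small hand-typed subset for illustration, and `read_matchup` is a hypothetical helper, not part of the data set or any package.

```r
# Hand-typed toy subset; the real values would come from the Kaggle CSV.
pokemon <- data.frame(
  name          = c("Bulbasaur", "Charmander", "Squirtle"),
  against_fire  = c(2, 0.5, 0.5),
  against_water = c(0.5, 2, 0.5),
  against_grass = c(0.25, 0.5, 2)
)

# Assumption: the value multiplies the damage the Pokemon *takes*.
# Above 1 reads as "weak to", below 1 as "resists", exactly 1 as "neutral".
read_matchup <- function(multiplier) {
  ifelse(multiplier > 1, "weak to",
         ifelse(multiplier < 1, "resists", "neutral"))
}

# Translate every against_* column into a readable label.
against_cols <- grep("^against_", names(pokemon), value = TRUE)
matchups <- as.data.frame(lapply(pokemon[against_cols], read_matchup))
matchups <- cbind(name = pokemon$name, matchups)
print(matchups)
```

Under this reading, a value of 2 in against_fire means the Pokemon takes double damage from fire attacks, which plays the role of the color key on the official site.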
“Percentage_male” may be confusing if you expect a matching “percentage_female” column. Presumably the creator of the data set expects the reader to subtract the male percentage from 100 to get the female percentage. However, the column’s significance may seem questionable altogether if the reader is unaware that some Pokemon have physical variations depending on gender.
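A short sketch of that subtraction, using a hand-typed toy subset rather than the full file. The `percentage_female` column is a name introduced here for illustration, and treating a missing `percentage_male` as "no gender recorded" is an assumption, not something the documentation states.

```r
# Toy rows for illustration; NA is assumed to mean no gender ratio is recorded.
pokemon <- data.frame(
  name            = c("Bulbasaur", "Pikachu", "Magnemite"),
  percentage_male = c(88.1, 50, NA)
)

# Derive the complementary share; NA propagates, so rows without a
# recorded ratio stay missing instead of becoming 100% female.
pokemon$percentage_female <- 100 - pokemon$percentage_male
print(pokemon)
```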
An element of the data set that remains confusing even after investigating the documentation is the ‘experience_growth’ column. As background, Pokemon can gain levels up to level 100, and leveling up affects their stats. Experience points (EXP) are used to gauge a Pokemon’s progress toward its next level. However, the documentation describes the experience_growth column only as “The experience growth of the Pokemon”. It is unclear whether these values track the EXP needed to gain one level or several.
In our boxplot of experience growth by generation, we can identify outliers within each generation. It is especially difficult to tell what is causing the outliers below the lower whiskers.
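For readers without the plot in front of them, here is a rough sketch of how boxplot-style outliers could be listed per generation. It assumes the usual 1.5 × IQR rule; the values are hand-typed for illustration, and `is_outlier` is a hypothetical helper rather than the code used to draw the boxplot.

```r
# Toy data: eight Pokemon in generation 1 and six in generation 2.
pokemon <- data.frame(
  generation = rep(c(1, 2), times = c(8, 6)),
  experience_growth = c(
    1000000, 1000000, 1000000, 1059860, 1059860, 1059860, 1250000, 600000,
    1000000, 1000000, 1059860, 1059860, 1250000, 1250000
  )
)

# Flag values more than 1.5 * IQR outside the quartiles (the boxplot rule).
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  !is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)
}

# Apply the rule within each generation and keep only the flagged rows.
outliers <- do.call(
  rbind,
  lapply(split(pokemon, pokemon$generation),
         function(d) d[is_outlier(d$experience_growth), ])
)
print(outliers)
```

Listing the flagged rows next to their names and types is one way to start investigating what the low outliers have in common.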
Furthermore, if we’re concerned about growth, we can check whether there is a relationship between the steps it takes to hatch an egg and experience growth. Our prediction here is that Pokemon that require more steps to hatch will also require more experience to grow.
Here we can see that there’s no strong linear correlation visible in our plot, so the data do not support our hypothesis that Pokemon requiring more steps to hatch also require more experience. In other words, you can’t say that Pokemon that are hard to hatch are hard to level up!
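The visual impression can be backed with a number. The sketch below computes a Pearson and a Spearman correlation with base R’s `cor()`; the rows are hand-typed toy values, so the resulting coefficients say nothing about the real data set.

```r
# Toy values for illustration only; one row has a missing step count.
pokemon <- data.frame(
  base_egg_steps    = c(5120, 5120, 3840, 2560, 10240, 30720, NA),
  experience_growth = c(1059860, 1000000, 1250000, 800000,
                        600000, 1250000, 1000000)
)

# Pearson measures linear association; rows with missing values are dropped.
r_pearson <- cor(pokemon$base_egg_steps, pokemon$experience_growth,
                 use = "complete.obs")

# Spearman is rank-based, so it also picks up monotone, non-linear trends.
r_spearman <- cor(pokemon$base_egg_steps, pokemon$experience_growth,
                  use = "complete.obs", method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Coefficients near 0 would agree with what the scatter plot shows, while values near 1 or -1 would contradict it.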
Two categorical columns in our data set are ‘type1’ and ‘type2’. The values in these columns describe a Pokemon’s type affiliation. The set of types is limited, and the types are not inherently ranked. Readers may notice that while there are no missing type1 values, there are over 300 missing type2 values. This may look like an oversight at first glance, but the documentation and an understanding of the game explain it: every Pokemon must have at least one type, while a second type is optional, so a missing type2 simply marks a single-type Pokemon.
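A small sketch of how the missing values in the two type columns could be counted. It assumes a missing second type appears as an empty field in the CSV, which is why `na.strings = ""` is passed to `read.csv()`; the four rows are typed inline for illustration instead of being read from the Kaggle file.

```r
# Inline toy CSV; single-type Pokemon have an empty type2 field (assumption).
csv_text <- "name,type1,type2
Bulbasaur,grass,poison
Charmander,fire,
Squirtle,water,
Pidgey,normal,flying"

# Treat empty fields as NA so they can be counted with is.na().
pokemon <- read.csv(text = csv_text, na.strings = "",
                    stringsAsFactors = FALSE)

# Count missing values per type column and list the single-type Pokemon.
missing_by_col <- colSums(is.na(pokemon[c("type1", "type2")]))
single_type <- pokemon$name[is.na(pokemon$type2)]

print(missing_by_col)
print(single_type)
```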
Lastly, looking at one of our continuous columns, ‘weight_kg’, we can attempt to identify an outlier.
To begin with, there are a few missing values across generations, especially in generations 1, 6 and 7. Even if we were to visualize this column, several factors would cause variation. For example, the number of legendaries in a generation can skew its average weight quite a bit (on average, legendary Pokemon tend to be heavier and generally larger than non-legendaries). What we can say is that different generations will have different outliers once we group the data. For example, in generation 3, where the average weight is about 67 kg, even the smallest outlier is above 180 kg.
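To make the grouping concrete, here is a minimal sketch that counts missing weights and compares mean weight by generation and by legendary status. The seven rows are a hand-typed toy subset (the NA is placed by hand for illustration), so the means it prints will not match the full data set’s figures.

```r
# Toy subset for illustration; is_legendary uses 1 for legendary, 0 otherwise.
pokemon <- data.frame(
  name         = c("Bulbasaur", "Charmander", "Rattata", "Mewtwo",
                   "Treecko", "Torchic", "Groudon"),
  generation   = c(1, 1, 1, 1, 3, 3, 3),
  is_legendary = c(0, 0, 0, 1, 0, 0, 1),
  weight_kg    = c(6.9, 8.5, NA, 122, 5.0, 2.5, 950)
)

# How many weights are missing in each generation?
missing_by_gen <- tapply(is.na(pokemon$weight_kg), pokemon$generation, sum)

# Mean weight per generation, ignoring the missing values.
mean_by_gen <- tapply(pokemon$weight_kg, pokemon$generation,
                      mean, na.rm = TRUE)

# Splitting by legendary status shows how much legendaries pull the mean up.
mean_by_gen_legend <- tapply(
  pokemon$weight_kg,
  list(generation = pokemon$generation, is_legendary = pokemon$is_legendary),
  mean, na.rm = TRUE
)

print(missing_by_gen)
print(mean_by_gen)
print(mean_by_gen_legend)
```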