RPubs link: https://rpubs.com/enewington/722829

Presentation link: https://www.loom.com/share/5958c725f69940ffaf8511193808776d

Required packages

# Load Packages 
library(readr) 
library(dplyr)
library(tidyverse)
library(outliers)
library(MVN)
library(magrittr) #pipe operator
library(forecast) # For finding lambda for a Box-Cox transformation 
library(Hmisc) #impute missing

Data

The datasets have been obtained from:

https://data.world/data-society/pokemon-with-stats

https://www.kaggle.com/mylesoneill/pokemon-sun-and-moon-gen-7-stats?select=type-chart.csv NOTE: the dataset 2 source seems to have changed since the data was originally downloaded

The two datasets include statistics on 898 Pokemon. The variable descriptions are as follows:

Pokedex: The unique ID for each pokemon

Name: The name of each pokemon

Type 1: Determines weakness/resistance to attacks

Type 2: Some pokemon are dual type and have a second type

Total: Sum of all stats that come after this, a general guide to how strong a pokemon is

HP: Hit points, or health, defines how much damage a pokemon can withstand before fainting

Attack: The base modifier for normal attacks (e.g. Scratch, Punch)

Defense: The base damage resistance against normal attacks

SP Atk: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)

SP Def: Special defense, the base damage resistance against special attacks

Speed: Determines which pokemon attacks first each round

Generation: The generation of games where the pokemon was first introduced

Legendary: Some pokemon are much rarer than others, and are dubbed “legendary”

Species: as per name

Forme: as per name

Ability 1: an effect on a Pokémon that is not an attack. Some will be active all of the time, while some you will need to choose to use.

Ability 2: Some Pokemon have a second ability

Total: total of HP, attack, defense, spattack, spdefense and speed

Weight: weight of the pokemon

Height: height of the pokemon

dex 1: unknown: dex is usually short for ‘pokedex’. the contents of this variable are not understood

dex 2: as above

class: the species category of the pokemon as listed in the pokedex (e.g. ‘Seed Pokémon’, ‘Lizard Pokémon’)

percent-male: percentage of male pokemon of this species/name

percent-female: percentage of female pokemon of this species/name

egg-group: categories which determine which Pokémon are able to interbreed. A Pokémon may belong to either one or two Egg Groups.

egg-group2: as above

Date.first.spawned: date the pokemon was first discovered

Read data

The csv files have been imported using the base R read.csv function, with stringsAsFactors = FALSE so that strings are not automatically converted to factors. There are more character variables than factor variables, so the factors will be converted individually later. There are many empty cells, so na.strings = "" ensures empty cells are read as NA, to make tidying the data easier. To inspect the datasets, head() has been used to view the first 6 rows of each dataset and dim() has been used for a quick glance at the dimensions of the datasets. Pokemon1 has 800 rows (observations) and 15 columns (variables) and Pokemon2 has 1061 rows and 26 columns.

#read data
pokemon1 <- read.csv('pokemon data1.csv', stringsAsFactors = FALSE, na.strings="")
pokemon2 <- read.csv('pokemon data2.csv', stringsAsFactors = FALSE, na.strings="")
#view the first 6 rows of data
head(pokemon1)
head(pokemon2)
dim(pokemon1)
[1] 800  15
dim(pokemon2)
[1] 1061   26

Pre-merge data tidying

The datasets will undergo some initial tidying before they are merged, to ensure both datasets contain the same unique pokemon. Dataset 1 contains data on the first 721 pokemon, so any pokemon with a pokedex (index) number higher than 721 will be dropped from dataset 2.

Tidy dataset 1

#remove the first variable as this is identical to the pokedex number, both are not required
pokemon1 <- pokemon1[ , -(1)]
head(pokemon1)
#CHECK FOR DUPLICATES
any(duplicated(pokemon1$pokedex))
[1] TRUE
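#to start with, remove any rows containing 'Mega ' as these are evolved forms of pokemon and we will only analyse original pokemon
#the trailing space in the pattern is an assumption that Mega forms are named like "VenusaurMega Venusaur", so that names such as Meganium are not matched
#the result is assigned back, otherwise the filtered rows are only printed
pokemon1 <- pokemon1[!grepl("Mega ", pokemon1$Name), ]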
#remove any other duplicates in pokedex
pokemon1 %<>% distinct(pokedex, .keep_all = TRUE)

#CHECK FOR DUPLICATES - now the list is unique
any(duplicated(pokemon1$pokedex))
[1] FALSE

Tidy dataset 2

#drop pokemon with a pokedex number greater than 721 to ensure the datasets contain the same pokemon, and remove the unnecessary id column
pokemon2 %<>% 
  filter(pokedex<722) %>% 
  select(-id)
head(pokemon2)
dim(pokemon2)
[1] 936  25
#remove duplicates
pokemon2 %<>% distinct(pokedex, .keep_all = TRUE)

#CHECK FOR DUPLICATES
any(duplicated(pokemon2$pokedex))
[1] FALSE
dim(pokemon2)
[1] 721  25

Merge the datasets

The datasets have been merged using a full join with pokedex as the primary key. The pokedex acts like an index, with each unique pokemon having a reference on the pokedex. The full join has been used to return all rows and all columns from both the pokemon1 and pokemon2 datasets. If there happens to be no matching value, an NA will be returned for the values that are missing.

#JOIN DATASETS
pokemon_df <- pokemon1 %>% 
  full_join(pokemon2, by = 'pokedex')
dim(pokemon_df)
[1] 721  38

Understand

The str() function shows the data type of each variable. As stringsAsFactors = FALSE was used when reading the data, there are no factors in this list; however, Type should be an ordered factor, as certain types are stronger than others, and the date will also need to be converted. The summary() function shows the range of each variable: where it is numeric, the min, max, median, mean and quartiles 1 and 3 are shown. Where it’s a character variable, the length, class and mode are shown.

str(pokemon_df)
'data.frame':   721 obs. of  38 variables:
 $ pokedex           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Name              : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ Type.1            : chr  "Grass" "Grass" "Grass" "Fire" ...
 $ Type.2            : chr  "Poison" "Poison" "Poison" NA ...
 $ Total             : int  318 405 525 309 405 534 314 405 530 195 ...
 $ HP                : int  45 60 80 39 58 78 44 59 79 45 ...
 $ Attack            : int  49 62 82 52 64 84 48 63 83 30 ...
 $ Defense           : int  49 63 83 43 58 78 65 80 100 35 ...
 $ Sp..Atk           : int  65 80 100 60 80 109 50 65 85 20 ...
 $ Sp..Def           : int  65 80 100 50 65 85 64 80 105 20 ...
 $ Speed             : int  45 60 80 65 80 100 43 58 78 45 ...
 $ Generation        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Legendary         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Date.first.spawned: chr  "12/12/1995" "10/11/2018" "16/11/1987" "30/05/1996" ...
 $ species           : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ forme             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ type1             : chr  "Grass" "Grass" "Grass" "Fire" ...
 $ type2             : chr  "Poison" "Poison" "Poison" NA ...
 $ ability1          : chr  "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
 $ ability2          : chr  NA NA NA NA ...
 $ abilityH          : chr  "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
 $ hp                : int  45 60 80 39 58 78 44 59 79 45 ...
 $ attack            : int  49 62 82 52 64 84 48 63 83 30 ...
 $ defense           : int  49 63 83 43 58 78 65 80 100 35 ...
 $ spattack          : int  65 80 100 60 80 109 50 65 85 20 ...
 $ spdefense         : int  65 80 100 50 65 85 64 80 105 20 ...
 $ speed             : int  45 60 80 65 80 100 43 58 78 45 ...
 $ total             : int  318 405 525 309 405 534 314 405 530 195 ...
 $ weight            : chr  "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
 $ height            : chr  "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
 $ dex1              : chr  NA NA NA NA ...
 $ dex2              : chr  NA NA NA NA ...
 $ class             : chr  "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
 $ percent.male      : num  0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
 $ percent.female    : num  0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
 $ pre.evolution     : chr  NA "Bulbasaur" "Ivysaur" NA ...
 $ egg.group1        : chr  "Monster" "Monster" "Monster" "Monster" ...
 $ egg.group2        : chr  "Grass" "Grass" "Grass" "Dragon" ...
summary(pokemon_df)
    pokedex        Name              Type.1             Type.2              Total             HP        
 Min.   :  1   Length:721         Length:721         Length:721         Min.   :180.0   Min.   :  1.00  
 1st Qu.:181   Class :character   Class :character   Class :character   1st Qu.:320.0   1st Qu.: 50.00  
 Median :361   Mode  :character   Mode  :character   Mode  :character   Median :424.0   Median : 65.00  
 Mean   :361                                                            Mean   :417.9   Mean   : 68.38  
 3rd Qu.:541                                                            3rd Qu.:499.0   3rd Qu.: 80.00  
 Max.   :721                                                            Max.   :720.0   Max.   :255.00  
                                                                                                        
     Attack          Defense         Sp..Atk          Sp..Def           Speed          Generation   
 Min.   :  5.00   Min.   :  5.0   Min.   : 10.00   Min.   : 20.00   Min.   :  5.00   Min.   :1.000  
 1st Qu.: 54.00   1st Qu.: 50.0   1st Qu.: 45.00   1st Qu.: 50.00   1st Qu.: 45.00   1st Qu.:2.000  
 Median : 75.00   Median : 65.0   Median : 65.00   Median : 65.00   Median : 65.00   Median :3.000  
 Mean   : 75.12   Mean   : 70.7   Mean   : 68.85   Mean   : 69.18   Mean   : 65.71   Mean   :3.323  
 3rd Qu.: 95.00   3rd Qu.: 85.0   3rd Qu.: 90.00   3rd Qu.: 85.00   3rd Qu.: 85.00   3rd Qu.:5.000  
 Max.   :165.00   Max.   :230.0   Max.   :154.00   Max.   :230.00   Max.   :160.00   Max.   :6.000  
                                                                                                    
 Legendary       Date.first.spawned   species             forme              type1              type2          
 Mode :logical   Length:721         Length:721         Length:721         Length:721         Length:721        
 FALSE:675       Class :character   Class :character   Class :character   Class :character   Class :character  
 TRUE :46        Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                               
                                                                                                               
                                                                                                               
                                                                                                               
   ability1           ability2           abilityH               hp             attack         defense      
 Length:721         Length:721         Length:721         Min.   :  1.00   Min.   :  5.0   Min.   :  5.00  
 Class :character   Class :character   Class :character   1st Qu.: 50.00   1st Qu.: 53.0   1st Qu.: 50.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 65.00   Median : 75.0   Median : 66.00  
                                                          Mean   : 68.53   Mean   : 75.1   Mean   : 70.96  
                                                          3rd Qu.: 80.00   3rd Qu.: 95.0   3rd Qu.: 85.00  
                                                          Max.   :255.00   Max.   :165.0   Max.   :230.00  
                                                                                                           
    spattack        spdefense         speed            total          weight             height         
 Min.   : 10.00   Min.   : 20.0   Min.   :  5.00   Min.   :180.0   Length:721         Length:721        
 1st Qu.: 45.00   1st Qu.: 50.0   1st Qu.: 45.00   1st Qu.:320.0   Class :character   Class :character  
 Median : 65.00   Median : 65.0   Median : 65.00   Median :425.0   Mode  :character   Mode  :character  
 Mean   : 68.81   Mean   : 69.4   Mean   : 65.78   Mean   :418.6                                        
 3rd Qu.: 90.00   3rd Qu.: 85.0   3rd Qu.: 85.00   3rd Qu.:500.0                                        
 Max.   :154.00   Max.   :230.0   Max.   :160.00   Max.   :720.0                                        
                                                                                                        
     dex1               dex2              class            percent.male   percent.female  pre.evolution     
 Length:721         Length:721         Length:721         Min.   :0.000   Min.   :0.000   Length:721        
 Class :character   Class :character   Class :character   1st Qu.:0.500   1st Qu.:0.500   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Median :0.500   Median :0.500   Mode  :character  
                                                          Mean   :0.558   Mean   :0.442                     
                                                          3rd Qu.:0.500   3rd Qu.:0.500                     
                                                          Max.   :1.000   Max.   :1.000                     
                                                          NA's   :72      NA's   :72                        
  egg.group1         egg.group2       
 Length:721         Length:721        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

apply data conversions

Type 1 has been converted to a factor and then the ordered levels have been applied. Date.first.spawned has been converted from character to date.

#convert type1 to a factor
pokemon_df$type1 <-  as.factor(pokemon_df$type1)
str(pokemon_df$type1)
 Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
#order variables that have a hierarchy
pokemon_df$type1 <-  factor(pokemon_df$type1,
                    levels=c('Dragon',
                             'Fairy', 
                             'Water',
                             'Steel',
                             'Fighting',
                             'Dark',
                             'Flying',
                             'Fire',
                             'Ghost',
                             'Ground',
                             'Normal',
                             'Psychic',
                             'Electric',
                             'Poison',
                             'Grass',
                             'Rock',
                             'Bug',
                             'Ice'),
                    ordered=TRUE)

#convert date from character to date
pokemon_df$Date.first.spawned <- as.Date(pokemon_df$Date.first.spawned)
str(pokemon_df)
'data.frame':   721 obs. of  38 variables:
 $ pokedex           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Name              : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ Type.1            : chr  "Grass" "Grass" "Grass" "Fire" ...
 $ Type.2            : chr  "Poison" "Poison" "Poison" NA ...
 $ Total             : int  318 405 525 309 405 534 314 405 530 195 ...
 $ HP                : int  45 60 80 39 58 78 44 59 79 45 ...
 $ Attack            : int  49 62 82 52 64 84 48 63 83 30 ...
 $ Defense           : int  49 63 83 43 58 78 65 80 100 35 ...
 $ Sp..Atk           : int  65 80 100 60 80 109 50 65 85 20 ...
 $ Sp..Def           : int  65 80 100 50 65 85 64 80 105 20 ...
 $ Speed             : int  45 60 80 65 80 100 43 58 78 45 ...
 $ Generation        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Legendary         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Date.first.spawned: Date, format: "0012-12-19" "0010-11-20" "0016-11-19" "0030-05-19" ...
 $ species           : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ forme             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ type1             : Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
 $ type2             : chr  "Poison" "Poison" "Poison" NA ...
 $ ability1          : chr  "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
 $ ability2          : chr  NA NA NA NA ...
 $ abilityH          : chr  "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
 $ hp                : int  45 60 80 39 58 78 44 59 79 45 ...
 $ attack            : int  49 62 82 52 64 84 48 63 83 30 ...
 $ defense           : int  49 63 83 43 58 78 65 80 100 35 ...
 $ spattack          : int  65 80 100 60 80 109 50 65 85 20 ...
 $ spdefense         : int  65 80 100 50 65 85 64 80 105 20 ...
 $ speed             : int  45 60 80 65 80 100 43 58 78 45 ...
 $ total             : int  318 405 525 309 405 534 314 405 530 195 ...
 $ weight            : chr  "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
 $ height            : chr  "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
 $ dex1              : chr  NA NA NA NA ...
 $ dex2              : chr  NA NA NA NA ...
 $ class             : chr  "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
 $ percent.male      : num  0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
 $ percent.female    : num  0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
 $ pre.evolution     : chr  NA "Bulbasaur" "Ivysaur" NA ...
 $ egg.group1        : chr  "Monster" "Monster" "Monster" "Monster" ...
 $ egg.group2        : chr  "Grass" "Grass" "Grass" "Dragon" ...
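
The Date.first.spawned values printed above (for example "0012-12-19" where the raw string was "12/12/1995") suggest that as.Date() did not read the strings as day/month/year. The sketch below is a minimal illustration, assuming the raw values are in dd/mm/yyyy form as the earlier str() output suggests; it also shows how an ordered factor compares values by level position, using a three-level toy ordering rather than the full 18-level one.

```r
# First four raw values as shown in the earlier str() output
raw_dates <- c("12/12/1995", "10/11/2018", "16/11/1987", "30/05/1996")

# Assumption: the strings are day/month/year, so the format is stated explicitly
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
format(parsed)

# An ordered factor compares values by their position in `levels`
types <- factor(c("Grass", "Fire", "Water"),
                levels = c("Water", "Fire", "Grass"),
                ordered = TRUE)
types[1] > types[2]  # Grass is placed after Fire in this toy ordering
```

Passing the same `format` argument to the as.Date() call on pokemon_df$Date.first.spawned would be one way to apply this, if the dd/mm/yyyy assumption holds for every row.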

Tidy & Manipulate Data I

To check if this data is tidy, comparisons between the similar variables have been made. Firstly, identical() is used to see if the two columns are identical or not; if not, which() is used to see which observations are different. All of the similar variable pairs have been found to differ (in between 2 and 24 observations), so a decision to keep the variables from pokemon2 has been made. Some variables have been renamed to avoid confusion (e.g. the variable ‘class’ could be confused with class()). Forme has been changed to form. Any instances of ‘Pokémon’ have been dropped from the column class/pokeclass.

#compare similar variables
identical(pokemon_df[['Name']],pokemon_df[['species']])
[1] FALSE
pokemon_df[which(pokemon_df$Name != pokemon_df$species), ]

identical(pokemon_df[['Type.1']],pokemon_df[['type1']])
[1] FALSE
pokemon_df[which(pokemon_df$Type.1 != pokemon_df$type1), ]

identical(pokemon_df[['Type.2']],pokemon_df[['type2']])
[1] FALSE
pokemon_df[which(pokemon_df$Type.2 != pokemon_df$type2), ]

identical(pokemon_df[['HP']],pokemon_df[['hp']])
[1] FALSE
pokemon_df[which(pokemon_df$HP != pokemon_df$hp), ]

identical(pokemon_df[['Attack']],pokemon_df[['attack']])
[1] FALSE
pokemon_df[which(pokemon_df$Attack != pokemon_df$attack), ]

identical(pokemon_df[['Defense']],pokemon_df[['defense']])
[1] FALSE
pokemon_df[which(pokemon_df$Defense != pokemon_df$defense), ]

identical(pokemon_df[['Sp..Atk']],pokemon_df[['spattack']])
[1] FALSE
pokemon_df[which(pokemon_df$Sp..Atk != pokemon_df$spattack), ]

identical(pokemon_df[['Sp..Def']],pokemon_df[['spdefense']])
[1] FALSE
pokemon_df[which(pokemon_df$Sp..Def != pokemon_df$spdefense), ]

identical(pokemon_df[['Speed']],pokemon_df[['speed']])
[1] FALSE
pokemon_df[which(pokemon_df$Speed != pokemon_df$speed), ]

identical(pokemon_df[['Total']],pokemon_df[['total']])
[1] FALSE
pokemon_df[which(pokemon_df$Total != pokemon_df$total), ]

#drop unnecessary columns - keep the values from pokemon data2
pokemon_df2 <- select(pokemon_df, -Name, -HP, -Attack, -Defense, -Sp..Atk, -Sp..Def, -Speed, -Total, -dex1, -dex2, -Type.1, -Type.2)

#rename class to avoid confusion with class()
pokemon_df2 %<>% 
  rename(pokeclass = class)

#rename 'forme' to form
pokemon_df2 %<>% 
  rename(form = forme)

#remove the word 'Pokémon' from the class column
pokemon_df2$pokeclass %<>% str_replace("Pokémon","")
head(pokemon_df2)
dim(pokemon_df2)
[1] 721  26
str(pokemon_df2)
'data.frame':   721 obs. of  26 variables:
 $ pokedex           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Generation        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Legendary         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Date.first.spawned: Date, format: "0012-12-19" "0010-11-20" "0016-11-19" "0030-05-19" ...
 $ species           : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ form              : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
 $ type1             : Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
 $ type2             : chr  "Poison" "Poison" "Poison" NA ...
 $ ability1          : chr  "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
 $ ability2          : chr  NA NA NA NA ...
 $ abilityH          : chr  "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
 $ hp                : int  45 60 80 39 58 78 44 59 79 45 ...
 $ attack            : int  49 62 82 52 64 84 48 63 83 30 ...
 $ defense           : int  49 63 83 43 58 78 65 80 100 35 ...
 $ spattack          : int  65 80 100 60 80 109 50 65 85 20 ...
 $ spdefense         : int  65 80 100 50 65 85 64 80 105 20 ...
 $ speed             : int  45 60 80 65 80 100 43 58 78 45 ...
 $ total             : int  318 405 525 309 405 534 314 405 530 195 ...
 $ weight            : chr  "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
 $ height            : chr  "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
 $ pokeclass         : chr  "Seed " "Seed " "Seed " "Lizard " ...
 $ percent.male      : num  0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
 $ percent.female    : num  0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
 $ pre.evolution     : chr  NA "Bulbasaur" "Ivysaur" NA ...
 $ egg.group1        : chr  "Monster" "Monster" "Monster" "Monster" ...
 $ egg.group2        : chr  "Grass" "Grass" "Grass" "Dragon" ...
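
The column comparisons in this section rely on `!=` inside which(). One thing to keep in mind is that `!=` returns NA when either side is NA, and which() skips those positions, so a mismatch where only one column is missing would not be listed. identical() will also return FALSE whenever the two columns have different types (for example character versus ordered factor), even if the labels agree. A minimal sketch on invented vectors:

```r
a <- c("Grass", "Fire", NA, "Water")
b <- c("Grass", "Fairy", "Poison", "Water")

# `!=` gives NA at position 3, and which() silently skips it
which(a != b)

# Treat "one side NA, other side not" as a mismatch too
mismatch <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
which(mismatch)
```

Whether this matters for the pokemon columns depends on where their NAs fall, which is not shown in the output above.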

Tidy & Manipulate Data II

A new column called ‘totalattack’ has been created using mutate(), adding together attack and spattack for each pokemon.

#create total attack points column
mutate(pokemon_df2, totalattack = attack+spattack) 
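
As written, the mutate() call above prints its result without storing it, so totalattack would not be kept in pokemon_df2 (the later column listings show no such column). The sketch below illustrates the difference using base R's transform() as a stand-in for mutate(), on the first two rows shown in the str() output; with dplyr, the equivalent would be assigning the mutate() result back to pokemon_df2.

```r
stats <- data.frame(attack = c(49, 62), spattack = c(65, 80))

# Without assignment the new column exists only in the printed result
transform(stats, totalattack = attack + spattack)
ncol(stats)  # still 2

# Assigning the result keeps the new column
stats <- transform(stats, totalattack = attack + spattack)
stats$totalattack
```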

Scan I

colSums() has been used to see how many missing values are in each column. At this stage, 7 columns have missing values. The percentage of female and male variables have been imputed with the mean, using impute(). The variables type2, ability2, pre.evolution and egg.group2 have blank values in some observations, but this is not missing data, as not all pokemon have these attributes. Where there is an NA in these observations, the NA will be replaced with a string to make it clear that the attribute is not applicable rather than missing. AbilityH will be replaced with the mode, i.e. the most commonly appearing abilityH.

#Look for missing values
colSums(is.na(pokemon_df2))
           pokedex         Generation          Legendary Date.first.spawned            species 
                 0                  0                  0                  0                  0 
              form              type1              type2           ability1           ability2 
                 0                  0                372                  0                310 
          abilityH                 hp             attack            defense           spattack 
                83                  0                  0                  0                  0 
         spdefense              speed              total             weight             height 
                 0                  0                  0                  0                  0 
         pokeclass       percent.male     percent.female      pre.evolution         egg.group1 
                 0                 72                 72                365                  0 
        egg.group2 
               530 
#impute missing values with mean
pokemon_df2$percent.male %<>% impute(fun = mean)
pokemon_df2$percent.female %<>% impute(fun = mean)

#type 2, ability 2, pre.evolution and egg group 2 all have many observations missing, however this is not missing data: these pokemon do not have these attributes and these columns should not be dropped. Cells will be filled with a new string
pokemon_df2$type2 %<>% replace_na("no type 2")
pokemon_df2$ability2 %<>% replace_na("no ability 2")
pokemon_df2$pre.evolution %<>% replace_na("no pre-evolution")
pokemon_df2$egg.group2 %<>% replace_na("no egg group 2")

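#ability H will be replaced with the mode (most common value); base R's mode() returns the storage mode rather than the most frequent value, so the most common abilityH is found with table()
abilityH_mode <- names(which.max(table(pokemon_df2$abilityH)))
pokemon_df2$abilityH %<>% replace_na(abilityH_mode)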

#Look again for missing values
colSums(is.na(pokemon_df2))
           pokedex         Generation          Legendary Date.first.spawned            species 
                 0                  0                  0                  0                  0 
              form              type1              type2           ability1           ability2 
                 0                  0                  0                  0                  0 
          abilityH                 hp             attack            defense           spattack 
                 0                  0                  0                  0                  0 
         spdefense              speed              total             weight             height 
                 0                  0                  0                  0                  0 
         pokeclass       percent.male     percent.female      pre.evolution         egg.group1 
                 0                  0                  0                  0                  0 
        egg.group2 
                 0 
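
To make the two imputation ideas in this section concrete, here is a minimal base R sketch on invented vectors. It fills a numeric NA with the mean of the observed values and a character NA with the most frequent value. The helper `most_common()` is hypothetical and written only for this example; it is not part of Hmisc or tidyr.

```r
percent_male <- c(0.875, 0.5, NA, 0.5)
ability <- c("Chlorophyll", "Solar Power", NA, "Chlorophyll")

# Mean imputation: fill NA with the mean of the observed values
percent_male[is.na(percent_male)] <- mean(percent_male, na.rm = TRUE)

# base::mode() reports the storage mode, not the most frequent value
mode(ability)

# Hypothetical helper: most frequent non-missing value (table() drops NA by default)
most_common <- function(x) names(which.max(table(x)))
ability[is.na(ability)] <- most_common(ability)
```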

Scan II

As there are many numeric variables, a multivariate approach to outlier detection has been chosen. Firstly, a subset of only the numeric variables is made. As all missing values have been dealt with, the additional step of removing missing values is not needed. MVN is used to find the multivariate outliers and return their observation numbers. The data is then subset again to exclude the observations that contain the outliers.

# Subset the dataset to relevant numeric variables
pokemon_sub <- pokemon_df2 %>% select(hp, attack,defense, spattack, spdefense, speed, total) 
dim(pokemon_sub)
[1] 721   7
# Find and exclude the multivariate outliers 
pokeresults <- pokemon_sub %>% 
  MVN::mvn(multivariateOutlierMethod = "quan", 
           showOutliers = TRUE)
The covariance matrix has become singular during
the iterations of the MCD algorithm.
There are 721 observations (in the entire dataset of 721 obs.) lying on the hyperplane with equation
a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the mean of these observations
and coefficients a_i from the vector a <- c(-0.3779645, -0.3779645, -0.3779645, -0.3779645,
-0.3779645, -0.3779645, 0.3779645)

pokeresults$multivariateOutliers
#returns the observation numbers of the outliers and then converts from character to numeric
pokemon_outliers <- pokemon_sub[c(as.numeric(pokeresults$multivariateOutliers[["Observation"]])), ]

# subset the dataset by excluding the rows with the specified observation numbers
pokemon_clean <- pokemon_sub[-c(as.numeric(pokeresults$multivariateOutliers[["Observation"]])), ]
dim(pokemon_clean)
[1] 713   7
dim(pokemon_outliers)
[1] 8 7
dim(pokemon_sub)
[1] 721   7
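
The message printed by mvn() above reports that all 721 observations lie on a hyperplane, with coefficients of equal size for the first six variables and the opposite sign for the seventh. That pattern is what would be expected if total is the exact sum of the other six stats, which makes the covariance matrix singular. The sketch below reproduces the effect on a small invented matrix, using the rank reported by qr() as a simple check.

```r
# Invented stat values: 8 rows, 6 columns (not taken from the pokemon data)
stats <- rbind(diag(6) * 10, matrix(0, nrow = 2, ncol = 6)) + 50

# Add a seventh column that is the exact sum of the other six
with_total <- cbind(stats, total = rowSums(stats))

# The 7-column covariance matrix has rank 6, i.e. it is singular
qr(cov(with_total))$rank
qr(cov(stats))$rank
```

One option might be to leave total out of the subset passed to mvn(); whether that changes which observations are flagged is not shown here.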

Transform

The histogram of hp is strongly right-skewed. A Box-Cox transformation has been applied, where the lambda is automatically found and applied. After the transformation, the shape of the histogram is closer to normal, however there is now a slight left skew.

#create a histogram
hist(pokemon_df2$hp)

#Apply Box-Cox transformation 
boxcox_hp <- BoxCox(pokemon_df2$hp, lambda = "auto")
attr(boxcox_hp, which = "lambda")
[1] 0.1990125
hist(boxcox_hp)
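
For reference, the standard one-parameter Box-Cox formula can be written out directly. The function below is a hand-written sketch of that formula for positive values, not the forecast package's implementation, which may differ in details such as the handling of negative lambda. The hp values are the first few from the str() output plus the maximum from summary(), and the lambda is the one reported above.

```r
# Box-Cox transform for positive x: (x^lambda - 1) / lambda, or log(x) when lambda is 0
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

hp <- c(45, 60, 80, 39, 255)
box_cox(hp, lambda = 0.1990125)
```

With a lambda between 0 and 1, large values are pulled in more strongly than small ones, which is why a long right tail shortens after the transformation.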

References

Biostars. (2020, 02 01). Question: Find mismatch in two columns in a data frame in R. Retrieved from Biostars; bioinformatics explained: https://www.biostars.org/p/180451/

DataNovia. (2021, 01 02). Identify and Remove Duplicate Data in R . Retrieved from DataNovia: https://www.datanovia.com/en/lessons/identify-and-remove-duplicate-data-in-r/

Fontes, R. (2021, 02 12). Pokémon: Every Elemental Type, Officially Ranked. Retrieved from Thegamer: https://www.thegamer.com/pokemon-elemental-types-ranked-officially/

intelliPaat. (2021, 02 11). Change the Blank Cells to “NA”. Retrieved from intellipaat: https://intellipaat.com/community/27103/change-the-blank-cells-to-na

Wickham, H. (2021, 02 11). Replace NAs with specified values. Retrieved from tidyr: https://tidyr.tidyverse.org/reference/replace_na.html

---
title: "Data Wrangling Assessment Task 3: Dataset challenge"
author: "Erin Newington | s3884614"
subtitle: 
output:
  html_notebook: default
---

RPubs link: https://rpubs.com/enewington/722829

Presentation link: https://www.loom.com/share/5958c725f69940ffaf8511193808776d

## Required packages 

```{r}
# Load Packages 
library(readr) 
library(dplyr)
library(tidyverse)
library(outliers)
library(MVN)
library(magrittr) #pipe operator
library(forecast) # For finding lambda for a Box-Cox transformation 
library(Hmisc) #impute missing
```

## Data 
The datasets have been obtained from:

https://data.world/data-society/pokemon-with-stats

https://www.kaggle.com/mylesoneill/pokemon-sun-and-moon-gen-7-stats?select=type-chart.csv
*NOTE: the dataset 2 source seems to have changed since the data was originally downloaded*

The two datasets include statistics on 898 Pokemon. The variable descriptions are as follows:

**Pokedex:** The unique ID for each pokemon

**Name:** The name of each pokemon

**Type 1:** Determines weakness/resistance to attacks

**Type 2:** Some pokemon are dual type and have a second type

**Total:** Sum of all stats that come after this, a general guide to how strong a pokemon is

**HP:** Hit points, or health, defines how much damage a pokemon can withstand before fainting

**Attack:** The base modifier for normal attacks (e.g. Scratch, Punch)

**Defense:** The base damage resistance against normal attacks

**SP Atk:** Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)

**SP Def:** Special defense, the base damage resistance against special attacks

**Speed:** Determines which pokemon attacks first each round

**Generation:** The generation of games where the pokemon was first introduced

**Legendary:** Some pokemon are much rarer than others, and are dubbed "legendary"

**Species:** as per name

**Forme:** as per name

**Ability 1**: an effect on a Pokémon that is not an attack. Some will be active all of the time, while some you will need to choose to use.

**Ability 2:** Some Pokemon have a second ability

**Total:** total of HP, attack, defense, spattack, spdefense and speed

**Weight:** weight of the pokemon

**Height:** height of the pokemon

**dex 1:** unknown: dex is usually short for 'pokedex'. the contents of this variable are not understood

**dex 2:** as above

**class:** the species category of the pokemon as listed in the pokedex (e.g. 'Seed Pokémon', 'Lizard Pokémon')

**percent-male:** percentage of male pokemon of this species/name

**percent-female:** percentage of female pokemon of this species/name

**egg-group:** categories which determine which Pokémon are able to interbreed. A Pokémon may belong to either one or two Egg Groups. 

**egg-group2:** as above

**Date.first.spawned:** date the pokemon was first discovered

### Read data
The csv files have been imported using the base R *read.csv* function, with *stringsAsFactors = FALSE* so that strings are not automatically converted to factors. There are more character variables than factor variables, so the factors will be converted individually later. There are many empty cells, so *na.strings = ""* ensures empty cells are read as NA, to make tidying the data easier. To inspect the datasets, *head()* has been used to view the first 6 rows of each dataset and *dim()* has been used for a quick glance at the dimensions of the datasets. Pokemon1 has 800 rows (observations) and 15 columns (variables) and Pokemon2 has 1061 rows and 26 columns.
```{r}
#read data
pokemon1 <- read.csv('pokemon data1.csv', stringsAsFactors = FALSE, na.strings="")
pokemon2 <- read.csv('pokemon data2.csv', stringsAsFactors = FALSE, na.strings="")
#view the first 6 rows of data
head(pokemon1)
head(pokemon2)
dim(pokemon1)
dim(pokemon2)
```
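
To illustrate what the two arguments do, the following minimal sketch reads a small inline CSV instead of the real files. The values are made up for the example and are not taken from either dataset.

```r
# Toy CSV with empty cells in the last column (hypothetical values)
csv_text <- "pokedex,name,type2
1,Bulbasaur,Poison
4,Charmander,
7,Squirtle,"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")

# Strings stay as character vectors rather than factors
class(toy$name)

# Empty cells are read as NA
is.na(toy$type2)
```

Without *na.strings = ""*, the empty cells in a character column would be read as empty strings, which *is.na()* does not detect.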

### Pre-merge data tidying
The datasets will undergo some initial tidying before they are merged, to ensure both datasets contain the same unique pokemon. Dataset 1 contains data on the first 721 pokemon, so any pokemon with a pokedex (index) number higher than 721 will be dropped from dataset 2.

### Tidy dataset 1

```{r TIDY POKEMON 1}
#remove the first variable as it is identical to the pokedex number, so both are not required
pokemon1 <- pokemon1[ , -1]
head(pokemon1)
#CHECK FOR DUPLICATES
any(duplicated(pokemon1$pokedex))
#remove rows for 'Mega' forms, as these are evolved forms and only original pokemon are analysed
#the result must be assigned back; "Mega " (trailing space) is used so that Meganium is not matched,
#assuming mega forms are named like "VenusaurMega Venusaur"
pokemon1 <- pokemon1[!grepl("Mega ", pokemon1$Name), ]
#remove any other duplicates in pokedex
pokemon1 %<>% distinct(pokedex, .keep_all = TRUE)

#CHECK FOR DUPLICATES - now the list is unique
any(duplicated(pokemon1$pokedex))
```
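
The two steps used here, dropping rows by a name pattern and keeping one row per key, can be sketched in base R on a toy data frame. The names below are illustrative only. Note that a bare "Mega" pattern would also match Meganium, so the sketch uses "Mega " with a trailing space, which assumes mega forms are labelled like "VenusaurMega Venusaur".

```r
toy <- data.frame(
  pokedex = c(3, 3, 154, 386, 386),
  Name = c("Venusaur", "VenusaurMega Venusaur", "Meganium",
           "DeoxysNormal Forme", "DeoxysAttack Forme"),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector, so negating it is safe even with no matches
no_mega <- toy[!grepl("Mega ", toy$Name), ]

# Keep the first row for each pokedex number
# (base R analogue of dplyr::distinct(pokedex, .keep_all = TRUE))
unique_pokemon <- no_mega[!duplicated(no_mega$pokedex), ]

unique_pokemon$Name
any(duplicated(unique_pokemon$pokedex))
```
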
### Tidy dataset 2
```{r TIDY POKEMON 2}
#drop pokemon with a pokedex number greater than 721 so the datasets contain the same pokemon, and remove the unnecessary id column
pokemon2 %<>% 
  filter(pokedex<722) %>% 
  select(-id)
head(pokemon2)
dim(pokemon2)
#remove duplicates
pokemon2 %<>% distinct(pokedex, .keep_all = TRUE)

#CHECK FOR DUPLICATES
any(duplicated(pokemon2$pokedex))
dim(pokemon2)
```
## Merge the datasets
The datasets have been merged using a full join with pokedex as the key. The pokedex acts like an index, with each unique pokemon having one reference number. The full join returns all rows and all columns from both the pokemon1 and pokemon2 datasets. If a pokedex number has no match in one dataset, NA is returned in the columns that come from that dataset.
```{r MERGE}
#JOIN DATASETS
pokemon_df <- pokemon1 %>% 
  full_join(pokemon2, by = 'pokedex')
dim(pokemon_df)
```
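
As a small illustration of how a full join treats unmatched keys, the sketch below joins two toy data frames. The column names *left*, *right* and *weight* are placeholders for the example, not the real data.

```r
library(dplyr)

left  <- data.frame(pokedex = c(1, 2, 3),
                    Name = c("Bulbasaur", "Ivysaur", "Venusaur"))
right <- data.frame(pokedex = c(2, 3, 4),
                    weight = c(13.0, 100.0, 8.5))

joined <- full_join(left, right, by = "pokedex")
joined

# Rows present in only one input have NA in the other input's columns
colSums(is.na(joined))

# anti_join() lists the keys in `left` that have no match in `right`
anti_join(left, right, by = "pokedex")
```

Checking the unmatched keys with *anti_join()* before merging is one way to confirm that the pre-merge tidying left both datasets with the same pokemon.
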
## Understand 

The *str()* function shows the type of each variable. As *stringsAsFactors = FALSE* was used when reading the data, there are no factors in this list. However, type should be an ordered factor, as certain types are stronger than others, and the date will also need to be converted. The *summary()* function shows, for each numeric variable, the min, max, median, mean, and quartiles 1 and 3. For each character variable, the length, class and mode are shown.

```{r}
str(pokemon_df)
summary(pokemon_df)
```
## Apply data conversions
Type 1 has been converted to a factor and then the ordered levels have been applied. Date.first.spawned has been converted from character to date. 
```{r}
#convert type1 from character to factor
pokemon_df$type1 <-  as.factor(pokemon_df$type1)
str(pokemon_df$type1)

#order variables that have a hierarchy
pokemon_df$type1 <-  factor(pokemon_df$type1,
                    levels=c('Dragon',
                             'Fairy', 
                             'Water',
                             'Steel',
                             'Fighting',
                             'Dark',
                             'Flying',
                             'Fire',
                             'Ghost',
                             'Ground',
                             'Normal',
                             'Psychic',
                             'Electric',
                             'Poison',
                             'Grass',
                             'Rock',
                             'Bug',
                             'Ice'),
                    ordered=TRUE)

#convert date from character to date
pokemon_df$Date.first.spawned <- as.Date(pokemon_df$Date.first.spawned)
str(pokemon_df)


```
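
The behaviour of ordered factors and date conversion can be sketched on a few toy values. One point worth noting is that R treats the first entry of *levels* as the lowest level, so the direction of the ranking depends on the order in which the levels are listed. The second date format below is hypothetical and only shows how a non-ISO format would be handled.

```r
types <- c("Water", "Dragon", "Ice", "Water")

# With ordered = TRUE the first level is the lowest and the last is the highest
ranked <- factor(types, levels = c("Dragon", "Water", "Ice"), ordered = TRUE)

ranked
ranked < "Water"   # TRUE only for levels listed before "Water"

# Values not listed in `levels` silently become NA, so spelling must match
factor("water", levels = c("Dragon", "Water", "Ice"), ordered = TRUE)

# as.Date() expects ISO dates (YYYY-MM-DD) unless a format is supplied
as.Date("2021-02-11")
as.Date("11/02/2021", format = "%d/%m/%Y")
```
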
##	Tidy & Manipulate Data I 

To check whether the data is tidy, similar variables from the two datasets have been compared. First, *identical()* is used to test whether two columns are identical; if they are not, *which()* is used to find the observations that differ. The similar variables differ in between 2 and 24 observations, so a decision has been made to keep the variables from pokemon2.
Some variables have been renamed to avoid confusion (e.g. the variable 'class' could be confused with *class()*).
Forme has been changed to form.
Any instances of 'PokÃ©mon' (the mis-encoded text 'Pokémon') have been dropped from the column class/pokeclass.

```{r}
#compare similar variables
identical(pokemon_df[['Name']],pokemon_df[['species']])
pokemon_df[which(pokemon_df$Name != pokemon_df$species), ]

#type1 is now an ordered factor, so it is compared as character
identical(pokemon_df[['Type.1']],as.character(pokemon_df[['type1']]))
pokemon_df[which(pokemon_df$Type.1 != pokemon_df$type1), ]

identical(pokemon_df[['Type.2']],pokemon_df[['type2']])
pokemon_df[which(pokemon_df$Type.2 != pokemon_df$type2), ]

identical(pokemon_df[['HP']],pokemon_df[['hp']])
pokemon_df[which(pokemon_df$HP != pokemon_df$hp), ]

identical(pokemon_df[['Attack']],pokemon_df[['attack']])
pokemon_df[which(pokemon_df$Attack != pokemon_df$attack), ]

identical(pokemon_df[['Defense']],pokemon_df[['defense']])
pokemon_df[which(pokemon_df$Defense != pokemon_df$defense), ]

identical(pokemon_df[['Sp..Atk']],pokemon_df[['spattack']])
pokemon_df[which(pokemon_df$Sp..Atk != pokemon_df$spattack), ]

identical(pokemon_df[['Sp..Def']],pokemon_df[['spdefense']])
pokemon_df[which(pokemon_df$Sp..Def != pokemon_df$spdefense), ]

identical(pokemon_df[['Speed']],pokemon_df[['speed']])
pokemon_df[which(pokemon_df$Speed != pokemon_df$speed), ]

identical(pokemon_df[['Total']],pokemon_df[['total']])
pokemon_df[which(pokemon_df$Total != pokemon_df$total), ]

#drop unnecessary columns - keep the values from pokemon data2
pokemon_df2 <- select(pokemon_df, -Name, -HP, -Attack, -Defense, -Sp..Atk, -Sp..Def, -Speed, -Total, -dex1, -dex2, -Type.1, -Type.2)

#rename class to avoid confusion with class()
pokemon_df2 %<>% 
  rename(pokeclass = class)

#rename 'forme' to form
pokemon_df2 %<>% 
  rename(form = forme)

#remove special characters from class column
pokemon_df2$pokeclass %<>% str_replace("PokÃ©mon","")
head(pokemon_df2)
dim(pokemon_df2)
str(pokemon_df2)
```
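
The comparison pattern used above can be sketched on a toy pair of columns. The values are invented. The sketch also shows that comparisons involving NA are dropped by *which()*, so rows that are missing on only one side need a separate check.

```r
toy <- data.frame(
  HP = c(45, 60, 80, NA),
  hp = c(45, 62, 80, 50)
)

# identical() gives a single TRUE/FALSE for the whole column pair
identical(toy$HP, toy$hp)

# which() returns the row numbers where the values differ;
# comparisons involving NA are dropped, so row 4 is not reported
mismatch_rows <- which(toy$HP != toy$hp)
toy[mismatch_rows, ]

# Rows where exactly one side is missing can be found separately
one_side_missing <- which(xor(is.na(toy$HP), is.na(toy$hp)))
one_side_missing
```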

##	Tidy & Manipulate Data II 
A new column called 'totalattack' has been created using *mutate()*, adding together attack and spattack for each pokemon.
```{r}
#create total attack points column
pokemon_df2 %<>% mutate(totalattack = attack + spattack)
head(pokemon_df2)
```
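
Because *mutate()* returns a modified copy rather than changing its input, the new column only persists if the result is assigned. A minimal sketch on toy values:

```r
library(dplyr)

toy <- data.frame(attack = c(49, 62), spattack = c(65, 80))

# mutate() returns a modified copy; the input is unchanged unless reassigned
mutate(toy, totalattack = attack + spattack)
"totalattack" %in% names(toy)   # still FALSE at this point

toy <- toy %>% mutate(totalattack = attack + spattack)
toy$totalattack
```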

##	Scan I 

*colSums()* has been used with *is.na()* to count the missing values in each column. At this stage, 7 columns have missing values. The percentage of male and female variables have been imputed with the mean, using *impute()*.
The variables type2, ability2, pre.evolution and egg.group2 have blank values in some observations, but this is not missing data, as not all pokemon have these attributes. Where there is an NA in these variables, the NA will be replaced with a string to make it clear that the attribute is not applicable rather than missing.
AbilityH will be replaced with the mode, i.e. the most commonly appearing abilityH.
```{r}
#Look for missing values
colSums(is.na(pokemon_df2))

#impute missing values with mean (the piped column is already the first argument of impute())
pokemon_df2$percent.male %<>% impute(fun = mean)
pokemon_df2$percent.female %<>% impute(fun = mean)

#type 2, ability 2, pre.evolution and egg group 2 all have over half the observations missing. However, this is not missing data: these pokemon do not have these attributes, and the columns should not be dropped. The NA cells will be filled with a new string
pokemon_df2$type2 %<>% replace_na("no type 2")
pokemon_df2$ability2 %<>% replace_na("no ability 2")
pokemon_df2$pre.evolution %<>% replace_na("no pre-evolution")
pokemon_df2$egg.group2 %<>% replace_na("no egg group 2")

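#ability H will be replaced with the mode (most frequent value)
#base R's mode() returns the storage mode, not the most frequent value, so table() is used instead
abilityH_mode <- names(which.max(table(pokemon_df2$abilityH)))
pokemon_df2$abilityH %<>% replace_na(abilityH_mode)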

#Look again for missing values
colSums(is.na(pokemon_df2))
```
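
The two imputation strategies can be sketched in base R on toy vectors. The helper *stat_mode()* is a hypothetical name introduced for this example. It is needed because base R's *mode()* reports the storage mode of an object rather than its most frequent value.

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(x) {
  counts <- table(x)            # table() ignores NA by default
  names(counts)[which.max(counts)]
}

percent_male <- c(50, 87.5, NA, 50, 0)
ability <- c("Overgrow", "Blaze", NA, "Blaze")

# Mean imputation for a numeric variable
percent_male[is.na(percent_male)] <- mean(percent_male, na.rm = TRUE)

# Mode imputation for a character variable
ability[is.na(ability)] <- stat_mode(ability)

percent_male
ability
mode(ability)   # base R's mode() reports the storage mode, "character"
```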


##	Scan II
As there are many numeric variables, a multivariate approach to outlier detection has been chosen. First, a subset of only the numeric variables is made. As all missing values have already been dealt with, the additional step of removing missing values is not needed. *MVN* is used to find the multivariate outliers and return their observation numbers. The data is then subset again to exclude the observations that contain the outliers.
```{r}
# Subset the dataset to relevant numeric variables
pokemon_sub <- pokemon_df2 %>% select(hp, attack,defense, spattack, spdefense, speed, total) 
dim(pokemon_sub)
# Find and exclude the multivariate outliers 
pokeresults <- pokemon_sub %>% 
  MVN::mvn(multivariateOutlierMethod = "quan", 
           showOutliers = TRUE)
pokeresults$multivariateOutliers
#get the observation numbers of the outliers and convert them to numeric
outlier_rows <- as.numeric(as.character(pokeresults$multivariateOutliers[["Observation"]]))
pokemon_outliers <- pokemon_sub[outlier_rows, ]

#subset the dataset by excluding the rows at the outlier observation numbers
pokemon_clean <- pokemon_sub[-outlier_rows, ]
dim(pokemon_clean)
dim(pokemon_outliers)
dim(pokemon_sub)
```
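
The idea behind distance-based multivariate outlier detection can be sketched in base R. This is a plain, non-robust Mahalanobis version on simulated data, shown only to illustrate the principle; it is not a reproduction of what *MVN::mvn()* computes internally, and the 97.5% cutoff is an assumption made for the example.

```r
set.seed(1)
# Simulated stats (not the Pokemon data) with one planted extreme row
x <- cbind(attack = rnorm(50, 80, 10), defense = rnorm(50, 70, 10))
x <- rbind(x, c(190, 230))

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows beyond the 97.5% chi-square quantile (df = number of variables)
cutoff <- qchisq(0.975, df = ncol(x))
outlier_rows <- which(d2 > cutoff)

outlier_rows
x_clean <- x[-outlier_rows, , drop = FALSE]
dim(x_clean)
```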

##	Transform 
The histogram of hp is strongly right-skewed. A Box-Cox transformation has been applied, where lambda is found automatically and then applied. After the transformation, the shape of the histogram is closer to normal, however there is now a slight left skew.
```{r}
#create a histogram
hist(pokemon_df2$hp)
#Apply Box-Cox transformation 
boxcox_hp <- BoxCox(pokemon_df2$hp, lambda = "auto")
attr(boxcox_hp, which = "lambda")
hist(boxcox_hp)
```
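
For reference, the Box-Cox transformation for positive values can be written as a short base R function. The function name *box_cox()* and the hp values below are illustrative only; the lambda values are chosen by hand here rather than estimated.

```r
# Box-Cox transform for positive x:
#   (x^lambda - 1) / lambda  when lambda != 0,  log(x) when lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

hp <- c(20, 45, 60, 80, 255)   # illustrative values only

box_cox(hp, 1)      # lambda = 1 only shifts the data: hp - 1
box_cox(hp, 0.5)    # compresses large values, similar to a square root
box_cox(hp, 0)      # log transform
```

Values of lambda below 1 pull in the long right tail, which is why the transformation can reduce right skew.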

## References

Biostars. (2020, 02 01). Question: Find mismatch in two columns in a data frame in R. Retrieved from Biostars; bioinformatics explained: https://www.biostars.org/p/180451/

DataNovia. (2021, 01 02). Identify and Remove Duplicate Data in R. Retrieved from DataNovia: https://www.datanovia.com/en/lessons/identify-and-remove-duplicate-data-in-r/

Fontes, R. (2021, 02 12). Pokémon: Every Elemental Type, Officially Ranked. Retrieved from Thegamer: https://www.thegamer.com/pokemon-elemental-types-ranked-officially/

intelliPaat. (2021, 02 11). Change the Blank Cells to “NA”. Retrieved from intellipaat: https://intellipaat.com/community/27103/change-the-blank-cells-to-na

Wickham, H. (2021, 02 11). Replace NAs with specified values. Retrieved from tidyr: https://tidyr.tidyverse.org/reference/replace_na.html

