RPubs link: https://rpubs.com/enewington/722829
Presentation link: https://www.loom.com/share/5958c725f69940ffaf8511193808776d
# Load Packages
library(readr)
library(dplyr)
library(tidyverse)
library(outliers)
library(MVN)
library(magrittr) #pipe operator
library(forecast) # For finding lambda for a Box-Cox transformation
library(Hmisc) #impute missing
The datasets have been obtained from:
https://data.world/data-society/pokemon-with-stats
https://www.kaggle.com/mylesoneill/pokemon-sun-and-moon-gen-7-stats?select=type-chart.csv Note: the source of dataset 2 appears to have changed since the data was originally downloaded.
The two datasets include statistics on 898 Pokemon. The variable descriptions are as follows:
Pokedex: The unique ID for each pokemon
Name: The name of each pokemon
Type 1: Determines weakness/resistance to attacks
Type 2: Some pokemon are dual type and have a second type
Total: Sum of all stats that come after this, a general guide to how strong a pokemon is
HP: Hit points, or health, defines how much damage a pokemon can withstand before fainting
Attack: The base modifier for normal attacks (eg. Scratch, Punch)
Defense: The base damage resistance against normal attacks
SP Atk: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
SP Def: Special defense, the base damage resistance against special attacks
Speed: Determines which pokemon attacks first each round
Generation: The generation of games where the pokemon was first introduced
Legendary: Some pokemon are much rarer than others, and are dubbed “legendary”
Species: as per name
Forme: as per name
Ability 1: an effect on a Pokémon that is not an attack. Some will be active all of the time, while some you will need to choose to use.
Ability 2: Some Pokemon have a second ability
Total: total of HP,attack, defense, spattack, spdefense and speed
Weight: weight of the pokemon
Height: height of the pokemon
dex 1: unknown: dex is usually short for ‘pokedex’. the contents of this variable are not understood
dex 2: as above
class: the Pokédex category of the pokemon as it appears in the data (e.g. ‘Seed Pokémon’, ‘Lizard Pokémon’). This is separate from the 18 Pokémon types, which determine how effective a move is against the target Pokémon.
percent-male: percentage of male pokemon of this species/name
percent-female: percentage of female pokemon of this species/name
egg-group: categories which determine which Pokémon are able to interbreed. A Pokémon may belong to either one or two Egg Groups.
egg-group2: as above
Date.first.spawned: date the pokemon was first discovered
The csv files have been imported using the base R read.csv() function, with stringsAsFactors = FALSE so that strings are not automatically converted to factors. There are more character variables than factor variables, so the factors will be converted manually later. There are many empty cells, so na.strings = "" ensures empty cells are read as NA, which makes tidying the data easier. To inspect the datasets, head() has been used to view the first 6 rows of each dataset and dim() has been used for a quick glance at their dimensions. pokemon1 has 800 rows (observations) and 15 columns (variables) and pokemon2 has 1061 rows and 26 columns.
#read data
pokemon1 <- read.csv('pokemon data1.csv', stringsAsFactors = FALSE, na.strings="")
pokemon2 <- read.csv('pokemon data2.csv', stringsAsFactors = FALSE, na.strings="")
#view the first 6 rows of data
head(pokemon1)
head(pokemon2)
dim(pokemon1)
[1] 800 15
dim(pokemon2)
[1] 1061 26
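As a small illustration of the import options described above, the sketch below reads a made-up three-row CSV from an inline string. The column names and values are invented for the example and are not taken from the Pokemon files.

```r
# toy CSV text; the second data row has an empty type2 cell
csv_text <- "name,type1,type2\nA,Grass,Poison\nB,Fire,\nC,Water,Ice"

# na.strings = "" reads empty cells as NA;
# stringsAsFactors = FALSE keeps text columns as character
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")

is.na(toy$type2)   # flags the empty cell
class(toy$type1)   # character rather than factor
dim(toy)           # rows, then columns
```

The same two arguments behave identically when a file path is given instead of text.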
The datasets will undergo some initial tidying before they are merged, to ensure both datasets contain the same unique pokemon. Dataset 1 contains data on the first 721 pokemon, so any pokemon in dataset 2 with a pokedex (index) number higher than 721 will be dropped.
#remove the first variable as it is identical to the pokedex number, so both are not required
pokemon1 <- pokemon1[ , -(1)]
head(pokemon1)
#CHECK FOR DUPLICATES
any(duplicated(pokemon1$pokedex))
[1] TRUE
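#view the data without rows containing 'Mega', as these are alternate forms and only original pokemon will be analysed
#note: the result is not assigned, so pokemon1 is unchanged here; the pattern 'Mega' would also match the ordinary pokemon Meganium
pokemon1[- grep("Mega", pokemon1$Name),]
#keep the first row for each pokedex number, which removes the duplicates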
pokemon1 %<>% distinct(pokedex, .keep_all = TRUE)
#CHECK FOR DUPLICATES - now the list is unique
any(duplicated(pokemon1$pokedex))
[1] FALSE
#drop pokemon with a pokedex number greater than 721 to ensure the datasets contain the same pokemon, and remove the unnecessary id column
pokemon2 %<>%
filter(pokedex<722) %>%
select(-id)
head(pokemon2)
dim(pokemon2)
[1] 936 25
#remove duplicates
pokemon2 %<>% distinct(pokedex, .keep_all = TRUE)
#CHECK FOR DUPLICATES
any(duplicated(pokemon2$pokedex))
[1] FALSE
dim(pokemon2)
[1] 721 25
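The duplicate handling above can be sketched on a toy data frame. The rows and the name format below are made up for illustration; the point is that distinct() keeps the first row it meets for each key, and that a bare "Mega" pattern also catches the ordinary pokemon Meganium.

```r
library(dplyr)

# toy data: pokedex 3 appears twice (a base form and a 'Mega' form)
toy <- data.frame(
  pokedex = c(3, 3, 154),
  Name = c("Venusaur", "VenusaurMega Venusaur", "Meganium"),
  stringsAsFactors = FALSE
)

any(duplicated(toy$pokedex))   # checks for repeated pokedex numbers

# distinct() keeps the first row for each pokedex value
toy_unique <- distinct(toy, pokedex, .keep_all = TRUE)
toy_unique$Name

# caution: the pattern "Mega" matches Meganium as well as the Mega form
grepl("Mega", toy$Name)
```

Because only the first row per key survives, the row order of the input decides which form is kept.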
The datasets have been merged using a full join with pokedex as the key. The pokedex acts like an index, with each unique pokemon having a reference on the pokedex. The full join has been used to return all rows and all columns from both the pokemon1 and pokemon2 datasets. If there happens to be no matching value, NA is returned for the missing values.
#JOIN DATASETS
pokemon_df <- pokemon1 %>%
full_join(pokemon2, by = 'pokedex')
dim(pokemon_df)
[1] 721 38
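A minimal sketch of the full join behaviour described above, using two invented data frames that only partly overlap on pokedex. Keys present in just one table are kept, and the columns from the other table are filled with NA.

```r
library(dplyr)

left  <- data.frame(pokedex = c(1, 2, 3), hp = c(45, 60, 80))
right <- data.frame(pokedex = c(2, 3, 4),
                    weight = c("28.7 lbs.", "220.5 lbs.", "18.7 lbs."),
                    stringsAsFactors = FALSE)

# every pokedex value from either table appears once in the result
joined <- full_join(left, right, by = "pokedex")
joined
```

In the report both tables hold the same 721 pokedex numbers after tidying, so the join returns 721 rows.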
The str() function shows the type of each variable. As stringsAsFactors = FALSE was used when reading the data, there are no factors in this list; however, type1 should be an ordered factor, as certain types are stronger than others, and the date will also need to be converted. The summary() function shows the range of each variable: where it is numeric, the min, max, mean, median and quartiles 1 and 3 are shown; where it is a character variable, the length, class and mode are shown.
str(pokemon_df)
'data.frame': 721 obs. of 38 variables:
$ pokedex : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ Type.1 : chr "Grass" "Grass" "Grass" "Fire" ...
$ Type.2 : chr "Poison" "Poison" "Poison" NA ...
$ Total : int 318 405 525 309 405 534 314 405 530 195 ...
$ HP : int 45 60 80 39 58 78 44 59 79 45 ...
$ Attack : int 49 62 82 52 64 84 48 63 83 30 ...
$ Defense : int 49 63 83 43 58 78 65 80 100 35 ...
$ Sp..Atk : int 65 80 100 60 80 109 50 65 85 20 ...
$ Sp..Def : int 65 80 100 50 65 85 64 80 105 20 ...
$ Speed : int 45 60 80 65 80 100 43 58 78 45 ...
$ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
$ Legendary : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Date.first.spawned: chr "12/12/1995" "10/11/2018" "16/11/1987" "30/05/1996" ...
$ species : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ forme : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ type1 : chr "Grass" "Grass" "Grass" "Fire" ...
$ type2 : chr "Poison" "Poison" "Poison" NA ...
$ ability1 : chr "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
$ ability2 : chr NA NA NA NA ...
$ abilityH : chr "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
$ hp : int 45 60 80 39 58 78 44 59 79 45 ...
$ attack : int 49 62 82 52 64 84 48 63 83 30 ...
$ defense : int 49 63 83 43 58 78 65 80 100 35 ...
$ spattack : int 65 80 100 60 80 109 50 65 85 20 ...
$ spdefense : int 65 80 100 50 65 85 64 80 105 20 ...
$ speed : int 45 60 80 65 80 100 43 58 78 45 ...
$ total : int 318 405 525 309 405 534 314 405 530 195 ...
$ weight : chr "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
$ height : chr "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
$ dex1 : chr NA NA NA NA ...
$ dex2 : chr NA NA NA NA ...
$ class : chr "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
$ percent.male : num 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
$ percent.female : num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
$ pre.evolution : chr NA "Bulbasaur" "Ivysaur" NA ...
$ egg.group1 : chr "Monster" "Monster" "Monster" "Monster" ...
$ egg.group2 : chr "Grass" "Grass" "Grass" "Dragon" ...
summary(pokemon_df)
pokedex Name Type.1 Type.2 Total HP
Min. : 1 Length:721 Length:721 Length:721 Min. :180.0 Min. : 1.00
1st Qu.:181 Class :character Class :character Class :character 1st Qu.:320.0 1st Qu.: 50.00
Median :361 Mode :character Mode :character Mode :character Median :424.0 Median : 65.00
Mean :361 Mean :417.9 Mean : 68.38
3rd Qu.:541 3rd Qu.:499.0 3rd Qu.: 80.00
Max. :721 Max. :720.0 Max. :255.00
Attack Defense Sp..Atk Sp..Def Speed Generation
Min. : 5.00 Min. : 5.0 Min. : 10.00 Min. : 20.00 Min. : 5.00 Min. :1.000
1st Qu.: 54.00 1st Qu.: 50.0 1st Qu.: 45.00 1st Qu.: 50.00 1st Qu.: 45.00 1st Qu.:2.000
Median : 75.00 Median : 65.0 Median : 65.00 Median : 65.00 Median : 65.00 Median :3.000
Mean : 75.12 Mean : 70.7 Mean : 68.85 Mean : 69.18 Mean : 65.71 Mean :3.323
3rd Qu.: 95.00 3rd Qu.: 85.0 3rd Qu.: 90.00 3rd Qu.: 85.00 3rd Qu.: 85.00 3rd Qu.:5.000
Max. :165.00 Max. :230.0 Max. :154.00 Max. :230.00 Max. :160.00 Max. :6.000
Legendary Date.first.spawned species forme type1 type2
Mode :logical Length:721 Length:721 Length:721 Length:721 Length:721
FALSE:675 Class :character Class :character Class :character Class :character Class :character
TRUE :46 Mode :character Mode :character Mode :character Mode :character Mode :character
ability1 ability2 abilityH hp attack defense
Length:721 Length:721 Length:721 Min. : 1.00 Min. : 5.0 Min. : 5.00
Class :character Class :character Class :character 1st Qu.: 50.00 1st Qu.: 53.0 1st Qu.: 50.00
Mode :character Mode :character Mode :character Median : 65.00 Median : 75.0 Median : 66.00
Mean : 68.53 Mean : 75.1 Mean : 70.96
3rd Qu.: 80.00 3rd Qu.: 95.0 3rd Qu.: 85.00
Max. :255.00 Max. :165.0 Max. :230.00
spattack spdefense speed total weight height
Min. : 10.00 Min. : 20.0 Min. : 5.00 Min. :180.0 Length:721 Length:721
1st Qu.: 45.00 1st Qu.: 50.0 1st Qu.: 45.00 1st Qu.:320.0 Class :character Class :character
Median : 65.00 Median : 65.0 Median : 65.00 Median :425.0 Mode :character Mode :character
Mean : 68.81 Mean : 69.4 Mean : 65.78 Mean :418.6
3rd Qu.: 90.00 3rd Qu.: 85.0 3rd Qu.: 85.00 3rd Qu.:500.0
Max. :154.00 Max. :230.0 Max. :160.00 Max. :720.0
dex1 dex2 class percent.male percent.female pre.evolution
Length:721 Length:721 Length:721 Min. :0.000 Min. :0.000 Length:721
Class :character Class :character Class :character 1st Qu.:0.500 1st Qu.:0.500 Class :character
Mode :character Mode :character Mode :character Median :0.500 Median :0.500 Mode :character
Mean :0.558 Mean :0.442
3rd Qu.:0.500 3rd Qu.:0.500
Max. :1.000 Max. :1.000
NA's :72 NA's :72
egg.group1 egg.group2
Length:721 Length:721
Class :character Class :character
Mode :character Mode :character
type1 has been converted to a factor, and ordered levels have then been applied. Date.first.spawned has been converted from character to Date.
#convert type1 to a factor
pokemon_df$type1 <- as.factor(pokemon_df$type1)
#order variables that have a hierarchy
pokemon_df$type1 <- factor(pokemon_df$type1,
levels=c('Dragon',
'Fairy',
'Water',
'Steel',
'Fighting',
'Dark',
'Flying',
'Fire',
'Ghost',
'Ground',
'Normal',
'Psychic',
'Electric',
'Poison',
'Grass',
'Rock',
'Bug',
'Ice'),
ordered=TRUE)
str(pokemon_df$type1)
Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
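#convert date from character to date; the strings are day/month/year, so the format is given explicitly
#(without a format, as.Date() misreads "12/12/1995" as a date in year 0012, as in the str() output recorded below)
pokemon_df$Date.first.spawned <- as.Date(pokemon_df$Date.first.spawned, format = "%d/%m/%Y")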
str(pokemon_df)
'data.frame': 721 obs. of 38 variables:
$ pokedex : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ Type.1 : chr "Grass" "Grass" "Grass" "Fire" ...
$ Type.2 : chr "Poison" "Poison" "Poison" NA ...
$ Total : int 318 405 525 309 405 534 314 405 530 195 ...
$ HP : int 45 60 80 39 58 78 44 59 79 45 ...
$ Attack : int 49 62 82 52 64 84 48 63 83 30 ...
$ Defense : int 49 63 83 43 58 78 65 80 100 35 ...
$ Sp..Atk : int 65 80 100 60 80 109 50 65 85 20 ...
$ Sp..Def : int 65 80 100 50 65 85 64 80 105 20 ...
$ Speed : int 45 60 80 65 80 100 43 58 78 45 ...
$ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
$ Legendary : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Date.first.spawned: Date, format: "0012-12-19" "0010-11-20" "0016-11-19" "0030-05-19" ...
$ species : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ forme : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ type1 : Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
$ type2 : chr "Poison" "Poison" "Poison" NA ...
$ ability1 : chr "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
$ ability2 : chr NA NA NA NA ...
$ abilityH : chr "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
$ hp : int 45 60 80 39 58 78 44 59 79 45 ...
$ attack : int 49 62 82 52 64 84 48 63 83 30 ...
$ defense : int 49 63 83 43 58 78 65 80 100 35 ...
$ spattack : int 65 80 100 60 80 109 50 65 85 20 ...
$ spdefense : int 65 80 100 50 65 85 64 80 105 20 ...
$ speed : int 45 60 80 65 80 100 43 58 78 45 ...
$ total : int 318 405 525 309 405 534 314 405 530 195 ...
$ weight : chr "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
$ height : chr "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
$ dex1 : chr NA NA NA NA ...
$ dex2 : chr NA NA NA NA ...
$ class : chr "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
$ percent.male : num 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
$ percent.female : num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
$ pre.evolution : chr NA "Bulbasaur" "Ivysaur" NA ...
$ egg.group1 : chr "Monster" "Monster" "Monster" "Monster" ...
$ egg.group2 : chr "Grass" "Grass" "Grass" "Dragon" ...
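The two conversions above can be tried on small vectors. This is a sketch with only three of the 18 types and two of the date strings shown in the str() output; the level order is shortened for illustration.

```r
# ordered factor: the order of 'levels' defines the ranking
types <- c("Grass", "Fire", "Dragon", "Fire")
types_ord <- factor(types, levels = c("Dragon", "Fire", "Grass"), ordered = TRUE)

types_ord < "Grass"     # comparisons follow the level order
as.integer(types_ord)   # position of each value in the level order

# dates stored as day/month/year need an explicit format string
raw_dates <- c("12/12/1995", "30/05/1996")
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
format(parsed)
```

Without the format argument, as.Date() tries year-first layouts, which is how a string such as "12/12/1995" ends up as a date in year 0012.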
To check if this data is tidy, comparisons between the similar variables have been made. Firstly, identical() is used to see whether the two columns are identical; if not, which() is used to see which observations are different. Each pair of similar variables differed in some observations (between 2 and 24), so a decision to keep the variables from pokemon2 has been made. Some variables have been renamed to avoid confusion (e.g. the variable ‘class’ could be confused with class()). ‘forme’ has been changed to ‘form’. Any instances of ‘Pokémon’ have been dropped from the class/pokeclass column.
#compare similar variables
identical(pokemon_df[['Name']],pokemon_df[['species']])
[1] FALSE
pokemon_df[which(pokemon_df$Name != pokemon_df$species), ]
identical(pokemon_df[['Type.1']],pokemon_df[['type1']])
[1] FALSE
pokemon_df[which(pokemon_df$Type.1 != pokemon_df$type1), ]
identical(pokemon_df[['Type.2']],pokemon_df[['type2']])
[1] FALSE
pokemon_df[which(pokemon_df$Type.2 != pokemon_df$type2), ]
identical(pokemon_df[['HP']],pokemon_df[['hp']])
[1] FALSE
pokemon_df[which(pokemon_df$HP != pokemon_df$hp), ]
identical(pokemon_df[['Attack']],pokemon_df[['attack']])
[1] FALSE
pokemon_df[which(pokemon_df$Attack != pokemon_df$attack), ]
identical(pokemon_df[['Defense']],pokemon_df[['defense']])
[1] FALSE
pokemon_df[which(pokemon_df$Defense != pokemon_df$defense), ]
identical(pokemon_df[['Sp..Atk']],pokemon_df[['spattack']])
[1] FALSE
pokemon_df[which(pokemon_df$Sp..Atk != pokemon_df$spattack), ]
identical(pokemon_df[['Sp..Def']],pokemon_df[['spdefense']])
[1] FALSE
pokemon_df[which(pokemon_df$Sp..Def != pokemon_df$spdefense), ]
identical(pokemon_df[['Speed']],pokemon_df[['speed']])
[1] FALSE
pokemon_df[which(pokemon_df$Speed != pokemon_df$speed), ]
identical(pokemon_df[['Total']],pokemon_df[['total']])
[1] FALSE
pokemon_df[which(pokemon_df$Total != pokemon_df$total), ]
#drop unnecessary columns - keep the values from pokemon data2
pokemon_df2 <- select(pokemon_df, -Name, -HP, -Attack, -Defense, -Sp..Atk, -Sp..Def, -Speed, -Total, -dex1, -dex2, -Type.1, -Type.2)
#rename class to avoid confusion with class()
pokemon_df2 %<>%
rename(pokeclass = class)
#rename 'forme' to form
pokemon_df2 %<>%
rename(form = forme)
#remove the word 'Pokémon' from the pokeclass column (a trailing space is left behind, e.g. "Seed ")
pokemon_df2$pokeclass %<>% str_replace("Pokémon","")
head(pokemon_df2)
dim(pokemon_df2)
[1] 721 26
str(pokemon_df2)
'data.frame': 721 obs. of 26 variables:
$ pokedex : int 1 2 3 4 5 6 7 8 9 10 ...
$ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
$ Legendary : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Date.first.spawned: Date, format: "0012-12-19" "0010-11-20" "0016-11-19" "0030-05-19" ...
$ species : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ form : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
$ type1 : Ord.factor w/ 18 levels "Dragon"<"Fairy"<..: 15 15 15 8 8 8 3 3 3 17 ...
$ type2 : chr "Poison" "Poison" "Poison" NA ...
$ ability1 : chr "Overgrow" "Overgrow" "Overgrow" "Blaze" ...
$ ability2 : chr NA NA NA NA ...
$ abilityH : chr "Chlorophyll" "Chlorophyll" "Chlorophyll" "Solar Power" ...
$ hp : int 45 60 80 39 58 78 44 59 79 45 ...
$ attack : int 49 62 82 52 64 84 48 63 83 30 ...
$ defense : int 49 63 83 43 58 78 65 80 100 35 ...
$ spattack : int 65 80 100 60 80 109 50 65 85 20 ...
$ spdefense : int 65 80 100 50 65 85 64 80 105 20 ...
$ speed : int 45 60 80 65 80 100 43 58 78 45 ...
$ total : int 318 405 525 309 405 534 314 405 530 195 ...
$ weight : chr "15.2 lbs." "28.7 lbs." "220.5 lbs." "18.7 lbs." ...
$ height : chr "2'04\"" "3'03\"" "6'07\"" "2'00\"" ...
$ pokeclass : chr "Seed " "Seed " "Seed " "Lizard " ...
$ percent.male : num 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
$ percent.female : num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.5 ...
$ pre.evolution : chr NA "Bulbasaur" "Ivysaur" NA ...
$ egg.group1 : chr "Monster" "Monster" "Monster" "Monster" ...
$ egg.group2 : chr "Grass" "Grass" "Grass" "Dragon" ...
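The column comparison used above can be sketched on an invented pair of columns. identical() answers whether the whole vectors match, while which() returns the positions that differ; comparisons involving NA are dropped by which().

```r
toy <- data.frame(
  HP = c(45, 60, 80, NA),
  hp = c(45, 61, 80, NA)
)

identical(toy$HP, toy$hp)             # whole-vector check

mismatch <- which(toy$HP != toy$hp)   # positions where the values differ
mismatch
toy[mismatch, ]                       # the rows to inspect
```

Note that identical() is also sensitive to type, so a character column and a factor column with the same labels are reported as not identical even when which() finds no differing rows.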
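A new column called ‘totalattack’ has been created using mutate(), adding together attack and spattack for each pokemon. The result is displayed rather than stored, so the outputs that follow do not include this column.
#create total attack points column (printed only; assign the result back to pokemon_df2 to keep the column)
mutate(pokemon_df2, totalattack = attack+spattack)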
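A short sketch of the difference between printing and storing the result of mutate(), on an invented two-row data frame. mutate() returns a modified copy and leaves its input untouched unless the result is assigned.

```r
library(dplyr)

toy <- data.frame(attack = c(49, 62), spattack = c(65, 80))

mutate(toy, totalattack = attack + spattack)   # prints a modified copy only
names(toy)                                     # toy itself is unchanged

# assigning the result keeps the new column
toy <- mutate(toy, totalattack = attack + spattack)
toy$totalattack
```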
colSums() has been used to see how many missing values are in each column. At this stage, 7 columns have missing values. The percent.male and percent.female variables have been imputed with the mean, using impute(). The variables type2, ability2, pre.evolution and egg.group2 have blank values in some observations, but this is not missing data, as not all pokemon have these attributes. Where there is an NA in these variables, the NA is replaced with a string to make it clear that the attribute is not applicable rather than missing. Missing abilityH values are replaced with the mode, i.e. the most commonly appearing abilityH.
#Look for missing values
colSums(is.na(pokemon_df2))
pokedex Generation Legendary Date.first.spawned species
0 0 0 0 0
form type1 type2 ability1 ability2
0 0 372 0 310
abilityH hp attack defense spattack
83 0 0 0 0
spdefense speed total weight height
0 0 0 0 0
pokeclass percent.male percent.female pre.evolution egg.group1
0 72 72 365 0
egg.group2
530
#impute missing values with the mean
pokemon_df2$percent.male %<>% impute(fun = mean)
pokemon_df2$percent.female %<>% impute(fun = mean)
#type2, ability2, pre.evolution and egg.group2 have many NAs, however this is not missing data:
#these pokemon do not have these attributes, so the columns should not be dropped. NAs are filled with a descriptive string
pokemon_df2$type2 %<>% replace_na("no type 2")
pokemon_df2$ability2 %<>% replace_na("no ability 2")
pokemon_df2$pre.evolution %<>% replace_na("no pre-evolution")
pokemon_df2$egg.group2 %<>% replace_na("no egg group 2")
#abilityH will be replaced with the most common value
#(base R's mode() returns the storage type, e.g. "character", so the most frequent value is found with table())
pokemon_df2$abilityH %<>% impute(fun = function(x) names(which.max(table(x))))
#Look again for missing values
colSums(is.na(pokemon_df2))
pokedex Generation Legendary Date.first.spawned species
0 0 0 0 0
form type1 type2 ability1 ability2
0 0 0 0 0
abilityH hp attack defense spattack
0 0 0 0 0
spdefense speed total weight height
0 0 0 0 0
pokeclass percent.male percent.female pre.evolution egg.group1
0 0 0 0 0
egg.group2
0
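The three treatments of NA described above can be sketched in base R on small invented vectors. This is an illustration of the idea rather than the Hmisc and tidyr calls used in the report.

```r
# numeric column: fill NA with the mean of the observed values
percent_male <- c(0.875, 0.5, NA, 0.5)
percent_male[is.na(percent_male)] <- mean(percent_male, na.rm = TRUE)
percent_male

# character column: fill NA with the most frequent observed value
# (table() ignores NA by default; which.max() picks the largest count)
ability <- c("Overgrow", "Blaze", "Overgrow", NA)
most_common <- names(which.max(table(ability)))
ability[is.na(ability)] <- most_common
ability

# attribute that does not apply: use an explicit label instead of NA
type2 <- c("Poison", NA)
type2[is.na(type2)] <- "no type 2"
type2
```

Base R's mode() reports how a vector is stored (for example "character"), so it is not a substitute for the most frequent value.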
As there are many numeric variables, a multivariate analysis has been chosen. Firstly, a subset of only numeric variables is made. As all missing values have been taken care of, the additional step of removing missing values is not needed. MVN is used to find the multivariate outliers and return their observation numbers. The data is then subset again to exclude the observations that contain the outliers.
# Subset the dataset to relevant numeric variables
pokemon_sub <- pokemon_df2 %>% select(hp, attack,defense, spattack, spdefense, speed, total)
dim(pokemon_sub)
[1] 721 7
# Find and exclude the multivariate outliers
pokeresults <- pokemon_sub %>%
MVN::mvn(multivariateOutlierMethod = "quan",
showOutliers = TRUE)
The covariance matrix has become singular during
the iterations of the MCD algorithm.
There are 721 observations (in the entire dataset of 721 obs.) lying on the hyperplane with equation
a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the mean of these observations
and coefficients a_i from the vector a <- c(-0.3779645, -0.3779645, -0.3779645, -0.3779645,
-0.3779645, -0.3779645, 0.3779645)
pokeresults$multivariateOutliers
#take the observation numbers of the outliers, convert them from character to numeric, and keep only those rows
pokemon_outliers <- pokemon_sub[c(as.numeric(pokeresults$multivariateOutliers[["Observation"]])), ]
#subset the dataset by excluding the rows with the specified observation numbers
pokemon_clean <- pokemon_sub[-c(as.numeric(pokeresults$multivariateOutliers[["Observation"]])), ]
dim(pokemon_clean)
[1] 713 7
dim(pokemon_outliers)
[1] 8 7
dim(pokemon_sub)
[1] 721 7
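The singular covariance message printed by mvn() above lists coefficients of equal size on the six stats and the opposite sign on total, which is consistent with total being the exact sum of the other six columns. The sketch below reproduces that situation on simulated data; the column names a, b and c are placeholders, and dropping total is shown only as one possible remedy that was not applied in this report.

```r
set.seed(1)

# toy stats: three made-up columns plus their exact row sum
stats <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
stats$total <- stats$a + stats$b + stats$c

# four columns, but one is a linear combination of the others,
# so the covariance matrix is rank deficient (singular)
rank_with_total <- qr(cov(stats))$rank
rank_with_total

# without the redundant column the covariance matrix has full rank
rank_without_total <- qr(cov(stats[, c("a", "b", "c")]))$rank
rank_without_total
```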
The histogram of hp is strongly right-skewed. A Box-Cox transformation has been applied, with lambda found automatically. After the transformation, the shape of the histogram is closer to normal; however, there is now a slight left skew.
#create a histogram
hist(pokemon_df2$hp)
#Apply Box-Cox transformation
boxcox_hp <- BoxCox(pokemon_df2$hp, lambda = "auto")
attr(boxcox_hp, which = "lambda")
[1] 0.1990125
hist(boxcox_hp)
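To make the transformation above concrete, the sketch below applies the standard one-parameter Box-Cox formula, (x^lambda - 1) / lambda, to a few hp values with the lambda reported above. The helper box_cox() is hypothetical and written for this example; forecast::BoxCox() may handle non-positive values differently.

```r
# hypothetical helper: one-parameter Box-Cox transform for positive values
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

hp <- c(45, 60, 80, 39, 58, 78, 255)   # a few hp values, including the maximum
lambda <- 0.1990125                    # the lambda reported above

transformed <- box_cox(hp, lambda)
round(transformed, 2)

# the transform can be inverted to recover the original values
recovered <- (lambda * transformed + 1)^(1 / lambda)
recovered
```

Because the transform is monotonic for positive values, the ordering of the pokemon by hp is unchanged while the long right tail is compressed.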
##References
Biostars. (2020, 02 01). Question: Find mismatch in two columns in a data frame in R. Retrieved from Biostars; bioinformatics explained: https://www.biostars.org/p/180451/
DataNovia. (2021, 01 02). Identify and Remove Duplicate Data in R . Retrieved from DataNovia: https://www.datanovia.com/en/lessons/identify-and-remove-duplicate-data-in-r/
Fontes, R. (2021, 02 12). Pokémon: Every Elemental Type, Officially Ranked. Retrieved from Thegamer: https://www.thegamer.com/pokemon-elemental-types-ranked-officially/
intelliPaat. (2021, 02 11). Change the Blank Cells to “NA”. Retrieved from intellipaat: https://intellipaat.com/community/27103/change-the-blank-cells-to-na
Wickham, H. (2021, 02 11). Replace NAs with specified values. Retrieved from tidyr: https://tidyr.tidyverse.org/reference/replace_na.html