# Loading Necessary packages
library(dplyr)
library(lubridate)
library(tidyr)
library(Hmisc)
library(editrules)
library(deducorrect)
library(magrittr)
library(outliers)
library(forecast)
library(mlr)
library(MVN)
Data pre-processing is the process of transforming raw, unprocessed data into a more refined and structured format before using it for any analysis. In this assignment we showcase the complete data cleaning process on the superhero dataset from Kaggle, covering all the essential steps: merging two datasets, removing unneeded observations and variables, tidying the dataset according to tidy data principles, checking for missing values, checking for outliers, and applying the other steps required in the assignment.
We merge the datasets heroes_information.csv and super_hero_powers.csv on the common character-name variable (name in the first file, hero_names in the second) and remove unneeded variables.
After analysing all the variables in the merged dataset hero_stats, we convert the required variables into suitable data types.
Before tidying the merged dataset, it is better to deal with missing values and obvious errors, as fewer iterations will be required.
We check the data for validity and inconsistencies; if any rule is violated, we correct it using explicit rules.
After dealing with missing values and inconsistencies, we tidy the data.
We create a new variable, BMI, by mutating the Height and Weight of every record. For this we first convert Height and Weight to metres and kilograms respectively.
Further, we check for outliers in Height and Weight and handle them using a suitable method.
We transform Height and Weight using the Box-Cox transformation to reduce the skewness of their distributions.
The assignment is based on the Super Heroes Dataset from Kaggle. It includes information on superheroes and villains from various comic publishers such as DC, Marvel and many others. It consists of two datasets: one containing personal information about each character and the other containing the abilities/superpowers of each character.
heroes_information.csv:
name [Character]: Name of hero/villain
Gender [Factor]: Male or Female
Eyecolor [Factor]: Colour of eyes
Race [Factor]: Species the hero/villain belongs to
HairColor [Factor]: Colour of hair
Height [Numeric]: Height measurement in cm
Publisher [Factor]: Comic publisher
Skincolor [Factor]: Colour of skin
Alignment [Factor]: Disposition of the character (good, bad or neutral)
Weight [Numeric]: Weight of the character in pounds
super_hero_powers.csv:
hero_names [Character]: Name of hero/villain
Columns 2 - 168 (Agility - Omniscient) [Logical]: Whether the superpower exists for the character
Hero_info <- read.csv('heroes_information.csv', stringsAsFactors = FALSE )
Hero_powers<- read.csv('super_hero_powers.csv', stringsAsFactors = FALSE)
dim(Hero_info)
[1] 734 11
head(Hero_info[,1:6],3)
head(Hero_info[,6:11],3)
dim(Hero_powers)
[1] 667 168
head(Hero_powers[,1:6],3)
head(Hero_powers[,163:168],3)
The Hero_info dataset comprises 734 observations and 11 variables.
The Hero_powers dataset comprises 667 observations and 168 variables.
hero_stats <- inner_join(Hero_info, Hero_powers, by = c("name" = "hero_names"))
# Removing the unneeded serial-number column
hero_stats <- hero_stats[,-1]
# Dimensions of hero_stats
dim(hero_stats)
[1] 660 177
head(hero_stats[1:8],3)
head(hero_stats[9:15],3)
head(hero_stats[173:177],3)
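An inner join keeps only the names that appear in both files, so it can be useful to list the names that would be dropped before merging. The following is a minimal base-R sketch on made-up data; the toy data frames `info` and `powers` and their values are illustrative assumptions, not rows from the dataset.

```r
# Toy stand-ins for the two files (values are made up for illustration)
info <- data.frame(name = c("A", "B", "C"), Height = c(180, 175, 190),
                   stringsAsFactors = FALSE)
powers <- data.frame(hero_names = c("B", "C", "D"),
                     Flight = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

# Names that an inner join would drop from each side
only_in_info <- setdiff(info$name, powers$hero_names)
only_in_powers <- setdiff(powers$hero_names, info$name)

# Base-R equivalent of an inner join on name = hero_names
merged <- merge(info, powers, by.x = "name", by.y = "hero_names")
print(merged)
```

Here only "B" and "C" survive the join, while "A" and "D" are reported as unmatched.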
Most of the string variables, such as Gender, Eye color, Race, Hair color, Skin color, Publisher and Alignment, need to be in categorical format. Before categorising them, we first introduce NA wherever “-” or another special character is present. Once the NAs are introduced, we proceed to type conversion.
Since every ability is a binary choice for each superhero, we convert all superpower variables (columns 11 to 177) from character to logical format.
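To see why the placeholder has to be replaced before the conversion, consider this small sketch on a made-up vector (the values are illustrative only). If "-" is left in place, it becomes a category of its own instead of a missing value.

```r
# Toy column using "-" as the missing-value placeholder (made-up values)
eye <- c("blue", "-", "green", "blue")

# Converting directly keeps "-" as a real category
levels(factor(eye))

# Replacing the placeholder first keeps it out of the levels
eye_clean <- eye
eye_clean[eye_clean == "-"] <- NA
eye_factor <- factor(eye_clean)

levels(eye_factor)       # only the genuine colours remain
sum(is.na(eye_factor))   # the placeholder is now counted as missing
```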
# Introducing NAs for special characters before data type conversion
hero_stats[hero_stats == '-'] <- NA
# Proper data type conversion
## Converting from character to logical format
hero_stats[,11:177] <- lapply(hero_stats[,11:177], as.logical)
## Converting from character to factor
hero_stats$Gender <- factor(hero_stats$Gender)
hero_stats$Eye.color <- factor(hero_stats$Eye.color)
hero_stats$Race <- factor(hero_stats$Race)
hero_stats$Hair.color <- factor(hero_stats$Hair.color)
hero_stats$Publisher <- factor(hero_stats$Publisher)
hero_stats$Skin.color <- factor(hero_stats$Skin.color)
hero_stats$Alignment <- factor(hero_stats$Alignment)
# Displaying the structure of the first 20 variables
# (str() prints directly and returns NULL, so piping it into head() has no effect)
str(hero_stats, list.len = 20)
'data.frame': 660 obs. of 177 variables:
$ name : chr "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Eye.color : Factor w/ 22 levels "amber","black",..: 19 3 3 8 3 3 3 3 6 NA ...
$ Race : Factor w/ 60 levels "Alien","Alpha",..: 23 32 54 31 11 23 NA 23 23 NA ...
$ Hair.color : Factor w/ 29 levels "Auburn","black",..: 17 17 17 17 3 17 6 6 8 NA ...
$ Height : num 203 191 185 203 -99 193 -99 185 178 191 ...
$ Publisher : Factor w/ 25 levels "","ABC Studios",..: 13 3 4 13 13 13 15 4 13 13 ...
$ Skin.color : Factor w/ 16 levels "black","blue",..: NA 2 12 NA NA NA NA NA NA NA ...
$ Alignment : Factor w/ 3 levels "bad","good","neutral": 2 2 2 1 1 1 2 2 2 2 ...
$ Weight : num 441 65 90 441 -99 122 -99 88 81 104 ...
$ Agility : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ Accelerated.Healing : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
$ Lantern.Power.Ring : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ Dimensional.Awareness : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ Cold.Resistance : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
$ Durability : logi TRUE TRUE FALSE FALSE FALSE TRUE ...
$ Stealth : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Energy.Absorption : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ Flight : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ Danger.Sense : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
[list output truncated]
Scan the data for missing values, special values and obvious errors (i.e. inconsistencies).
Before tidying the dataset, it is better to deal with missing values and obvious errors. We do so by first identifying special values and missing values.
# Check for missing values in every variable
hero_stats %>% is.na() %>% colSums() %>% head(30)
name Gender Eye.color Race Hair.color Height
0 18 131 247 132 0
Publisher Skin.color Alignment Weight Agility Accelerated.Healing
0 588 7 2 0 0
Lantern.Power.Ring Dimensional.Awareness Cold.Resistance Durability Stealth Energy.Absorption
0 0 0 0 0 0
Flight Danger.Sense Underwater.breathing Marksmanship Weapons.Master Power.Augmentation
0 0 0 0 0 0
Animal.Attributes Longevity Intelligence Super.Strength Cryokinesis Telepathy
0 0 0 0 0 0
# Check for Special Values
##checking for NaN values
sapply(hero_stats, function(x) sum(is.nan(x))) %>% sum()
[1] 0
## checking for infinite values
sapply(hero_stats, function(x) sum( is.infinite(x))) %>% sum()
[1] 0
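The three kinds of special value can also be counted side by side for the numeric columns. The sketch below uses a made-up data frame and a hypothetical helper, `count_special`, which is not part of the original analysis. Note that `is.na()` is also TRUE for NaN, so the helper separates the two.

```r
# Toy data frame with one NA, one NaN and one Inf (made-up values)
toy <- data.frame(Height = c(180, NaN, 175), Weight = c(Inf, 70, NA))

# Hypothetical helper: count each kind of special value in a numeric vector
count_special <- function(x) {
  c(na  = sum(is.na(x) & !is.nan(x)),  # is.na() is TRUE for NaN too, so exclude it
    nan = sum(is.nan(x)),
    inf = sum(is.infinite(x)))
}

# Apply only to numeric columns, where NaN and Inf can actually occur
num_cols <- vapply(toy, is.numeric, logical(1))
special <- sapply(toy[num_cols], count_special)
print(special)
```

The result is a small matrix with one column per variable and one row per kind of special value.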
We can observe that all of the missing values occur in 7 variables, of which 6 are categorical and 1 is numerical. We deal with each variable in the most suitable way possible.
For Gender, Eye color, Hair color and Skin color, we replace the missing values with the most frequently occurring value.
For Race, since many of these characters are from space, we treat characters with a missing race as Alien.
Since the alignment of a character can be good, bad or neutral, we treat characters with a missing alignment as neutral.
For the two missing Weight observations, we replace them with the mean weight of all observations.
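The two imputation strategies above (most frequent value for categorical variables, mean for numeric ones) can be sketched in base R as follows. The helper `most_frequent` and the toy vectors are illustrative assumptions. A helper is needed because base R's `mode()` reports an object's storage mode, not its most frequent value.

```r
# Hypothetical helper: most frequent non-missing value of a vector
most_frequent <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]   # the first level wins on ties
}

# Toy factor with two missing entries (made-up values)
hair <- factor(c("Black", "Blond", NA, "Black", NA))
hair_imputed <- hair
hair_imputed[is.na(hair_imputed)] <- most_frequent(hair)

# Numeric counterpart: fill with the mean of the observed values
weight <- c(80, NA, 100)
weight[is.na(weight)] <- mean(weight, na.rm = TRUE)

print(hair_imputed)
print(weight)
```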
# For Gender, the table function shows that about 72 percent of characters are Male.
hero_stats$Gender %>% table() %>% prop.table()
.
Female Male
0.2772586 0.7227414
# We assume that the 18 characters with missing gender are Male as well
hero_stats$Gender[is.na(hero_stats$Gender)]<- 'Male'
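# For Eye Color
# (for a factor, Hmisc::impute() fills with the most frequent level;
# base R's mode() returns the storage mode, not the statistical mode)
hero_stats$Eye.color <- Hmisc::impute(hero_stats$Eye.color, fun = mode)
# For Hair Color
hero_stats$Hair.color <- Hmisc::impute(hero_stats$Hair.color, fun = mode)
# For Skin Color
hero_stats$Skin.color <- Hmisc::impute(hero_stats$Skin.color, fun = mode)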
# For Race
hero_stats$Race[is.na(hero_stats$Race)] <- 'Alien'
# For Alignment
hero_stats$Alignment[is.na(hero_stats$Alignment)] <- 'neutral'
# For Weight
hero_stats$Weight <- Hmisc::impute(hero_stats$Weight, fun = mean)
## Check for remaining NAs
hero_stats %>% is.na() %>% sum()
[1] 0
To check for these common inconsistencies in the dataset we use the editrules and deducorrect packages.
Height > 0: if Height < 0, multiply it by -1
Weight > 0: if Weight < 0, multiply it by -1
Check for Violation of rules:
Rule1 <- editset(c("Weight > 0", "Height > 0"))
Rule1
Edit set:
num1 : 0 < Weight
num2 : 0 < Height
violations <- violatedEdits(Rule1,hero_stats)
violations %>% summary()
Edit violations, 660 observations, 0 completely missing (0%):
Edit violations per record:
Setting Rules
# Loading Rules from File
Rules <- correctionRules('Rules.txt')
Rules
Object of class 'correctionRules'
## 1-------
if (Height < 0) {
Height <- Height * -1
}
## 2-------
if (Weight < 0) {
Weight <- Weight * -1
}
Correcting with Rules
cor <- correctWithRules(Rules, hero_stats)
hero_stats <- cor$corrected
Re-check for violations
violations <- violatedEdits(Rule1,hero_stats)
violations %>% summary()
No violations detected, 0 checks evaluated to NA
NULL
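The same check, correct and re-check cycle can be sketched in base R without the rule packages. This is only an illustration on a made-up data frame, not a replacement for the editrules/deducorrect workflow above.

```r
# Toy records with negative entries (made-up values)
toy <- data.frame(Height = c(203, -99, 185), Weight = c(441, -99, 90))

# Check: which records break "Height > 0" or "Weight > 0"?
violates <- toy$Height <= 0 | toy$Weight <= 0

# Correct: flip the sign of negative values, as the two correction rules do
toy$Height <- ifelse(toy$Height < 0, toy$Height * -1, toy$Height)
toy$Weight <- ifelse(toy$Weight < 0, toy$Weight * -1, toy$Weight)

# Re-check: no record should violate the rules now
still_violates <- toy$Height <= 0 | toy$Weight <= 0
print(toy)
```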
In the dataset hero_stats, columns 11 to 177 hold the different abilities a hero/villain can have. Since these variables are really categories of a single superpower/ability variable rather than variables in themselves, the layout violates the tidy data principles.
To tidy the data we use the gather method on columns 11 to 177, where the key is Ability Type and the value is Exist.
hero_stats <- hero_stats %>% gather(11:177,key = 'Ability Type', value = 'Exist')
# Dimensions of the re-organised dataset
dim(hero_stats)
[1] 110220 12
head(hero_stats[,1:8])
head(hero_stats[,8:12])
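The reshaping turns every (character, ability) pair into its own row, which is why 660 characters and 167 ability columns give 660 × 167 = 110220 rows. A minimal base-R sketch of the same wide-to-long idea on made-up data follows; the toy table and column values are assumptions for illustration.

```r
# Toy wide table: one logical column per ability (made-up values)
wide <- data.frame(name = c("A", "B"),
                   Flight = c(TRUE, FALSE),
                   Magic = c(FALSE, TRUE),
                   stringsAsFactors = FALSE)

ability_cols <- c("Flight", "Magic")

# Stack the ability columns: one row per (character, ability) pair
long <- data.frame(
  name = rep(wide$name, times = length(ability_cols)),
  `Ability Type` = rep(ability_cols, each = nrow(wide)),
  Exist = unlist(wide[ability_cols], use.names = FALSE),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
print(long)
```

The long table has `nrow(wide) * length(ability_cols)` rows, mirroring the row count reported above.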
Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.
# Height in meters
hero_stats$Height <- hero_stats$Height * 0.01
# Weight in Kg
hero_stats$Weight <- hero_stats$Weight * 0.453592
# Adding variable BMI
hero_stats <- hero_stats %>% mutate(BMI = Weight/(Height*Height))
range(hero_stats$BMI)
[1] 3.692235e-02 1.138692e+03
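As a worked example of the conversion and the BMI formula (weight in kg divided by the square of height in metres), consider a single hypothetical record of 203 cm and 441 lb. The variable names below are illustrative only.

```r
# Hypothetical record: 203 cm tall, 441 lb
height_cm <- 203
weight_lb <- 441

height_m  <- height_cm * 0.01       # centimetres to metres
weight_kg <- weight_lb * 0.453592   # pounds to kilograms

# BMI = weight (kg) / height (m)^2
bmi <- weight_kg / height_m^2
round(bmi, 2)   # about 48.54
```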
# Checking for normality
par(mfrow=c(1,2))
hero_stats$Height %>% hist(main='Height distribution', xlab = 'Height in meters')
hero_stats$Weight %>% hist(main='Weight distribution', xlab = 'Weight in Kg')
# Outlier detection using Tukey's method
par(mfrow=c(1,2))
hero_stats$Height %>% boxplot(main =' Boxplot for Height')
hero_stats$Weight %>% as.numeric() %>% boxplot(main =' Boxplot for Weight')
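# Defining a function that caps outliers at the 5th and 95th percentiles
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  # values below Q1 - 1.5*IQR are set to the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  # values above Q3 + 1.5*IQR are set to the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}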
hero_stats$Height <- hero_stats$Height %>% cap()
hero_stats$Weight <- hero_stats$Weight %>% cap()
hero_stats$Height %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.152 0.990 1.780 1.611 1.880 3.050
hero_stats$Weight %>% summary()
2 values imputed to 23.69915
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.33 32.66 44.91 49.03 44.91 122.47
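The effect of the capping rule is easiest to see on a small vector. The sketch below restates the same logic with named fences under an illustrative name, `cap_sketch`, and applies it to made-up data with one extreme value.

```r
# Same capping logic as above, written with named fences (illustrative name)
cap_sketch <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95), names = FALSE)
  lower_fence <- q[2] - 1.5 * IQR(x)
  upper_fence <- q[3] + 1.5 * IQR(x)
  x[x < lower_fence] <- q[1]   # low outliers go to the 5th percentile
  x[x > upper_fence] <- q[4]   # high outliers go to the 95th percentile
  x
}

toy <- c(1:19, 100)   # made-up data with one extreme value
capped <- cap_sketch(toy)
print(capped)
```

For this vector the upper fence is 29.5, so only the value 100 is replaced, and it becomes the 95th percentile, 23.05. The cap itself is computed from data that still contains the outlier, so it is pulled upwards by it.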
hero_stats$BMI <- hero_stats$Weight/(hero_stats$Height^2)
range(hero_stats$BMI)
[1] 4.620109 1138.691828
# Check for outliers in BMI
par(mfrow=c(1,2))
hero_stats$BMI %>% as.numeric() %>% boxplot(main ='BMI Boxplot with outliers', ylab='Body Mass Index')
hero_stats$BMI %>% as.numeric() %>% boxplot(main ='BMI Boxplot without outliers', ylab='Body Mass Index',outline = FALSE)
Since some outliers still exist in BMI, let us explore these cases.
# Check the cases with high BMI
hero_stats[hero_stats$BMI >100,]$name %>% unique()
[1] "Anti-Monitor" "Bloodwraith" "Giganta" "King Kong" "T-800" "T-850" "T-X" "Utgard-Loki"
hero_stats[hero_stats$BMI >100,]$Race %>% unique()
[1] God / Eternal Alien Animal Cyborg Frost Giant
60 Levels: Alien Alpha Amazon Android Animal Asgardian Atlantean Bizarro Bolovaxian Clone Cosmic Entity Cyborg ... Zombie
plot(hero_stats$Height, hero_stats$Weight)
As we can see, Height and Weight show a roughly linear relationship, especially for heights between 1.5 and 2.2 metres, but it can be improved further.
We use the Box-Cox transformation to normalise the Height and Weight data.
par(mfrow=c(1,2))
height <- BoxCox(hero_stats$Height, lambda = 'auto')
height %>% hist(main='Transformed Height Data')
weight <- BoxCox(hero_stats$Weight, lambda = 'auto')
weight %>% hist(main='Transformed Weight Data')
plot(height,weight)
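For intuition, the Box-Cox transform for positive data and a fixed lambda is (x^lambda - 1) / lambda, with the logarithm as the special case lambda = 0. The sketch below is a hand-written illustration with a hypothetical helper, `box_cox_sketch`. It does not reproduce the automatic choice of lambda that `lambda = 'auto'` performs above.

```r
# Hypothetical helper: Box-Cox transform for positive data and a fixed lambda
box_cox_sketch <- function(x, lambda) {
  stopifnot(all(x > 0))   # this sketch only handles strictly positive values
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

skewed <- c(1, 2, 4, 8, 16, 32)   # made-up right-skewed values

log_scale  <- box_cox_sketch(skewed, 0)     # lambda = 0: log transform
sqrt_scale <- box_cox_sketch(skewed, 0.5)   # lambda = 0.5: square-root-like

print(log_scale)
print(sqrt_scale)
```

Smaller values of lambda compress the right tail more strongly, which is how the transformation reduces right skew.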
# Keeping only the abilities that each character actually has
hero_stats <- hero_stats %>% filter(Exist == TRUE)
# Removing variable Exist (dropped by name rather than by position)
hero_stats$Exist <- NULL
# Dimensions of the dataset
dim(hero_stats)
[1] 5966 12
# dataset head
head(hero_stats[,1:8])
head(hero_stats[,8:12])
# Barplot for superpower count
Power_stats<- hero_stats %>% group_by(`Ability Type`) %>% summarise( PowerUsers = n()) %>% arrange(desc(PowerUsers) )
barplot(Power_stats$PowerUsers[1:5], names.arg = Power_stats$`Ability Type`[1:5], main = '5 Most Common Abilities')
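The counting step behind the bar plot can also be sketched in base R. The toy long-format data below is made up, and `power_counts` is an illustrative name, not an object from the analysis.

```r
# Toy long-format data after keeping only existing abilities (made-up values)
toy <- data.frame(name = c("A", "A", "B", "C", "C"),
                  ability = c("Flight", "Magic", "Flight", "Flight", "Stamina"),
                  stringsAsFactors = FALSE)

# Count characters per ability and sort from most to least common
power_counts <- sort(table(toy$ability), decreasing = TRUE)
top_two <- head(power_counts, 2)
print(power_counts)
```

Because each remaining row is one (character, ability) pair, the row count per ability equals the number of characters who have it.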