Required packages
library(readr)
library(dplyr)
library(tidyr)
library(forecast)
library(Hmisc)
library(outliers)
library(ggplot2)
library(editrules)
Executive Summary
This report works with two datasets that describe superheroes: one records their physical characteristics and the other their power statistics. The Alignment column, which appears in both files, was dropped from the stats dataset to avoid a duplicate column, and the two datasets were then merged on the superhero name. Inspecting the structure of the merged data showed 9 numeric and 8 character variables. The categorical character variables were converted into factors by defining their levels and, where needed, assigning labels. The dataset already conforms to the tidy data principles, so no tidying was required. A new variable, ‘BMI’, was created with the mutate function as the ratio of weight to height, intended as a rough indicator of how fit a superhero is.
The data were then scanned for missing values, special values and obvious errors or inconsistencies using appropriate functions. Missing values in the categorical variables were classified as ‘unknown/not specified’, whereas missing values in the numeric variables were imputed with the mean of the corresponding variable. The numeric variables were checked for outliers with boxplots, and values beyond the fences were capped: values below the lower fence were replaced with the 5th percentile and values above the upper fence with the 95th percentile. Finally, for the transformation task, z-score standardisation was applied to selected variables to centre them and bring them onto a common scale. A scatterplot of height against weight showed a positive linear relationship between the two variables.
Data
Superheroes Dataset
The two datasets used for data preprocessing contain information about different superheroes and their stats. The first dataset describes each superhero’s appearance, body measurements and related attributes. It has about 734 observations and 10 variables. Variable descriptions are as follows:
Name: Name of the superhero
Gender: Gender of the superhero
Eye color: Eye color of the superhero
Race: Species of the superhero
Hair color: Hair color of the superhero
Height: Height of the superhero (in cm)
Publisher: Publisher of the superhero
Skin color: Skin color of the superhero
Alignment: Superhero’s nature (good, bad or neutral)
Weight: Weight of the superhero (in kg)
The second dataset records superhero attributes such as strength level and speed level. It has about 611 observations and 9 variables. Variable descriptions are as follows:
Name: Name of the superhero
Alignment: Superhero’s nature (good, bad or neutral)
Intelligence: Intelligence stats of the superhero
Strength: Strength stats of the superhero
Speed: Speed stats of the superhero
Durability: Durability stats of the superhero
Power: Power level of the superhero
Combat: Combat level of the superhero
Total: Total combined stats of the superhero
Data source:
https://www.kaggle.com/claudiodavi/superhero-set
https://www.kaggle.com/magshimimsummercamp/superheroes-info-and-stats#superheroes_stats.csv
The superhero stats dataset also contains the ‘Alignment’ variable, so that column was dropped from the stats dataset before merging to avoid a duplicate. The two datasets were then merged on the superhero name with inner_join, a mutating join from the dplyr package, which keeps only the superheroes present in both datasets and brings all their details together.
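# Read the two source files
heroes_info <- read_csv("heroes_information.csv")
head(heroes_info)
heroes_stats <- read_csv("superheroes_stats.csv")
head(heroes_stats)
# Drop column 2 (Alignment), which is already present in heroes_info
heroes_stats1 <- heroes_stats[, c(1, 3:9)]
# Keep only the superheroes that appear in both datasets
heroes_combined <- inner_join(heroes_info, heroes_stats1, by = "Name")
head(heroes_combined)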
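Because an inner join silently drops unmatched rows and multiplies rows when a key is repeated, it can help to inspect both cases before merging. The following is a minimal sketch on made-up toy tables (the `info` and `stats` data frames and their values are illustrative assumptions, not the report's data):

```r
library(dplyr)

# Toy stand-ins for the two source tables (values are illustrative only)
info  <- data.frame(Name = c("A", "B", "B", "C"), Height = c(180, 175, 175, 190))
stats <- data.frame(Name = c("A", "B", "D"), Total = c(300, 250, 400))

# Rows of each table with no partner in the other; inner_join() drops these
unmatched_info  <- anti_join(info, stats, by = "Name")
unmatched_stats <- anti_join(stats, info, by = "Name")

# Key values that occur more than once; each duplicate multiplies joined rows
dup_names <- info %>% count(Name) %>% filter(n > 1)

# The join itself: "C" and "D" disappear, "B" appears twice
joined <- inner_join(info, stats, by = "Name")
```

Applied to the real tables, the same three checks would indicate how many superheroes are lost in the merge and whether any name is matched more than once.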
Understand
- The structure of the dataset and the type of each variable were identified using the ‘str’ function. The dataset comprises 8 character and 9 numeric variables. Certain character variables, such as gender and hair color, were converted into factors by defining the levels and, where appropriate, labels using the factor function.
str(heroes_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 600 obs. of 17 variables:
$ Name : chr "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
$ Gender : chr "Male" "Male" "Male" "Male" ...
$ Eye color : chr "yellow" "blue" "blue" "green" ...
$ Race : chr "Human" "Icthyo Sapien" "Ungaran" "Human / Radiation" ...
$ Hair color : chr "No Hair" "No Hair" "No Hair" "No Hair" ...
$ Height : num 203 191 185 203 -99 -99 185 178 191 188 ...
$ Publisher : chr "Marvel Comics" "Dark Horse Comics" "DC Comics" "Marvel Comics" ...
$ Skin color : chr "-" "blue" "red" "-" ...
$ Alignment : chr "good" "good" "good" "bad" ...
$ Weight : num 441 65 90 441 -99 -99 88 81 104 108 ...
$ Intelligence: num 38 88 50 63 88 63 NA 10 75 50 ...
$ Strength : num 100 14 90 80 100 10 NA 8 28 85 ...
$ Speed : num 17 35 53 53 83 12 NA 13 38 100 ...
$ Durability : num 80 42 64 90 99 100 NA 5 80 85 ...
$ Power : num 17 35 84 55 100 71 NA 5 72 100 ...
$ Combat : num 64 85 65 95 56 64 NA 20 95 40 ...
$ Total : num 316 299 406 436 526 320 NA 61 388 460 ...
table(sapply(heroes_combined, class))
character numeric
8 9
attributes(heroes_combined[1:17, ])
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
$class
[1] "tbl_df" "tbl" "data.frame"
$names
[1] "Name" "Gender" "Eye color" "Race" "Hair color" "Height" "Publisher" "Skin color" "Alignment"
[10] "Weight" "Intelligence" "Strength" "Speed" "Durability" "Power" "Combat" "Total"
heroes_combined$Gender <- factor(heroes_combined$Gender,levels = c('Male','Female'))
heroes_combined$`Eye color` <- factor(heroes_combined$`Eye color`,levels = c('amber','black','blue','blue / white','brown','gold','green','green / blue','grey','hazel','indigo','purple','red','silver','violet','white','white / red','yellow'
,'yellow (without irises)','yellow / blue','yellow / red'),labels=c('amber','black','blue','blue&white','brown','gold','green','green & blue', 'grey','hazel','indigo','purple','red','silver','violet','white','white & red','yellow'
,'yellow(No iris)','yellow & blue','yellow & red'))
heroes_combined$Race <- factor(heroes_combined$Race,levels=c('Alien','Alpha','Amazon','Android','Animal','Asgardian','Atlantean','Bizarro','Bolovaxian','Clone','Cosmic Entity','Cyborg','Czarnian','Dathomirian Zabrak','Demi-God','Demon','Eternal','Flora Colossus','Frost Giant','God / Eternal','Gorilla','Gungan','Human','Human / Altered', 'Human / Clone','Human / Cosmic','Human / Radiation','Human-Kree','Human-Spartoi','Human-Vulcan','Human-Vuldarian','Icthyo Sapien','Inhuman','Kaiju','Kakarantharaian','Korugaran','Kryptonian','Luphomoid','Maiar','Martian','Metahuman','Mutant','Mutant / Clone','New God','Neyaphem','Parademon','Planet','Rodian','Saiyan','Spartoi','Strontian','Symbiote','Talokite','Tamaranean','Ungaran','Vampire','Xenomorph XX121','Yautja','Yoda species','Zen-Whoberian','Zombie'))
heroes_combined$`Hair color` <- factor(heroes_combined$`Hair color`,levels=c('Auburn','Black / Blue','Blond','Blue','Brown','Brown / Black','Brown / White','Gold','Green','Grey','Indigo','Magenta','No Hair','Orange','Orange / White','Pink','Purple','Red','Red / Grey','Red / Orange','Red / White','Silver','Strawberry Blond','White','Yellow'))
heroes_combined$Publisher <- factor(heroes_combined$Publisher,levels=c('ABC Studios','Dark Horse Comics','DC Comics','George Lucas','Hanna-Barbera','HarperCollins','Icon Comics','IDW Publishing','Image Comics','J. K. Rowling','J. R. R. Tolkien','Marvel Comics','Microsoft','NBC - Heroes','Rebellion','Shueisha','Sony Pictures','South Park','Star Trek','SyFy','Team Epic TV','Titan Books','Universal Studios','Wildstorm'))
heroes_combined$`Skin color` <- factor(heroes_combined$`Skin color`,levels=c('black','blue','blue-white','gold','gray','green','grey','orange','orange / white','pink','purple','red','red / black','silver','white','yellow'))
heroes_combined$Alignment <- factor(heroes_combined$Alignment,levels=c('good','bad','neutral'))
str(heroes_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 600 obs. of 17 variables:
$ Name : chr "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
$ Gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
$ Eye color : Factor w/ 21 levels "amber","black",..: 18 3 3 7 3 3 3 5 NA 3 ...
$ Race : Factor w/ 61 levels "Alien","Alpha",..: 23 32 55 27 11 NA 23 23 NA NA ...
$ Hair color : Factor w/ 25 levels "Auburn","Black / Blue",..: 13 13 13 13 NA 3 3 5 NA 24 ...
$ Height : num 203 191 185 203 -99 -99 185 178 191 188 ...
$ Publisher : Factor w/ 24 levels "ABC Studios",..: 12 2 3 12 12 14 3 12 12 12 ...
$ Skin color : Factor w/ 16 levels "black","blue",..: NA 2 12 NA NA NA NA NA NA NA ...
$ Alignment : Factor w/ 3 levels "good","bad","neutral": 1 1 1 2 2 1 1 1 1 2 ...
$ Weight : num 441 65 90 441 -99 -99 88 81 104 108 ...
$ Intelligence: num 38 88 50 63 88 63 NA 10 75 50 ...
$ Strength : num 100 14 90 80 100 10 NA 8 28 85 ...
$ Speed : num 17 35 53 53 83 12 NA 13 38 100 ...
$ Durability : num 80 42 64 90 99 100 NA 5 80 85 ...
$ Power : num 17 35 84 55 100 71 NA 5 72 100 ...
$ Combat : num 64 85 65 95 56 64 NA 20 95 40 ...
$ Total : num 316 299 406 436 526 320 NA 61 388 460 ...
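One side effect of passing an explicit `levels` vector to `factor()` is that any value not listed becomes `NA`, which is why the "-" entries in Skin color show up as `NA` in the second `str` output. A minimal sketch on a toy vector (the values are illustrative):

```r
# Toy character vector with a placeholder value "-"
x <- c("-", "blue", "red", "-")

# Only "blue" and "red" are declared as levels
f <- factor(x, levels = c("blue", "red"))

# Values absent from the level list are converted to NA
n_lost <- sum(is.na(f))

# Raw values that would be lost, useful to review before converting
not_listed <- setdiff(unique(x), levels(f))
```

Running `setdiff()` like this on each column before the conversion shows which raw values are about to be treated as missing.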
Tidy & Manipulate Data I
- For the dataset to be tidy, it must satisfy the tidy data principles, i.e.
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
From the dataset, we can observe that each of these rules is satisfied, so no tidying functions need to be applied.
Tidy & Manipulate Data II
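- A new variable ‘BMI’ has been created for each superhero by dividing the Weight variable by the Height variable. It is intended as a rough indicator of how fit the superhero is. Note that this is a simple weight-to-height ratio, not the standard body mass index (weight in kg divided by the square of height in m).
heroes_combined <- mutate(heroes_combined, BMI = Weight / Height)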
head(heroes_combined)
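For comparison, the standard body mass index divides weight in kg by the square of height in metres. The sketch below uses a toy data frame and assumes the placeholder values have already been recoded to `NA`. It is an illustration of the formula, not the calculation used in this report.

```r
# Toy data: heights in cm, weights in kg, one missing height
toy <- data.frame(Height = c(180, NA, 200), Weight = c(81, 90, 100))

# Standard BMI: kg / m^2, so convert cm to m first
toy$BMI <- toy$Weight / (toy$Height / 100)^2

# Why placeholders should be recoded first: the ratio of two -99 codes
# is 1, which looks plausible but carries no information
sentinel_ratio <- -99 / -99
```

With missing heights stored as `NA`, the derived value is `NA` as well instead of a misleading number.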
Scan I
- The dataset is scanned for missing values, special values and obvious errors or inconsistencies using appropriate functions. There are missing values in both the factor and the numeric variables, 2495 in total. It can also be observed that some missing values in the Height and Weight variables are coded as -99. These have been converted into NA using the na_if function.
#Checking for NA values in the dataset
colSums(is.na(heroes_combined))
Name         Gender       Eye color    Race         Hair color   Height       Publisher    Skin color   Alignment    Weight       Intelligence
0            24           139          299          271          0            7            556          3            0            170
Strength     Speed        Durability   Power        Combat       Total        BMI
169          171          172          172          170          172          0
sum(is.na(heroes_combined))
[1] 2495
#Checking for infinite and NaN values
# Flag infinite or NaN entries; non-numeric columns cannot contain any
is.special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}
sapply(heroes_combined, function(x) sum(is.special(x)))
Name         Gender       Eye color    Race         Hair color   Height       Publisher    Skin color   Alignment    Weight       Intelligence
0            0            0            0            0            0            0            0            0            0            0
Strength     Speed        Durability   Power        Combat       Total        BMI
0            0            0            0            0            0            0
#Checking for obvious errors and inconsistencies
(Rule1 <- editset(c("Height >= 0","Weight >=0")))
Edit set:
num1 : 0 <= Height
num2 : 0 <= Weight
Violated <- violatedEdits(Rule1,heroes_combined)
summary(Violated)
Edit violations, 600 observations, 0 completely missing (0%):
Edit violations per record:
#Recoding the -99 placeholder values as NA
heroes_combined$Weight <- heroes_combined$Weight %>% na_if(-99)
heroes_combined$Height <- heroes_combined$Height %>% na_if(-99)
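The non-negativity rule and the recoding step can also be sketched without editrules. The snippet below is a minimal illustration on a toy data frame. It assumes dplyr 1.0 or later for `across()`, and the `toy` values are made up.

```r
library(dplyr)

# Toy data with -99 used as a missing-value placeholder
toy <- data.frame(Height = c(203, -99, 185), Weight = c(441, -99, -99))

# Count rule violations (negative measurements) in each column
n_negative <- colSums(toy[c("Height", "Weight")] < 0, na.rm = TRUE)

# Recode the placeholder in both columns in one step
toy <- toy %>% mutate(across(c(Height, Weight), ~ na_if(.x, -99)))

# After recoding, every former violation should be an NA
n_missing <- colSums(is.na(toy))
```

Comparing `n_negative` with `n_missing` confirms that each violating entry was a -99 placeholder and not some other negative value.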
- To deal with the missing values in the categorical variables, a new level (‘Not Specified’ for Gender, ‘Unknown’ for the others) was added to each factor and assigned to the missing entries. For the numeric variables, the number of missing values is large, so dropping the affected rows would discard a substantial share of the dataset.
- The missing values in the numeric variables have therefore been imputed with the mean of each variable using the impute function from the ‘Hmisc’ package. colSums(is.na()) and sum(is.na()) are then used to verify that no missing values remain in the dataset.
# Handling missing values for the variables
sum(is.na(heroes_combined$Gender))
[1] 24
levels(heroes_combined$Gender) = c(levels(heroes_combined$Gender), "Not Specified")
heroes_combined$Gender[is.na(heroes_combined$Gender)] <- "Not Specified"
sum(is.na(heroes_combined$`Eye color`))
[1] 139
levels(heroes_combined$`Eye color`) = c(levels(heroes_combined$`Eye color`), "Unknown")
heroes_combined$`Eye color`[is.na(heroes_combined$`Eye color`)] <- "Unknown"
sum(is.na(heroes_combined$`Race`))
[1] 299
levels(heroes_combined$`Race`) = c(levels(heroes_combined$`Race`), "Unknown")
heroes_combined$`Race`[is.na(heroes_combined$`Race`)] <- "Unknown"
sum(is.na(heroes_combined$`Hair color`))
[1] 271
levels(heroes_combined$`Hair color`) = c(levels(heroes_combined$`Hair color`), "Unknown")
heroes_combined$`Hair color`[is.na(heroes_combined$`Hair color`)] <- "Unknown"
sum(is.na(heroes_combined$`Publisher`))
[1] 7
levels(heroes_combined$`Publisher`) = c(levels(heroes_combined$`Publisher`), "Unknown")
heroes_combined$`Publisher`[is.na(heroes_combined$`Publisher`)] <- "Unknown"
sum(is.na(heroes_combined$`Skin color`))
[1] 556
levels(heroes_combined$`Skin color`) = c(levels(heroes_combined$`Skin color`), "Unknown")
heroes_combined$`Skin color`[is.na(heroes_combined$`Skin color`)] <- "Unknown"
sum(is.na(heroes_combined$`Alignment`))
[1] 3
levels(heroes_combined$`Alignment`) = c(levels(heroes_combined$`Alignment`), "Unknown")
heroes_combined$`Alignment`[is.na(heroes_combined$`Alignment`)] <- "Unknown"
#Imputing the values with Mean for numeric variables
heroes_combined$Weight<-impute(heroes_combined$Weight,fun = mean)
heroes_combined$Height<-impute(heroes_combined$Height,fun = mean)
heroes_combined$Intelligence<-impute(heroes_combined$Intelligence,fun = mean)
heroes_combined$Strength<-impute(heroes_combined$Strength,fun = mean)
heroes_combined$Speed<-impute(heroes_combined$Speed,fun = mean)
heroes_combined$Durability<-impute(heroes_combined$Durability,fun = mean)
heroes_combined$Power<-impute(heroes_combined$Power,fun = mean)
heroes_combined$Combat<-impute(heroes_combined$Combat,fun = mean)
heroes_combined$Total<-impute(heroes_combined$Total,fun = mean)
heroes_combined$BMI<-impute(heroes_combined$BMI,fun = mean)
colSums(is.na(heroes_combined))
Name         Gender       Eye color    Race         Hair color   Height       Publisher    Skin color   Alignment    Weight       Intelligence
0            0            0            0            0            0            0            0            0            0            0
Strength     Speed        Durability   Power        Combat       Total        BMI
0            0            0            0            0            0            0
sum(is.na(heroes_combined))
[1] 0
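The per-column steps above follow two repeated patterns, which could be expressed as small helpers and applied to all columns of a given type. This is a sketch under assumptions: `na_to_level` and `impute_mean` are hypothetical helper names, the toy data are made up, and base R replacement is used in place of Hmisc's impute function.

```r
library(dplyr)

# Add an explicit level for missing values in a factor (hypothetical helper)
na_to_level <- function(f, label = "Unknown") {
  levels(f) <- c(levels(f), label)
  f[is.na(f)] <- label
  f
}

# Replace missing numbers with the mean of the observed values
# (hypothetical helper, base R only)
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data with one missing value of each kind
toy <- data.frame(
  Alignment = factor(c("good", NA, "bad")),
  Speed     = c(10, NA, 30)
)

# Apply each helper to every column of the matching type
toy <- toy %>%
  mutate(across(where(is.factor), na_to_level),
         across(where(is.numeric), impute_mean))
```

Mean imputation leaves each column mean unchanged but places every imputed value at a single point, which reduces the variance of the column. This is worth keeping in mind when many values are imputed.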
Scan II
- To check for outliers, a boxplot has been drawn for each numeric variable. With the missing values handled in the previous step, Tukey’s method can be applied: points beyond the boxplot fences are treated as outliers. The boxplots show outliers in 7 of the numeric variables.
- A cap function has been used to replace values below the lower fence (Q1 - 1.5 x IQR) with the 5th percentile and values above the upper fence (Q3 + 1.5 x IQR) with the 95th percentile. The summary function is then used to check the summary statistics of the numeric variables after capping.
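As a numeric complement to reading the boxplots, the number of points beyond the Tukey fences can be counted for each variable. The sketch below uses a hypothetical helper, `count_outliers`, on toy data. It relies on `quantile()`, whose quartiles can differ slightly from the hinges that `boxplot()` draws.

```r
# Count points outside the Tukey fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

# Toy data: column a has one extreme value, column b has none
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

# One count per column
res <- sapply(toy, count_outliers)
```

Applied to the numeric columns of the real data, a non-zero count identifies the variables that need capping.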
boxplot(as.numeric(heroes_combined$Weight),main="Boxplot of Superhero's weight",ylab="Weight",col="grey")

boxplot(as.numeric(heroes_combined$Height),main="Boxplot of Superhero's height",ylab="Height",col="grey")

boxplot(as.numeric(heroes_combined$Intelligence),main="Boxplot of Superhero's intelligence",ylab="Intelligence",col="grey")

boxplot(as.numeric(heroes_combined$Strength),main="Boxplot of Superhero's strength",ylab="Strength",col="grey")

boxplot(as.numeric(heroes_combined$Speed),main="Boxplot of Superhero's speed",ylab="Speed",col="grey")

boxplot(as.numeric(heroes_combined$Durability),main="Boxplot of Superhero's durability",ylab="Durability",col="grey")

boxplot(as.numeric(heroes_combined$Power),main="Boxplot of Superhero's power",ylab="Power",col="grey")

boxplot(as.numeric(heroes_combined$Combat),main="Boxplot of Superhero's combat",ylab="Combat",col="grey")

boxplot(as.numeric(heroes_combined$Total),main="Boxplot of Superhero's total stats",ylab="Total stats",col="grey")

boxplot(as.numeric(heroes_combined$BMI),main="Boxplot of Superhero's BMI",ylab="BMI",col="grey")

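# Capping outliers for the 7 numeric variables that have outliers
cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  # Values below the lower fence (Q1 - 1.5 * IQR) become the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  # Values above the upper fence (Q3 + 1.5 * IQR) become the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}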
heroes_combined$Weight <- heroes_combined$Weight %>% cap()
boxplot(as.numeric(heroes_combined$Weight),main="Boxplot of Superhero's weight",ylab="Weight",col="blue")

heroes_combined$Height <- heroes_combined$Height %>% cap()
boxplot(as.numeric(heroes_combined$Height ),main="Boxplot of Superhero's height",ylab="Height",col="blue")

heroes_combined$Intelligence <- heroes_combined$Intelligence %>% cap()
boxplot(as.numeric(heroes_combined$Intelligence),main="Boxplot of Superhero's intelligence",ylab="Intelligence",col="blue")

heroes_combined$Speed <- heroes_combined$Speed %>% cap()
boxplot(as.numeric(heroes_combined$Speed),main="Boxplot of Superhero's Speed",ylab="Speed",col="blue")

heroes_combined$Combat <- heroes_combined$Combat %>% cap()
boxplot(as.numeric(heroes_combined$Combat),main="Boxplot of Superhero's Combat",ylab="Combat",col="blue")

heroes_combined$Total <- heroes_combined$Total %>% cap()
boxplot(as.numeric(heroes_combined$Total),main="Boxplot of Superhero's Total stats",ylab="Total Stats",col="blue")

heroes_combined$BMI <- heroes_combined$BMI %>% cap()
boxplot(as.numeric(heroes_combined$BMI),main="Boxplot of Superhero's BMI",ylab="BMI",col="blue")
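To make the effect of capping concrete, here is a small worked example on a toy vector. The function body repeats the cap logic used above so that the snippet runs on its own, and the input values are illustrative.

```r
# Same capping logic as above, repeated so the example is self-contained
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Nine ordinary values and one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the upper fence is 7.75 + 6.75 = 14.5.
# Only 100 lies beyond it and is replaced by the 95th percentile (59.05).
capped <- cap(x)
```

In a sample this small, the 95th percentile is itself pulled up by the extreme value, so the capped value (59.05) still lies above the upper fence (14.5). In larger samples the effect is weaker, but it explains why boxplots drawn after capping can still show points beyond the whiskers.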

#Summarizing the numeric variables after capping
summary(heroes_combined[, c(6, 10:18)])
172 values imputed to 187.8131
180 values imputed to 114.2619
170 values imputed to 63.03721
169 values imputed to 41.35499
171 values imputed to 38.70629
172 values imputed to 59.89019
172 values imputed to 57.62617
170 values imputed to 60.77674
172 values imputed to 321.9346
Height           Weight           Intelligence     Strength         Speed            Durability       Power            Combat
Min.   :163.0    Min.   : 16.0    Min.   : 25.00   Min.   :  4.00   Min.   : 8.00    Min.   :  5.00   Min.   :  5.00   Min.   :28.00
1st Qu.:178.0    1st Qu.: 74.0    1st Qu.: 50.00   1st Qu.: 12.00   1st Qu.:25.00    1st Qu.: 42.00   1st Qu.: 43.00   1st Qu.:56.00
Median :187.8    Median :101.0    Median : 63.04   Median : 41.35   Median :38.71    Median : 59.89   Median : 57.63   Median :60.78
Mean   :184.5    Mean   :106.3    Mean   : 63.54   Mean   : 41.35   Mean   :38.32    Mean   : 59.89   Mean   : 57.63   Mean   :60.98
3rd Qu.:188.0    3rd Qu.:114.3    3rd Qu.: 75.00   3rd Qu.: 55.00   3rd Qu.:42.00    3rd Qu.: 80.00   3rd Qu.: 69.00   3rd Qu.:70.00
Max.   :211.0    Max.   :249.1    Max.   :100.00   Max.   :100.00   Max.   :83.00    Max.   :120.00   Max.   :100.00   Max.   :95.00
Total            BMI
Min.   :147.0    Min.   :-0.4648
1st Qu.:271.8    1st Qu.: 0.4000
Median :321.9    Median : 0.5277
Mean   :320.7    Mean   : 0.6718
3rd Qu.:359.0    3rd Qu.: 1.0000
Max.   :491.1    Max.   : 1.8182
