Required packages

library(readr)
library(dplyr)
library(tidyr)
library(forecast)
library(Hmisc)
library(outliers)
library(ggplot2)
library(editrules)

Executive Summary

This report works with two datasets that describe superheroes: one covering their physical characteristics and one covering their power statistics. The two datasets were merged on the superhero name, and all preprocessing was performed on the merged dataset. Before the merge, the Alignment column was dropped from the stats dataset to avoid a duplicate column. Inspecting the structure of the merged dataset showed 9 numeric and 8 character variables. Seven of the character variables were then converted into factors by defining their levels and, where needed, assigning labels. The dataset already conforms to the tidy data principles, so no tidying was required. A new variable, ‘BMI’, was created with the mutate function as the ratio of weight to height, intended as a rough indicator of how fit a superhero is. The dataset was then scanned for missing values, special values, and obvious errors or inconsistencies. Missing values in the categorical variables were recoded as ‘Unknown’ or ‘Not Specified’, whereas missing values in the numeric variables were imputed with the mean of the corresponding variable. The numeric variables were scanned for outliers using boxplots, and values lying beyond the boxplot fences were capped at the 5th or 95th percentile. For the transformation task, z-score standardization was applied to Height to centre it at zero and rescale it to unit standard deviation. Finally, a scatterplot of height against weight with a fitted regression line showed a positive linear relationship between the two variables.

Data

Superheroes Dataset
The two datasets used for data preprocessing contain information about different superheroes and their stats. The first dataset describes each superhero’s appearance and body measurements. It has 734 observations and 10 variables. Variable descriptions are as follows:
Name: Name of the superhero
Gender: Gender of the superhero
Eye color: Eye color of the superhero
Race: Species of the superhero
Hair color: Hair color of the superhero
Height: Height of the superhero (in cm)
Publisher: Publisher of the comic or franchise the superhero belongs to
Skin color: Skin color of the superhero
Alignment: Moral alignment of the superhero (good, bad or neutral)
Weight: Weight of the superhero (in kg)

The second dataset records superhero attributes such as strength and speed levels. It has 611 observations and 9 variables. Variable descriptions are as follows:
Name: Name of the superhero
Alignment: Moral alignment of the superhero (good, bad or neutral)
Intelligence: Intelligence stats of the superhero
Strength: Strength stats of the superhero
Speed: Speed stats of the superhero
Durability: Durability stats of the superhero
Power: Power level of the superhero
Combat: Combat level of the superhero
Total: Total combined stats of the superhero

Data source:
https://www.kaggle.com/claudiodavi/superhero-set
https://www.kaggle.com/magshimimsummercamp/superheroes-info-and-stats#superheroes_stats.csv

Both datasets contain an ‘Alignment’ variable, so it was excluded from the stats dataset before merging to avoid a duplicate column. The two datasets were then merged on the superhero name using inner_join, a mutating join from the dplyr package, which keeps only the superheroes that appear in both datasets.

heroes_info <- read_csv("heroes_information.csv")
head(heroes_info)
heroes_stats <- read_csv("superheroes_stats.csv")
head(heroes_stats)
# Drop the duplicate Alignment column (column 2) from the stats dataset
heroes_stats1 <- heroes_stats[, c(1, 3:9)]
# Keep only the superheroes present in both datasets
heroes_combined <- inner_join(heroes_info, heroes_stats1, by = "Name")
head(heroes_combined)
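
The effect of the inner join can be checked on a toy example. The sketch below uses base R's merge(), which performs an inner join by default, on two small hypothetical data frames (info_toy and stats_toy are made up for illustration and are not the real superhero data). Only names present in both tables survive, which is consistent with the merged dataset having fewer rows (600) than either input.

```r
# Hypothetical toy tables, not the real superhero data
info_toy  <- data.frame(Name = c("A", "B", "C"), Height = c(180, 175, 190),
                        stringsAsFactors = FALSE)
stats_toy <- data.frame(Name = c("B", "C", "D"), Speed = c(40, 55, 70),
                        stringsAsFactors = FALSE)

# merge() keeps only rows whose Name appears in both tables (inner join)
joined <- merge(info_toy, stats_toy, by = "Name")
print(joined)

# Names that were dropped because they have no match in the other table
dropped <- setdiff(union(info_toy$Name, stats_toy$Name), joined$Name)
print(dropped)
```

Listing the dropped names in this way is a quick check that the join is not silently discarding more rows than expected.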

Understand

str(heroes_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    600 obs. of  17 variables:
 $ Name        : chr  "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
 $ Gender      : chr  "Male" "Male" "Male" "Male" ...
 $ Eye color   : chr  "yellow" "blue" "blue" "green" ...
 $ Race        : chr  "Human" "Icthyo Sapien" "Ungaran" "Human / Radiation" ...
 $ Hair color  : chr  "No Hair" "No Hair" "No Hair" "No Hair" ...
 $ Height      : num  203 191 185 203 -99 -99 185 178 191 188 ...
 $ Publisher   : chr  "Marvel Comics" "Dark Horse Comics" "DC Comics" "Marvel Comics" ...
 $ Skin color  : chr  "-" "blue" "red" "-" ...
 $ Alignment   : chr  "good" "good" "good" "bad" ...
 $ Weight      : num  441 65 90 441 -99 -99 88 81 104 108 ...
 $ Intelligence: num  38 88 50 63 88 63 NA 10 75 50 ...
 $ Strength    : num  100 14 90 80 100 10 NA 8 28 85 ...
 $ Speed       : num  17 35 53 53 83 12 NA 13 38 100 ...
 $ Durability  : num  80 42 64 90 99 100 NA 5 80 85 ...
 $ Power       : num  17 35 84 55 100 71 NA 5 72 100 ...
 $ Combat      : num  64 85 65 95 56 64 NA 20 95 40 ...
 $ Total       : num  316 299 406 436 526 320 NA 61 388 460 ...
table(sapply(heroes_combined, class))

character   numeric 
        8         9 
attributes(heroes_combined[1:17, ])
$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

$class
[1] "tbl_df"     "tbl"        "data.frame"

$names
 [1] "Name"         "Gender"       "Eye color"    "Race"         "Hair color"   "Height"       "Publisher"    "Skin color"   "Alignment"   
[10] "Weight"       "Intelligence" "Strength"     "Speed"        "Durability"   "Power"        "Combat"       "Total"       
# Convert character variables to factors. Any value that is not listed in
# levels (for example the "-" placeholder) becomes NA.
heroes_combined$Gender <- factor(heroes_combined$Gender,levels = c('Male','Female'))
heroes_combined$`Eye color` <- factor(heroes_combined$`Eye color`,
  levels = c('amber','black','blue','blue / white','brown','gold','green','green / blue','grey','hazel','indigo','purple','red','silver','violet','white','white / red','yellow','yellow (without irises)','yellow / blue','yellow / red'),
  labels = c('amber','black','blue','blue & white','brown','gold','green','green & blue','grey','hazel','indigo','purple','red','silver','violet','white','white & red','yellow','yellow (no iris)','yellow & blue','yellow & red'))
heroes_combined$Race <- factor(heroes_combined$Race,levels=c('Alien','Alpha','Amazon','Android','Animal','Asgardian','Atlantean','Bizarro','Bolovaxian','Clone','Cosmic Entity','Cyborg','Czarnian','Dathomirian Zabrak','Demi-God','Demon','Eternal','Flora Colossus','Frost Giant','God / Eternal','Gorilla','Gungan','Human','Human / Altered', 'Human / Clone','Human / Cosmic','Human / Radiation','Human-Kree','Human-Spartoi','Human-Vulcan','Human-Vuldarian','Icthyo Sapien','Inhuman','Kaiju','Kakarantharaian','Korugaran','Kryptonian','Luphomoid','Maiar','Martian','Metahuman','Mutant','Mutant / Clone','New God','Neyaphem','Parademon','Planet','Rodian','Saiyan','Spartoi','Strontian','Symbiote','Talokite','Tamaranean','Ungaran','Vampire','Xenomorph XX121','Yautja','Yoda species','Zen-Whoberian','Zombie'))
heroes_combined$`Hair color` <- factor(heroes_combined$`Hair color`,levels=c('Auburn','Black / Blue','Blond','Blue','Brown','Brown / Black','Brown / White','Gold','Green','Grey','Indigo','Magenta','No Hair','Orange','Orange / White','Pink','Purple','Red','Red / Grey','Red / Orange','Red / White','Silver','Strawberry Blond','White','Yellow'))
heroes_combined$Publisher <- factor(heroes_combined$Publisher,levels=c('ABC Studios','Dark Horse Comics','DC Comics','George Lucas','Hanna-Barbera','HarperCollins','Icon Comics','IDW Publishing','Image Comics','J. K. Rowling','J. R. R. Tolkien','Marvel Comics','Microsoft','NBC - Heroes','Rebellion','Shueisha','Sony Pictures','South Park','Star Trek','SyFy','Team Epic TV','Titan Books','Universal Studios','Wildstorm'))
heroes_combined$`Skin color` <- factor(heroes_combined$`Skin color`,levels=c('black','blue','blue-white','gold','gray','green','grey','orange','orange / white','pink','purple','red','red / black','silver','white','yellow'))  

heroes_combined$Alignment <- factor(heroes_combined$Alignment,levels=c('good','bad','neutral'))
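
The NA values that appear in the factor columns after this step come from how factor() treats values that are not listed in levels. A minimal sketch on hypothetical values (the vector raw is made up for illustration):

```r
# Hypothetical raw values; "-" stands for an unrecorded entry
raw <- c("Male", "-", "Female", "Male")

# Only "Male" and "Female" are declared as levels
gender <- factor(raw, levels = c("Male", "Female"))
print(gender)

# Values outside the declared levels are now NA
n_missing <- sum(is.na(gender))
print(n_missing)
```

This is why listing the levels explicitly doubles as a way of flagging placeholder entries as missing.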

str(heroes_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    600 obs. of  17 variables:
 $ Name        : chr  "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
 $ Gender      : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
 $ Eye color   : Factor w/ 21 levels "amber","black",..: 18 3 3 7 3 3 3 5 NA 3 ...
 $ Race        : Factor w/ 61 levels "Alien","Alpha",..: 23 32 55 27 11 NA 23 23 NA NA ...
 $ Hair color  : Factor w/ 25 levels "Auburn","Black / Blue",..: 13 13 13 13 NA 3 3 5 NA 24 ...
 $ Height      : num  203 191 185 203 -99 -99 185 178 191 188 ...
 $ Publisher   : Factor w/ 24 levels "ABC Studios",..: 12 2 3 12 12 14 3 12 12 12 ...
 $ Skin color  : Factor w/ 16 levels "black","blue",..: NA 2 12 NA NA NA NA NA NA NA ...
 $ Alignment   : Factor w/ 3 levels "good","bad","neutral": 1 1 1 2 2 1 1 1 1 2 ...
 $ Weight      : num  441 65 90 441 -99 -99 88 81 104 108 ...
 $ Intelligence: num  38 88 50 63 88 63 NA 10 75 50 ...
 $ Strength    : num  100 14 90 80 100 10 NA 8 28 85 ...
 $ Speed       : num  17 35 53 53 83 12 NA 13 38 100 ...
 $ Durability  : num  80 42 64 90 99 100 NA 5 80 85 ...
 $ Power       : num  17 35 84 55 100 71 NA 5 72 100 ...
 $ Combat      : num  64 85 65 95 56 64 NA 20 95 40 ...
 $ Total       : num  316 299 406 436 526 320 NA 61 388 460 ...

Tidy & Manipulate Data I

The tidy data principles are:
  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
From the dataset, we can observe that each of these rules is satisfied, so no tidying functions need to be applied.

Tidy & Manipulate Data II

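# Note: this is a simple weight-to-height ratio (kg per cm), not the standard
# body mass index, which divides weight in kg by the square of height in metres.
# It is also computed before the -99 placeholder values are recoded as NA,
# so rows with placeholder values get a meaningless ratio.
heroes_combined <- mutate(heroes_combined, BMI = Weight / Height)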
head(heroes_combined)
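
For comparison, the conventional body mass index could be computed as sketched below. This is an illustration on hypothetical values, not the calculation used in this report: it assumes heights in cm and weights in kg, and it recodes the -99 placeholder (which this dataset uses for unrecorded heights and weights) before doing any arithmetic.

```r
# Hypothetical measurements; -99 marks an unrecorded value
weight_kg <- c(81, 90, -99)
height_cm <- c(180, -99, 175)

# Recode the placeholder before any arithmetic
weight_kg[weight_kg == -99] <- NA
height_cm[height_cm == -99] <- NA

# Standard BMI: weight in kg divided by squared height in metres
bmi <- weight_kg / (height_cm / 100)^2
print(bmi)
```

Rows with an unrecorded height or weight end up with an NA BMI, so they can be handled together with the other missing values.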

Scan I

# Checking for missing (NA) values in the dataset
colSums(is.na(heroes_combined))
        Name       Gender    Eye color         Race   Hair color       Height    Publisher   Skin color    Alignment       Weight Intelligence 
           0           24          139          299          271            0            7          556            3            0          170 
    Strength        Speed   Durability        Power       Combat        Total          BMI 
         169          171          172          172          170          172            0 
sum(is.na(heroes_combined))
[1] 2495
# Checking for infinite and NaN values
is.special <- function(x){
  # Non-numeric columns cannot hold Inf or NaN, so report FALSE for them
  if (is.numeric(x)) (is.infinite(x) | is.nan(x)) else rep(FALSE, length(x))
}
sapply(heroes_combined, function(x) sum(is.special(x)))
        Name       Gender    Eye color         Race   Hair color       Height    Publisher   Skin color    Alignment       Weight Intelligence 
           0            0            0            0            0            0            0            0            0            0            0 
    Strength        Speed   Durability        Power       Combat        Total          BMI 
           0            0            0            0            0            0            0 
#Checking for obvious errors and inconsistencies 
(Rule1 <- editset(c("Height >= 0","Weight >=0")))

Edit set:
num1 : 0 <= Height
num2 : 0 <= Weight 
Violated <- violatedEdits(Rule1,heroes_combined) 
summary(Violated)
Edit violations, 600 observations, 0 completely missing (0%):

Edit violations per record:
# Recoding the -99 placeholder values as NA
heroes_combined$Weight <- heroes_combined$Weight %>% na_if(-99)
heroes_combined$Height <- heroes_combined$Height %>% na_if(-99)
# Handling missing values for the variables
sum(is.na(heroes_combined$Gender))  
[1] 24
levels(heroes_combined$Gender) = c(levels(heroes_combined$Gender), "Not Specified")
heroes_combined$Gender[is.na(heroes_combined$Gender)] <- "Not Specified"

sum(is.na(heroes_combined$`Eye color`)) 
[1] 139
levels(heroes_combined$`Eye color`) = c(levels(heroes_combined$`Eye color`), "Unknown")
heroes_combined$`Eye color`[is.na(heroes_combined$`Eye color`)] <- "Unknown"

sum(is.na(heroes_combined$`Race`)) 
[1] 299
levels(heroes_combined$`Race`) = c(levels(heroes_combined$`Race`), "Unknown")
heroes_combined$`Race`[is.na(heroes_combined$`Race`)] <- "Unknown"

sum(is.na(heroes_combined$`Hair color`)) 
[1] 271
levels(heroes_combined$`Hair color`) = c(levels(heroes_combined$`Hair color`), "Unknown")
heroes_combined$`Hair color`[is.na(heroes_combined$`Hair color`)] <- "Unknown"

sum(is.na(heroes_combined$`Publisher`)) 
[1] 7
levels(heroes_combined$`Publisher`) = c(levels(heroes_combined$`Publisher`), "Unknown")
heroes_combined$`Publisher`[is.na(heroes_combined$`Publisher`)] <- "Unknown"

sum(is.na(heroes_combined$`Skin color`)) 
[1] 556
levels(heroes_combined$`Skin color`) = c(levels(heroes_combined$`Skin color`), "Unknown")
heroes_combined$`Skin color`[is.na(heroes_combined$`Skin color`)] <- "Unknown"

sum(is.na(heroes_combined$`Alignment`)) 
[1] 3
levels(heroes_combined$`Alignment`) = c(levels(heroes_combined$`Alignment`), "Unknown")
heroes_combined$`Alignment`[is.na(heroes_combined$`Alignment`)] <- "Unknown"
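
The same two steps are repeated for every factor above, so they could be wrapped in a small helper. The sketch below is one possible way to do this in base R; fill_missing_level and the toy data frame are hypothetical names introduced for illustration.

```r
# Add an explicit level for missing values and use it in place of NA
fill_missing_level <- function(f, label = "Unknown") {
  levels(f) <- c(levels(f), label)
  f[is.na(f)] <- label
  f
}

# Hypothetical toy data frame with two factor columns
toy <- data.frame(
  Gender = factor(c("Male", NA, "Female")),
  Race   = factor(c(NA, "Human", NA))
)

# Apply the helper to every factor column
is_fac <- sapply(toy, is.factor)
toy[is_fac] <- lapply(toy[is_fac], fill_missing_level)
print(toy)
```

A different label can be passed per column, for example fill_missing_level(toy$Gender, "Not Specified").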

#Imputing the values with Mean for numeric variables
heroes_combined$Weight<-impute(heroes_combined$Weight,fun = mean)
heroes_combined$Height<-impute(heroes_combined$Height,fun = mean)
heroes_combined$Intelligence<-impute(heroes_combined$Intelligence,fun = mean)
heroes_combined$Strength<-impute(heroes_combined$Strength,fun = mean)
heroes_combined$Speed<-impute(heroes_combined$Speed,fun = mean)
heroes_combined$Durability<-impute(heroes_combined$Durability,fun = mean)
heroes_combined$Power<-impute(heroes_combined$Power,fun = mean)
heroes_combined$Combat<-impute(heroes_combined$Combat,fun = mean)
heroes_combined$Total<-impute(heroes_combined$Total,fun = mean)
heroes_combined$BMI<-impute(heroes_combined$BMI,fun = mean)

colSums(is.na(heroes_combined))
        Name       Gender    Eye color         Race   Hair color       Height    Publisher   Skin color    Alignment       Weight Intelligence 
           0            0            0            0            0            0            0            0            0            0            0 
    Strength        Speed   Durability        Power       Combat        Total          BMI 
           0            0            0            0            0            0            0 
sum(is.na(heroes_combined))
[1] 0
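
What mean imputation does to a variable can be seen on a small example. The sketch below is a base R stand-in for impute(x, fun = mean) on hypothetical values; impute_mean is a name introduced here for illustration.

```r
# Replace NAs in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

speed <- c(10, NA, 30, NA, 50)   # hypothetical values
filled <- impute_mean(speed)
print(filled)

# The mean is preserved, but the spread shrinks after imputation
print(c(sd_before = sd(speed, na.rm = TRUE), sd_after = sd(filled)))
```

Because every missing entry receives the same value, variables with many missing entries become concentrated around their mean, which is worth keeping in mind when reading the boxplots and summaries that follow.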

Scan II

boxplot(as.numeric(heroes_combined$Weight),main="Boxplot of Superhero's weight",ylab="Weight",col="grey")

boxplot(as.numeric(heroes_combined$Height),main="Boxplot of Superhero's height",ylab="Height",col="grey")

boxplot(as.numeric(heroes_combined$Intelligence),main="Boxplot of Superhero's intelligence",ylab="Intelligence",col="grey")

boxplot(as.numeric(heroes_combined$Strength),main="Boxplot of Superhero's strength",ylab="Strength",col="grey")

boxplot(as.numeric(heroes_combined$Speed),main="Boxplot of Superhero's speed",ylab="Speed",col="grey")

boxplot(as.numeric(heroes_combined$Durability),main="Boxplot of Superhero's durability",ylab="Durability",col="grey")

boxplot(as.numeric(heroes_combined$Power),main="Boxplot of Superhero's power",ylab="Power",col="grey")

boxplot(as.numeric(heroes_combined$Combat),main="Boxplot of Superhero's combat",ylab="Combat",col="grey")

boxplot(as.numeric(heroes_combined$Total),main="Boxplot of Superhero's total stats",ylab="Total stats",col="grey")

boxplot(as.numeric(heroes_combined$BMI),main="Boxplot of Superhero's BMI",ylab="BMI",col="grey")

# Capping outliers for the 7 numeric variables that have outliers

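cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # Values below Q1 - 1.5*IQR are replaced with the 5th percentile
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  # Values above Q3 + 1.5*IQR are replaced with the 95th percentile
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}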
heroes_combined$Weight <- heroes_combined$Weight %>% cap()
boxplot(as.numeric(heroes_combined$Weight),main="Boxplot of Superhero's weight",ylab="Weight",col="blue")


heroes_combined$Height <- heroes_combined$Height %>% cap()
boxplot(as.numeric(heroes_combined$Height ),main="Boxplot of Superhero's height",ylab="Height",col="blue")


heroes_combined$Intelligence <- heroes_combined$Intelligence %>% cap()
boxplot(as.numeric(heroes_combined$Intelligence),main="Boxplot of Superhero's intelligence",ylab="Intelligence",col="blue")


heroes_combined$Speed <- heroes_combined$Speed %>% cap()
boxplot(as.numeric(heroes_combined$Speed),main="Boxplot of Superhero's Speed",ylab="Speed",col="blue")


heroes_combined$Combat <- heroes_combined$Combat %>% cap()
boxplot(as.numeric(heroes_combined$Combat),main="Boxplot of Superhero's Combat",ylab="Combat",col="blue")


heroes_combined$Total <- heroes_combined$Total %>% cap()
boxplot(as.numeric(heroes_combined$Total),main="Boxplot of Superhero's Total stats",ylab="Total Stats",col="blue")


heroes_combined$BMI <- heroes_combined$BMI %>% cap()
boxplot(as.numeric(heroes_combined$BMI),main="Boxplot of Superhero's BMI",ylab="BMI",col="blue")
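
The capping rule can be traced on a short vector. The sketch below repeats the cap() function so that it runs on its own and applies it to hypothetical values containing one obvious outlier.

```r
# Same capping rule as the report's cap() function, repeated here so the
# example is self-contained
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < q[2] - 1.5 * IQR(x)] <- q[1]
  x[x > q[3] + 1.5 * IQR(x)] <- q[4]
  x
}

# Hypothetical values with one obvious outlier (100)
x <- c(10, 12, 11, 13, 12, 11, 10, 13, 12, 100)
capped <- cap(x)

print(boxplot.stats(x)$out)   # values a boxplot would draw as outliers
print(capped)                 # the outlier is replaced by the 95th percentile
```

With only ten values the 95th percentile is itself pulled upwards by the outlier, so the capped value is still well above the rest; on a dataset of several hundred rows the percentile is much less sensitive to a single extreme value.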


#Summarizing the numeric variables after capping
summary(heroes_combined[, c(6, 10:18)])

 172 values imputed to 187.8131 
 180 values imputed to 114.2619 
 170 values imputed to 63.03721 
 169 values imputed to 41.35499 
 171 values imputed to 38.70629 
 172 values imputed to 59.89019 
 172 values imputed to 57.62617 
 170 values imputed to 60.77674 
 172 values imputed to 321.9346 

     Height          Weight       Intelligence       Strength          Speed         Durability         Power            Combat     
 Min.   :163.0   Min.   : 16.0   Min.   : 25.00   Min.   :  4.00   Min.   : 8.00   Min.   :  5.00   Min.   :  5.00   Min.   :28.00  
 1st Qu.:178.0   1st Qu.: 74.0   1st Qu.: 50.00   1st Qu.: 12.00   1st Qu.:25.00   1st Qu.: 42.00   1st Qu.: 43.00   1st Qu.:56.00  
 Median :187.8   Median :101.0   Median : 63.04   Median : 41.35   Median :38.71   Median : 59.89   Median : 57.63   Median :60.78  
 Mean   :184.5   Mean   :106.3   Mean   : 63.54   Mean   : 41.35   Mean   :38.32   Mean   : 59.89   Mean   : 57.63   Mean   :60.98  
 3rd Qu.:188.0   3rd Qu.:114.3   3rd Qu.: 75.00   3rd Qu.: 55.00   3rd Qu.:42.00   3rd Qu.: 80.00   3rd Qu.: 69.00   3rd Qu.:70.00  
 Max.   :211.0   Max.   :249.1   Max.   :100.00   Max.   :100.00   Max.   :83.00   Max.   :120.00   Max.   :100.00   Max.   :95.00  
     Total            BMI         
 Min.   :147.0   Min.   :-0.4648  
 1st Qu.:271.8   1st Qu.: 0.4000  
 Median :321.9   Median : 0.5277  
 Mean   :320.7   Mean   : 0.6718  
 3rd Qu.:359.0   3rd Qu.: 1.0000  
 Max.   :491.1   Max.   : 1.8182  

Transform

hist(heroes_combined$Height,main="Histogram of Superhero's Height")

centre_height <- scale(heroes_combined$Height, center = TRUE, scale = TRUE)
hist(centre_height,main="Histogram of Superhero's Height(Z-score Transformed)")
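
The following sketch, on hypothetical heights, shows what scale() computes: each value has the mean subtracted and is divided by the standard deviation. Because this is a linear rescaling, it changes the location and scale of the variable but not the shape of its distribution.

```r
# Hypothetical heights in cm, with one unusually tall value
height <- c(170, 175, 178, 180, 183, 185, 188, 191, 203, 230)

z <- as.numeric(scale(height, center = TRUE, scale = TRUE))

# Same result as the textbook formula (x - mean) / sd
manual <- (height - mean(height)) / sd(height)
print(all.equal(z, manual))

# Centred at 0 with unit standard deviation
print(c(mean = mean(z), sd = sd(z)))

# A linear rescaling: the z-scores are perfectly correlated with the original
print(cor(height, z))
```

If the goal were to reduce skewness rather than to rescale, a nonlinear transformation such as a logarithm or square root would be the usual candidate.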


plot(heroes_combined$Height, heroes_combined$Weight, main="Superheroes Height v/s Weight",
     xlab="Height", ylab="Weight", pch=19, xlim=c(160,215))
# Red line: least-squares linear fit; blue line: lowess smoother
abline(lm(Weight ~ Height, data = heroes_combined), col="red")
lines(lowess(heroes_combined$Height, heroes_combined$Weight), col="blue")
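
The direction of the relationship seen in the scatterplot can also be quantified. The sketch below uses hypothetical heights and weights (not the report's data) to compute the Pearson correlation and the slope of the least-squares line, which is the line that abline(lm(...)) draws.

```r
# Hypothetical heights (cm) and weights (kg)
height <- c(160, 170, 180, 190, 200)
weight <- c(55, 68, 80, 95, 110)

# Pearson correlation: positive when taller heroes tend to be heavier
r <- cor(height, weight)

# Slope of the least-squares line: change in weight per extra cm of height
fit <- lm(weight ~ height)
slope <- unname(coef(fit)["height"])

print(c(correlation = r, slope = slope))
```

On the merged dataset the same two calls would be applied to heroes_combined$Height and heroes_combined$Weight.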



