Adrian R Angkawijaya
27 July 2018
This is an Exploratory Data Analysis of the comparison between the characters in the Marvel and DC Universes. The analysis includes the distribution of the characters in terms of variables such as Race and Gender as well as looking at the top strongest characters for each of the two universes.
The dataset was originally obtained from the Superhero Database website and scraped to csv files.
Here load the three datasets:
setwd("/Users/adrianromano/Downloads/")
info <- read.csv("/Users/adrianromano/Downloads/superhero-set/heroes_information.csv", na.strings = c("-", "-99"))
stats <- read.csv("/Users/adrianromano/Downloads/superhero-set/characters_stats.csv", na.strings = "")
powers <- read.csv("/Users/adrianromano/Downloads/superhero-set/super_hero_powers.csv")
Import the necessary packages:
library(reshape2)
library(plyr)
library(dplyr)
Note: Since we are only interested in looking at Marvel and DC Comics data, we subset the dataset to only these two publishers and also remove duplicates of the character names so that each name is unique.
Subset the data only for Marvel and DC:
colnames(info)[colnames(info) == "name"] <- "Name"
marvel_dc_info <- info[(info$Publisher == "Marvel Comics" | info$Publisher == "DC Comics"), ]
Remove Name duplicates and select the columns that we are interested for the analysis:
marvel_dc_info <- marvel_dc_info[!duplicated(marvel_dc_info$Name), ]
marvel_dc_info <- marvel_dc_info %>%
select(Name, Gender, Race, Publisher)
Let’s combine the three datasets into one to see all their relationships together.
Note: Inner join is used to keep only the intersect character names. The rest of the data are dropped.
Join the first two datasets:
marvel_dc_stats_info <- join(marvel_dc_info, stats, by = "Name", type = "inner")
Then join them with the third dataset:
colnames(powers)[colnames(powers) == "hero_names"] <- "Name"
full_marvel_dc <- join(marvel_dc_stats_info, powers, by = "Name", type = "inner")
Check the first few rows of the combined dataset:
head(full_marvel_dc) # Result contains 186 columns so it is hidden to safe space
Notice that the data contains 186 columns with a lot of super power column names. They contain values of True and False that represent whether the character has that respective super power (True) or not (False).
We can transform them into a single column that contains all these super power column names to simplify the data.
We do so by using the melt function as follows:
marvel_dc <- melt(full_marvel_dc, id = c("Name", "Gender", "Race", "Publisher", "Alignment", "Intelligence",
"Strength", "Speed", "Durability", "Power", "Combat", "Total"))
colnames(marvel_dc)[colnames(marvel_dc) == "variable"] <- "Super.Power"
marvel_dc <- marvel_dc %>%
filter(value == "True") %>%
select(-value)
Check the first few rows of the dataset again:
head(marvel_dc)
## Name Gender Race Publisher Alignment Intelligence
## 1 Amazo Male Android DC Comics bad 75
## 2 Angel Male <NA> Marvel Comics good 63
## 3 Annihilus Male <NA> Marvel Comics bad 75
## 4 Ant-Man II Male Human Marvel Comics good 63
## 5 Apocalypse Male Mutant Marvel Comics bad 100
## 6 Aquaman Male Atlantean DC Comics good 63
## Strength Speed Durability Power Combat Total Super.Power
## 1 100 100 100 100 100 575 Agility
## 2 13 46 64 17 42 245 Agility
## 3 80 47 56 59 64 381 Agility
## 4 10 23 28 32 28 184 Agility
## 5 100 33 100 100 60 493 Agility
## 6 80 50 80 65 80 418 Agility
The data looks well structured and perfect for graphs !
Before moving on to the analysis, lets check if there are any missing values in the data.
Note: In the data, the missing values are represented by blank values, “-” or “-99”.
Check for missing values:
colSums(is.na(marvel_dc))
## Name Gender Race Publisher Alignment
## 0 47 943 0 17
## Intelligence Strength Speed Durability Power
## 0 0 0 0 0
## Combat Total Super.Power
## 0 0 0
The Gender, Race and Alignment columns have some missing values. Since the project does not involves any prediction model, we do not need to fill in the missing values and just filter them out for the analysis.
First, import the necessary packages:
library(ggplot2)
library(gridExtra)
library(wesanderson)
library(pander)
Convert the categorical columns to factors for visualization:
marvel_dc$Name <- as.factor(marvel_dc$Name)
marvel_dc$Gender <- as.factor(marvel_dc$Gender)
marvel_dc$Race <- as.factor(marvel_dc$Race)
marvel_dc$Publisher <- as.factor(marvel_dc$Publisher)
marvel_dc$Alignment <- as.factor(marvel_dc$Alignment)
marvel_dc$Super.Power <- as.factor(marvel_dc$Super.Power)
How many data observations are in each universe?
ggplot(marvel_dc, aes(x = Publisher, fill = Publisher)) +
geom_bar(stat = "count", aes(fill = Publisher), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Distribution of Marvel vs DC") +
geom_label(stat = "count", aes(label = ..count..)) +
guides(fill = FALSE) +
theme_classic()
Summary:
How many Males and Females are in each universe?
marvel_dc_gender <- marvel_dc %>%
filter(!is.na(Gender)) %>%
group_by(Gender) %>%
dplyr::count(Publisher) %>%
select(Gender, Publisher, Count = n)
ggplot(marvel_dc_gender, aes(x = Gender, y = Count)) +
geom_bar(stat = "identity", aes(fill = Gender), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Marvel vs DC Gender Comparison") +
facet_wrap(~Publisher) +
theme_bw()
Summary:
How about the race difference in the two universes?
## Marvel
marvel_race <- marvel_dc %>%
filter(!is.na(Race)) %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Race) %>%
dplyr::count(Race) %>%
select(Race, Count = n) %>%
arrange(-Count)
marvel_race <- ggplot(marvel_race[1:10, ], aes(x = reorder(Race, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Race), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 10 Marvel Races") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
## DC
dc_race <- marvel_dc %>%
filter(!is.na(Race)) %>%
filter(Publisher == "DC Comics") %>%
group_by(Race) %>%
dplyr::count(Race) %>%
select(Race, Count = n) %>%
arrange(-Count)
dc_race <- ggplot(dc_race[1:10, ], aes(x = reorder(Race, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Race), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 10 DC Races") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
grid.arrange(marvel_race, dc_race, ncol = 2)
Summary:
How many heroes and villains are in each universe?
marvel_dc_alignment <- marvel_dc %>%
filter(!is.na(Alignment)) %>%
group_by(Alignment) %>%
dplyr::count(Publisher) %>%
select(Alignment, Publisher, Count = n)
ggplot(marvel_dc_alignment, aes(x = Alignment, y = Count)) +
geom_bar(stat = "identity", aes(fill = Alignment), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Characters", title = "Heroes vs Villains Comparison") +
guides(fill = FALSE) +
facet_wrap(~Publisher) +
theme_bw()
Summary:
Next, we look at the comparison of stats between the two universes.
Note: The stats consists of Intelligence, Strength, Speed, Durability, Power and Combat Skills ranging from 0 - 100.
Here is the boxplot comparison of all the stats for the two universes:
## Intelligence
marvel_dc_intelligence <- ggplot(marvel_dc, aes(x = Publisher, y = Intelligence, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Intelligence") +
scale_fill_brewer(palette="Spectral") +
guides(fill = FALSE) +
theme_minimal()
## Strength
marvel_dc_strength <- ggplot(marvel_dc, aes(x = Publisher, y = Strength, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Strength") +
scale_fill_manual(values=wes_palette(n=2, name= "Royal1")) +
guides(fill = FALSE) +
theme_minimal()
## Speed
marvel_dc_speed <- ggplot(marvel_dc, aes(x = Publisher, y = Speed, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Speed") +
scale_fill_manual(values=wes_palette(n=2, name= "Moonrise3")) +
guides(fill = FALSE) +
theme_minimal()
## Durability
marvel_dc_durability <- ggplot(marvel_dc, aes(x = Publisher, y = Durability, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Durability") +
scale_fill_brewer(palette="blues") +
guides(fill = FALSE) +
theme_minimal()
## Power
marvel_dc_power <- ggplot(marvel_dc, aes(x = Publisher, y = Power, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Power") +
scale_fill_brewer(palette="Dark2") +
guides(fill = FALSE) +
theme_minimal()
## Combat
marvel_dc_combat <- ggplot(marvel_dc, aes(x = Publisher, y = Combat, fill = Publisher)) +
geom_boxplot(alpha = 0.5) +
labs(x = "", title = "Boxplot Comparison of Combat Skills") +
scale_fill_manual(values=wes_palette(n=2, name= "Rushmore")) +
guides(fill = FALSE) +
theme_minimal()
grid.arrange(marvel_dc_intelligence, marvel_dc_strength, marvel_dc_speed,
marvel_dc_durability, marvel_dc_power, marvel_dc_combat, ncol = 2)
Summary:
Which characters have the highest intelligence in each universe?
## Marvel
marvel_intel <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Name) %>%
distinct(Intelligence) %>%
select(Name, Intelligence) %>%
arrange(-Intelligence)
ggplot(marvel_intel[1:15, ], aes(x = reorder(Name, Intelligence), y = Intelligence)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 110) +
labs(x = "", y = "", title = "Top 15 Marvel Characters with the Highest Intelligence") +
coord_flip() +
theme_bw()
## DC
dc_intel <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Name) %>%
distinct(Intelligence) %>%
select(Name, Intelligence) %>%
arrange(-Intelligence)
ggplot(dc_intel[1:15, ], aes(x = reorder(Name, Intelligence), y = Intelligence)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 120) +
labs(x = "", y = "", title = "Top 15 DC Characters with the Highest Intelligence") +
coord_flip() +
theme_bw()
Summary:
Which characters have the highest strength in each universe?
## Marvel
marvel_str <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Name) %>%
distinct(Strength) %>%
select(Name, Strength) %>%
arrange(-Strength)
ggplot(marvel_str[1:15, ], aes(x = reorder(Name, Strength), y = Strength)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 110) +
labs(x = "", y = "", title = "Top 15 Marvel Characters with the Highest Strength") +
coord_flip() +
theme_bw()
## DC
dc_str <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Name) %>%
distinct(Strength) %>%
select(Name, Strength) %>%
arrange(-Strength)
ggplot(dc_str[1:15, ], aes(x = reorder(Name, Strength), y = Strength)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 110) +
labs(x = "", y = "", title = "Top 15 DC Characters with the Highest Strength") +
coord_flip() +
theme_bw()
Summary:
Which characters have the highest speed in each universe?
## Marvel
marvel_spd <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Name) %>%
distinct(Speed) %>%
select(Name, Speed) %>%
arrange(-Speed)
ggplot(marvel_spd[1:15, ], aes(x = reorder(Name, Speed), y = Speed)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 110) +
labs(x = "", y = "", title = "Top 15 Marvel Characters with the Highest Speed") +
coord_flip() +
theme_bw()
## DC
dc_spd <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Name) %>%
distinct(Speed) %>%
select(Name, Speed) %>%
arrange(-Speed)
ggplot(dc_spd[1:15, ], aes(x = reorder(Name, Speed), y = Speed)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
ylim(0, 110) +
labs(x = "", y = "", title = "Top 15 DC Characters with the Highest Speed") +
coord_flip() +
theme_bw()
Summary:
Now that we have looked at each stats, lets now see the total stats altogether to see which characters are the strongest in each universe.
Note: The analysis of the “strongest” depends on all the combined stats mentioned above and not just based on Strength. For example, if Hulk has 100 Strength but 30 Intelligence, he will not make it to the top strongest (Sad..he’s one of my favourites).
Lets first look at the boxplot comparison of the overall stats:
ggplot(marvel_dc, aes(x = Publisher, y = Total, fill = Publisher)) +
geom_boxplot() +
labs(x = "", title = "Boxplot Comparison of Overall Stats") +
scale_fill_brewer(palette="Dark2") +
theme_minimal()
Which characters are the strongest in each universe?
## Marvel
marvel_total <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Name) %>%
distinct(Total) %>%
select(Name, Total) %>%
arrange(-Total)
ggplot(marvel_total[1:20, ], aes(x = reorder(Name, Total), y = Total)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 20 Strongest Marvel Characters") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
## DC
dc_total <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Name) %>%
distinct(Total) %>%
select(Name, Total) %>%
arrange(-Total)
ggplot(dc_total[1:20, ], aes(x = reorder(Name, Total), y = Total)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 20 Strongest DC Characters") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
Summary:
Now lets look at the super powers comparison:
## Marvel
marvel_super <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Super.Power) %>%
dplyr::count(Super.Power) %>%
select(Super.Power, Count = n) %>%
arrange(-Count)
ggplot(marvel_super[1:15, ], aes(x = reorder(Super.Power, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Super.Power), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 15 Marvel Super Powers") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
## DC
dc_super <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Super.Power) %>%
dplyr::count(Super.Power) %>%
select(Super.Power, Count = n) %>%
arrange(-Count)
ggplot(dc_super[1:15, ], aes(x = reorder(Super.Power, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Super.Power), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 15 DC Super Powers") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
Summary:
Which characters have the highest number of super powers in each universe?
## Marvel
marvel_power <- marvel_dc %>%
filter(Publisher == "Marvel Comics") %>%
group_by(Name) %>%
dplyr::count(Name) %>%
select(Name, Count = n) %>%
arrange(-Count)
ggplot(marvel_power[1:20, ], aes(x = reorder(Name, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 20 Marvel Characters with the Highest Number of Super Powers") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
## DC
dc_power <- marvel_dc %>%
filter(Publisher == "DC Comics") %>%
group_by(Name) %>%
dplyr::count(Name) %>%
select(Name, Count = n) %>%
arrange(-Count)
ggplot(dc_power[1:20, ], aes(x = reorder(Name, Count), y = Count)) +
geom_bar(stat = "identity", aes(fill = Name), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Top 20 DC Characters with the Highest Number of Super Powers") +
coord_flip() +
guides(fill = FALSE, alpha = FALSE) +
theme_bw()
Summary: