Introduction

This project is an investigation into both DC & Marvel characters. The dataset can be found here.

The purpose of this project is to demonstrate a framework for data pre-processing and to perform exploratory data analysis (EDA).

Data Preparation & Pre-Processing

Preparing the workspace and dataset

First, we will call the libraries needed for our initial processing and EDA.

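library(tidyverse)     # attaches dplyr, ggplot2, stringr, and other core packages
library(dplyr)         # already attached by tidyverse; listed explicitly
library(ggplot2)       # already attached by tidyverse; listed explicitly
library(RColorBrewer)  # colour palettes used by scale_fill_brewer()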

Next, we will read in the data. The source files are divided into DC characters & Marvel characters. Take a glance at the data by examining the top six rows of each.

#read blank fields as NA and text columns as factors
dc_comics <- read.csv("dc-wikia-data.csv", na.strings = "", stringsAsFactors = TRUE)
marvel_comics <- read.csv("marvel-wikia-data.csv", na.strings = "", stringsAsFactors = TRUE)

head(dc_comics)
##   page_id                        name                               urlslug
## 1    1422        Batman (Bruce Wayne)        \\/wiki\\/Batman_(Bruce_Wayne)
## 2   23387       Superman (Clark Kent)       \\/wiki\\/Superman_(Clark_Kent)
## 3    1458  Green Lantern (Hal Jordan)  \\/wiki\\/Green_Lantern_(Hal_Jordan)
## 4    1659    James Gordon (New Earth)    \\/wiki\\/James_Gordon_(New_Earth)
## 5    1576 Richard Grayson (New Earth) \\/wiki\\/Richard_Grayson_(New_Earth)
## 6    1448 Wonder Woman (Diana Prince) \\/wiki\\/Wonder_Woman_(Diana_Prince)
##                ID           ALIGN        EYE       HAIR               SEX  GSM
## 1 Secret Identity Good Characters  Blue Eyes Black Hair   Male Characters <NA>
## 2 Secret Identity Good Characters  Blue Eyes Black Hair   Male Characters <NA>
## 3 Secret Identity Good Characters Brown Eyes Brown Hair   Male Characters <NA>
## 4 Public Identity Good Characters Brown Eyes White Hair   Male Characters <NA>
## 5 Secret Identity Good Characters  Blue Eyes Black Hair   Male Characters <NA>
## 6 Public Identity Good Characters  Blue Eyes Black Hair Female Characters <NA>
##               ALIVE APPEARANCES FIRST.APPEARANCE YEAR
## 1 Living Characters        3093        1939, May 1939
## 2 Living Characters        2496    1986, October 1986
## 3 Living Characters        1565    1959, October 1959
## 4 Living Characters        1316   1987, February 1987
## 5 Living Characters        1237      1940, April 1940
## 6 Living Characters        1231   1941, December 1941
head(marvel_comics)
##   page_id                                  name
## 1    1678             Spider-Man (Peter Parker)
## 2    7139       Captain America (Steven Rogers)
## 3   64786 Wolverine (James \\"Logan\\" Howlett)
## 4    1868   Iron Man (Anthony \\"Tony\\" Stark)
## 5    2460                   Thor (Thor Odinson)
## 6    2458            Benjamin Grimm (Earth-616)
##                                    urlslug               ID              ALIGN
## 1             \\/Spider-Man_(Peter_Parker)  Secret Identity    Good Characters
## 2       \\/Captain_America_(Steven_Rogers)  Public Identity    Good Characters
## 3 \\/Wolverine_(James_%22Logan%22_Howlett)  Public Identity Neutral Characters
## 4   \\/Iron_Man_(Anthony_%22Tony%22_Stark)  Public Identity    Good Characters
## 5                   \\/Thor_(Thor_Odinson) No Dual Identity    Good Characters
## 6            \\/Benjamin_Grimm_(Earth-616)  Public Identity    Good Characters
##          EYE       HAIR             SEX  GSM             ALIVE APPEARANCES
## 1 Hazel Eyes Brown Hair Male Characters <NA> Living Characters        4043
## 2  Blue Eyes White Hair Male Characters <NA> Living Characters        3360
## 3  Blue Eyes Black Hair Male Characters <NA> Living Characters        3061
## 4  Blue Eyes Black Hair Male Characters <NA> Living Characters        2961
## 5  Blue Eyes Blond Hair Male Characters <NA> Living Characters        2258
## 6  Blue Eyes    No Hair Male Characters <NA> Living Characters        2255
##   FIRST.APPEARANCE Year
## 1           Aug-62 1962
## 2           Mar-41 1941
## 3           Oct-74 1974
## 4           Mar-63 1963
## 5           Nov-50 1950
## 6           Nov-61 1961
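
The read.csv() calls above treat blank fields as NA. As a minimal sketch of what that does, the example below reads a small made-up CSV (the names and values are hypothetical, not taken from the dataset) and checks where the missing values land.

```r
# Toy CSV text standing in for the real files (values are made up).
csv_text <- "name,EYE,HAIR
Hero A,Blue Eyes,
Hero B,,Black Hair"

# na.strings = "" turns empty fields into NA instead of an empty-string level.
toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

is.na(toy$HAIR)   # which rows had a blank HAIR field
levels(toy$EYE)   # only non-empty values become factor levels
```

Without this argument, a blank field in a text column would be kept as an empty string and would not be counted by is.na().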

Both datasets have the same attribute columns; however, before combining them, the column header case must be standardized (for example, DC uses YEAR where Marvel uses Year). To distinguish the two sources once they have been combined, we will add a publisher column.

#make the column headers all lowercase 
names(dc_comics)[4:13] <- tolower(names(dc_comics[4:13]))
names(marvel_comics)[4:13] <- tolower(names(marvel_comics[4:13]))

#add publisher column 
dc_comics$publisher <- "dc"
marvel_comics$publisher<-"marvel"
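
The code above lowercases columns 4 through 13 by position. A possible alternative, sketched here on two hypothetical one-row data frames, is to lowercase every header so the step does not depend on column order.

```r
# Toy frames with mismatched header case (made-up values).
toy_dc     <- data.frame(page_id = 1, ALIGN = "Good Characters", YEAR = 1939)
toy_marvel <- data.frame(page_id = 2, ALIGN = "Bad Characters",  Year = 1962)

# Lowercase all headers rather than a positional range.
names(toy_dc)     <- tolower(names(toy_dc))
names(toy_marvel) <- tolower(names(toy_marvel))

# The headers now match, so the frames can be stacked.
identical(names(toy_dc), names(toy_marvel))
```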

We will perform a union of the datasets and take a glance at the structure.

comics <- union(dc_comics, marvel_comics)
str(comics)
## 'data.frame':    23272 obs. of  14 variables:
##  $ page_id         : int  1422 23387 1458 1659 1576 1448 1486 1451 71760 1380 ...
##  $ name            : Factor w/ 23272 levels "3g4 (New Earth)",..: 598 6007 2488 3002 5280 6771 378 6289 1695 2185 ...
##  $ urlslug         : Factor w/ 23272 levels "\\/wiki\\/3g4_(New_Earth)",..: 598 6006 2488 3003 5279 6771 378 6288 1695 2185 ...
##  $ id              : Factor w/ 5 levels "Identity Unknown",..: 3 3 3 2 3 2 2 3 2 3 ...
##  $ align           : Factor w/ 4 levels "Bad Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ eye             : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 4 4 5 5 4 4 4 4 4 4 ...
##  $ hair            : Factor w/ 28 levels "Black Hair","Blond Hair",..: 1 1 4 17 1 1 2 1 2 2 ...
##  $ sex             : Factor w/ 6 levels "Female Characters",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ gsm             : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive           : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances     : int  3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
##  $ first.appearance: Factor w/ 1606 levels "1935, October",..: 15 455 156 461 20 33 39 486 261 129 ...
##  $ year            : int  1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
##  $ publisher       : chr  "dc" "dc" "dc" "dc" ...
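
Note that dplyr's union() is a set operation, so exact duplicate rows are dropped rather than stacked. The sketch below shows the same distinction in base R on two hypothetical frames; rbind() and unique() are used here as stand-ins, not as the method used above.

```r
# Two toy frames that share one identical row (made-up values).
a <- data.frame(name = c("Hero A", "Hero B"), publisher = "dc")
b <- data.frame(name = c("Hero B", "Hero C"), publisher = "dc")

stacked  <- rbind(a, b)           # keeps every row, duplicates included
distinct <- unique(rbind(a, b))   # drops repeated rows, as a set union does

nrow(stacked)
nrow(distinct)
```

Because the publisher column differs between the two sources, rows from DC and Marvel cannot collide with each other; only exact repeats within one source would be affected.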

Once combined, there are 23272 observations of 14 variables: 10 factor variables, 3 integer variables, and 1 character variable. Now that we have the dataset combined and prepared, we will begin the pre-processing and data cleaning.

Pre-processing

1) Drop Unnecessary Columns

For the purposes of this analysis, we will not be using “page_id” or “urlslug”, so we will drop both of these columns.

comics <- comics[-c(1,3)]
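
Dropping by position works, but it relies on the column order staying fixed. As a sketch on a hypothetical frame, the same columns could be dropped by name instead.

```r
# Toy frame with the same two columns we want to drop (made-up values).
toy <- data.frame(
  page_id = 1:2,
  name    = c("Hero A", "Hero B"),
  urlslug = c("/a", "/b"),
  id      = c("Secret", "Public")
)

# Keep every column whose name is not in the drop list.
drop_cols <- c("page_id", "urlslug")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy)
```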

2) Rename columns, as needed

Gender is a more accurate term for the features in the sex column. We will rename that column now.

comics <- rename(comics, gender = sex)

3) Clean data within columns

First, we will remove repetitive or superfluous words from each column. For example, the word “Characters” is redundant since each of our objects is a character. Each of these columns is a factor, and str_remove_all() returns a character vector, so we use as.factor() to reset the type to factor after removing the unwanted string.

comics <- comics %>% 
  mutate(align = as.factor(str_remove_all(align, " Characters"))) %>% 
  mutate(id = as.factor(str_remove_all(id, " Identity"))) %>% 
  mutate(eye = as.factor(str_remove_all(eye, " Eyes"))) %>% 
  mutate(hair = as.factor(str_remove_all(hair, " Hair"))) %>% 
  mutate(gender = as.factor(str_remove_all(gender, " Characters"))) %>% 
  mutate(gsm = as.factor(str_remove_all(gsm, " Characters"))) %>% 
  mutate(alive = as.factor(str_remove_all(alive, " Characters")))
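
To illustrate what the removal does to a single column, here is a small sketch on made-up values. It uses base gsub() as a stand-in for str_remove_all() (a substitution made for this example only). Because the pattern starts with a space, a value such as "Identity Unknown" is left untouched, and NA stays NA.

```r
# Toy id values, including one NA (made-up data).
id_raw <- factor(c("Secret Identity", "Identity Unknown", NA))

# Remove the literal suffix, then convert back to a factor.
id_clean <- as.factor(gsub(" Identity", "", as.character(id_raw), fixed = TRUE))

levels(id_clean)
is.na(id_clean)
```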

Next, we will take a look at each of the columns’ levels and distributions within the levels to determine if any should be regrouped or re-assigned as is logically appropriate.

ID

table(comics$id)
## 
##     Identity Unknown Known to Authorities              No Dual 
##                    9                   15                 1788 
##               Public               Secret 
##                 6994                 8683
#convert to character so strings may be edited
comics$id <- as.character(comics$id)

#reassign id = "Identity Unknown" to "Unknown" & "Known to Authorities" to "Secret"
comics["id"][comics["id"] == "Identity Unknown"] <- "Unknown"
comics["id"][comics["id"] == "Known to Authorities"] <- "Secret"

#return to factor 
comics$id <- as.factor(comics$id)
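
The round trip through character works; a possible shortcut, sketched below on made-up values, is to rename the factor levels directly. Assigning a name that already exists merges the two levels.

```r
# Toy id factor (made-up values).
id <- factor(c("Secret", "Identity Unknown", "Known to Authorities", "Public"))

# Rename one level, then fold another into an existing level.
levels(id)[levels(id) == "Identity Unknown"]     <- "Unknown"
levels(id)[levels(id) == "Known to Authorities"] <- "Secret"

table(id)
```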

align

table(comics$align)
## 
##                Bad               Good            Neutral Reformed Criminals 
##               9615               7468               2773                  3
#convert to character so string may be edited
comics$align<-as.character(comics$align)

#blank out "Reformed Criminals" since there are so few in the group
#(the empty level is dropped later, when the factor levels are set for plotting)
comics["align"][comics["align"] == "Reformed Criminals"] <- ""

#revert to factor
comics$align<-as.factor(comics$align)

eye

table(comics$eye)
## 
##           Amber     Auburn Hair           Black  Black Eyeballs            Blue 
##              15               7             967               3            3064 
##           Brown        Compound            Gold           Green            Grey 
##            2803               1              23             904             135 
##           Hazel         Magenta        Multiple              No         One Eye 
##              99               2               7               7              21 
##          Orange   Photocellular            Pink          Purple             Red 
##              35              48              27              45             716 
##          Silver        Variable          Violet           White          Yellow 
##              12              49              23             516             342 
## Yellow Eyeballs 
##               6
#recategorize some into the larger groupings
#note: "Auburn Hair" (7 rows) is a hair value recorded in the eye column; it is left unchanged here
comics$eye <- as.character(comics$eye)
comics["eye"][comics["eye"] == "Black Eyeballs"] <- "Black"
comics["eye"][comics["eye"] == "Compound"] <- "Multiple"
comics["eye"][comics["eye"] == "Magenta"] <- "Pink"
comics["eye"][comics["eye"] == "Yellow Eyeballs"] <- "Yellow"
comics$eye <- as.factor(comics$eye)

hair

table(comics$hair)
## 
##           Auburn             Bald            Black            Blond 
##               78              838             5329             2326 
##             Blue           Bronze            Brown             Dyed 
##               97                1             3487                1 
##             Gold            Green             Grey      Light Brown 
##               13              159              688                6 
##          Magenta               No           Orange     Orange-brown 
##                5             1176               64                3 
##             Pink   Platinum Blond           Purple              Red 
##               42                2               79             1081 
##    Reddish Blond    Reddish Brown           Silver Strawberry Blond 
##                6                3               19               75 
##         Variable           Violet            White           Yellow 
##               32                4             1100               20
#recategorize some into the larger groupings 
comics$hair <- as.character(comics$hair)
comics["hair"][comics["hair"] == "No"] <- "Bald"
comics["hair"][comics["hair"] == "Bronze"] <- "Brown"
comics["hair"][comics["hair"] == "Gold"] <- "Blond"
comics["hair"][comics["hair"] == "Light Brown"] <- "Brown"
comics["hair"][comics["hair"] == "Magenta"] <- "Pink"
comics["hair"][comics["hair"] == "Dyed"] <- ""
comics["hair"][comics["hair"] == "Orange-brown"] <- "Orange"
comics["hair"][comics["hair"] == "Platinum Blond"] <- "Blond"
comics["hair"][comics["hair"] == "Reddish Blond"] <- "Auburn"
comics["hair"][comics["hair"] == "Reddish Brown"] <- "Auburn"
comics["hair"][comics["hair"] == "Strawberry Blond"] <- "Auburn"
comics["hair"][comics["hair"] == "Violet"] <- "Purple"
comics["hair"][comics["hair"] == "Yellow"] <- "Blond"
comics$hair <- as.factor(comics$hair)
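
When many values are recoded at once, a lookup table can keep the mapping in one place. The sketch below applies a named vector to made-up hair values; recode_map is a hypothetical name and covers only a few of the pairs used above.

```r
# Toy hair values, including an NA (made-up data).
hair <- c("No", "Gold", "Black", NA, "Violet")

# Old value = new value; only a subset of the mapping is shown.
recode_map <- c("No" = "Bald", "Gold" = "Blond", "Violet" = "Purple")

# Replace only the values that appear in the lookup table.
hit <- hair %in% names(recode_map)
hair[hit] <- recode_map[hair[hit]]

hair <- as.factor(hair)
levels(hair)
```

Values not listed in the table, including NA, pass through unchanged.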

gender

table(comics$gender)
## 
##     Agender      Female Genderfluid  Genderless        Male Transgender 
##          45        5804           2          20       16421           1
#group the small categories into "Other"
comics$gender <- as.character(comics$gender)
comics["gender"][comics["gender"] == "Agender"] <- "Other"
comics["gender"][comics["gender"] == "Transgender"] <- "Other"
comics["gender"][comics["gender"] == "Genderless"] <- "Other"
comics["gender"][comics["gender"] == "Genderfluid"] <- "Other"
comics$gender <- as.factor(comics$gender)

gsm

table(comics$gsm)
## 
##      Bisexual   Genderfluid    Homosexual     Pansexual   Transgender 
##            29             1           120             1             2 
## Transvestites 
##             1
#blank out levels with only one or two characters; blanks are filtered out before plotting
comics$gsm <- as.character(comics$gsm)
comics["gsm"][comics["gsm"] == "Genderfluid"] <- ""
comics["gsm"][comics["gsm"] == "Pansexual"] <- ""
comics["gsm"][comics["gsm"] == "Transgender"] <- ""
comics["gsm"][comics["gsm"] == "Transvestites"] <- ""
comics$gsm <- as.factor(comics$gsm)

year

#Convert to a factor 
comics$year = as.factor(comics$year)

publisher

#convert to factor
comics$publisher <- as.factor(comics$publisher)

4) Check for NAs

Due to the number of NA values in this dataset, we cannot simply remove every row that contains an NA or retain only the complete cases. That said, we should have a sense of where the large gaps are in our dataset as we progress to the exploratory phase.

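#avoid naming the vector "names", which would shadow the base function
col_names <- colnames(comics)

#note: values blanked to "" during cleaning are not counted as NA here
for (i in col_names){  
  response <- c('Number of NAs for', i, sum(is.na(comics[i])))
  print(response)
}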
## [1] "Number of NAs for" "name"              "0"                
## [1] "Number of NAs for" "id"                "5783"             
## [1] "Number of NAs for" "align"             "3413"             
## [1] "Number of NAs for" "eye"               "13395"            
## [1] "Number of NAs for" "hair"              "6538"             
## [1] "Number of NAs for" "gender"            "979"              
## [1] "Number of NAs for" "gsm"               "23118"            
## [1] "Number of NAs for" "alive"             "6"                
## [1] "Number of NAs for" "appearances"       "1451"             
## [1] "Number of NAs for" "first.appearance"  "884"              
## [1] "Number of NAs for" "year"              "884"              
## [1] "Number of NAs for" "publisher"         "0"
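
The same counts can be obtained in one call. As a minimal sketch on a made-up frame, colSums(is.na(...)) returns a named vector of NA counts, which can also be turned into a share of rows.

```r
# Toy frame with missing values (made-up data).
toy <- data.frame(
  name = c("A", "B", "C"),
  eye  = c("Blue", NA, NA),
  gsm  = c(NA, NA, NA)
)

# NA count per column, then the same as a proportion of rows.
na_counts <- colSums(is.na(toy))
na_counts
round(na_counts / nrow(toy), 2)
```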

The gsm and eye columns have a large number of NA values (23118 and 13395 of 23272 rows, respectively). We will keep them in the dataset; however, this should be kept in mind if we choose to investigate either of those attributes more closely.

Exploratory Data Analysis

From our original pre-processing, we have a sense of the levels that are present in each of the factor variables. We will begin by visualizing the distribution of attributes.

Alignment

1) Distribution

comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

#the bars use a single fixed fill, so no fill scale is needed
ggplot(comics, aes(x=align))+
  geom_bar(fill = "#005293") + 
  theme(axis.text.x = element_text(angle=90))+
  labs(
    title = "Distribution of Alignment", 
    x="Alignment", 
    y="Number of Characters")

The largest share of characters in the dataset is classified as “Bad.” Let’s now look at how alignment is distributed across other attributes.


2) Gender & Alignment

#na.rm is not an aesthetic, so it is removed from aes(); NA alignments are dropped by subset()
ggplot(data=subset(comics, !is.na(align)), aes(x=gender, fill=align)) + 
  geom_bar(position="dodge") +
  theme(axis.text.x = element_text(angle=90)) +
  labs(
    title="DC & Marvel Character Alignment by Gender", 
    x = "Gender", 
    y = "Number of Characters",
    fill = "Alignment"
  ) + 
  scale_fill_brewer(palette="Set1")

3) Alignment by Persona

comics %>% 
  filter(!is.na(id) & !is.na(align)) %>%
  ggplot(aes(x=id, fill=align))+
  geom_bar(position="fill")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Proportion of Character Alignment by Persona",
    x="Persona", 
    y="Proportion Alignment",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
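
With position="fill", each bar is rescaled so that its segments sum to 1. The numbers behind such a chart can be sketched with prop.table() on a made-up cross-tabulation (the counts below are hypothetical).

```r
# Toy persona and alignment values (made-up data).
id    <- c("Secret", "Secret", "Secret", "Public", "Public")
align <- c("Bad", "Bad", "Good", "Good", "Good")

# Counts per persona and alignment, then row-wise proportions.
counts <- table(id, align)
props  <- prop.table(counts, margin = 1)

props
rowSums(props)   # each persona sums to 1, like a filled bar
```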

4) Alignment by Eye Color

comics %>% 
  filter(!is.na(eye) & !is.na(align) & eye != "") %>%
  ggplot(aes(x=eye, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Eye Color",
    x="Eye Color", 
    y="Number of Characters", 
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")

5) Alignment by Hair Color

comics %>% 
  filter(!is.na(hair) & !is.na(align) & hair != "") %>%
  ggplot(aes(x=hair, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Hair Color",
    x="Hair Color", 
    y="Number of Characters", 
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")

6) Alignment by GSM

comics %>% 
  filter(!is.na(gsm) & !is.na(align) & gsm != "") %>%
  ggplot(aes(x=gsm, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Sexuality",
    x="Sexuality", 
    y="Number of Characters", 
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")

7) Alignment by Year

lvls <- levels(comics$year)
comics %>% 
  filter(!is.na(year) & !is.na(align) & year != "") %>%
  ggplot(aes(x=year, fill=align))+
  geom_bar(position="fill")+
  labs(
    title="Character Alignment by Year",
    x="Year", 
    y="Proportion Alignment", 
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1") + 
  #label every fourth year to keep the axis readable
  scale_x_discrete(breaks = lvls[seq(1, length(lvls), by=4)])+ 
  theme(axis.text.x = element_text(angle=90))
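
The breaks argument above keeps every fourth year label. A small sketch of that indexing on a made-up range of years:

```r
# Toy year factor covering twelve consecutive years (made-up range).
year <- factor(1935:1946)
lvls <- levels(year)

# Positions 1, 5, 9, ... of the levels become the axis labels.
breaks <- lvls[seq(1, length(lvls), by = 4)]
breaks
```

Note that the selection is by level position, so if a year is absent from the data the labels will not be spaced exactly four calendar years apart.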

8) Alignment by Publisher

comics %>% 
  filter(!is.na(align)) %>%
  ggplot(aes(x=publisher, fill=align))+
  geom_bar(position="fill")+
  labs(
    title="Character Alignment by Publisher",
    x="Publisher", 
    y="Proportion Alignment", 
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")

9) Alignment by Number of Appearances & Gender

comics %>% 
  filter(!is.na(align) & !is.na(appearances) & !is.na(gender)) %>% 
  ggplot(aes(x=align, y=appearances, color=gender))+
  geom_jitter() +
  #gender is mapped to color, so the legend title is set with color, not fill
  labs(
    title = "Character Alignment by Number of Appearances", 
    x="Alignment",
    y="Number of Character Appearances", 
    color="Gender"
  )

Conclusions based on Alignment:

  • Across all characters, more are classified as “bad” than “good”; however, “good” characters appear to have a greater average number of appearances. This is consistent with the same “good” characters taking down multiple “bad” characters.
  • There are more male characters than female characters overall, but the proportion of “good” characters is higher among female characters than among male characters.
  • “Bad” characters are more likely to have a secret identity, whereas publicly known characters are more likely to be “good”.
  • Blue- and brown-eyed characters have a higher proportion of “good” characters. Characters with red, white, or yellow eyes are more likely to be “bad” than “good”.
  • The distributions within hair colors are all relatively even, except for bald characters, which have a higher proportion of “bad” characters.
  • Bisexual and homosexual characters have a greater proportion of “good” characters than “bad” or “neutral” characters.
  • The 1940s were a great time for comic book villains, with much higher proportions of “bad” characters than “good”.
  • DC has a higher proportion of “good” characters than Marvel, though both are fairly even on the proportion of “bad” characters. Marvel has a greater proportion of neutral characters.
  • If I’m in the DC or Marvel universe and a bald-headed, red-eyed male walks toward me, there is a pretty high likelihood he’s a supervillain, and I’m running in the other direction.
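
The first conclusion compares average appearances across alignments, which the jitter plot only shows visually. One way to check it numerically is a grouped summary; the sketch below uses made-up numbers, not values from the dataset.

```r
# Toy alignment and appearance counts (made-up data, with some NAs).
toy <- data.frame(
  align       = c("Bad", "Bad", "Bad", "Good", "Good", NA),
  appearances = c(5, 10, NA, 40, 60, 7)
)

# Mean and median appearances per alignment, ignoring missing counts.
mean_app   <- tapply(toy$appearances, toy$align, mean,   na.rm = TRUE)
median_app <- tapply(toy$appearances, toy$align, median, na.rm = TRUE)

mean_app
median_app
```

Appearance counts are heavily skewed toward a few very frequent characters, so the median may be worth reporting alongside the mean.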


Next Steps

The next step I’d take with this project is to dive deeper into the exploratory data analysis to identify other interesting trends among the attributes besides alignment.

This dataset could be used to build and train a multinomial logistic regression model that would predict “Good”, “Bad”, or “Neutral” based on a set of attributes.
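
As a rough sketch of what such a model could look like, the example below fits a three-class (multinomial) logistic regression with nnet::multinom(). Everything here is an assumption made for illustration: the data are randomly generated, the predictors gender and id are chosen arbitrarily, and no result from it says anything about the real dataset.

```r
library(nnet)  # recommended package that ships with standard R installations

# Randomly generated stand-in data (not the comics dataset).
set.seed(1)
n <- 150
toy <- data.frame(
  align  = factor(sample(c("Bad", "Good", "Neutral"), n, replace = TRUE)),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  id     = factor(sample(c("Public", "Secret"), n, replace = TRUE))
)

# Fit the multinomial model; trace = FALSE silences the iteration log.
fit <- multinom(align ~ gender + id, data = toy, trace = FALSE)

# Predicted class for each row, compared against the actual class.
pred <- predict(fit, newdata = toy, type = "class")
table(predicted = pred, actual = toy$align)
```

On the real data, the rows with missing predictors would need to be handled first, and the model should be evaluated on held-out characters rather than on the rows it was trained on.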