This project is an investigation into both DC & Marvel characters. The dataset can be found here.
The purpose of this project is to demonstrate a framework for data pre-processing and to perform Exploratory Data Analysis (EDA). First, we will load the libraries needed for our initial processing and EDA.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
Next, we will read in the data. The source files are divided into DC characters & Marvel characters. Take a glance at the data by examining the top six rows of each.
#treat empty fields as NA and read text columns as factors
dc_comics = read.csv("dc-wikia-data.csv", na.strings = "", stringsAsFactors = TRUE)
marvel_comics = read.csv("marvel-wikia-data.csv", na.strings = "", stringsAsFactors = TRUE)
head(dc_comics)
## page_id name urlslug
## 1 1422 Batman (Bruce Wayne) \\/wiki\\/Batman_(Bruce_Wayne)
## 2 23387 Superman (Clark Kent) \\/wiki\\/Superman_(Clark_Kent)
## 3 1458 Green Lantern (Hal Jordan) \\/wiki\\/Green_Lantern_(Hal_Jordan)
## 4 1659 James Gordon (New Earth) \\/wiki\\/James_Gordon_(New_Earth)
## 5 1576 Richard Grayson (New Earth) \\/wiki\\/Richard_Grayson_(New_Earth)
## 6 1448 Wonder Woman (Diana Prince) \\/wiki\\/Wonder_Woman_(Diana_Prince)
## ID ALIGN EYE HAIR SEX GSM
## 1 Secret Identity Good Characters Blue Eyes Black Hair Male Characters <NA>
## 2 Secret Identity Good Characters Blue Eyes Black Hair Male Characters <NA>
## 3 Secret Identity Good Characters Brown Eyes Brown Hair Male Characters <NA>
## 4 Public Identity Good Characters Brown Eyes White Hair Male Characters <NA>
## 5 Secret Identity Good Characters Blue Eyes Black Hair Male Characters <NA>
## 6 Public Identity Good Characters Blue Eyes Black Hair Female Characters <NA>
## ALIVE APPEARANCES FIRST.APPEARANCE YEAR
## 1 Living Characters 3093 1939, May 1939
## 2 Living Characters 2496 1986, October 1986
## 3 Living Characters 1565 1959, October 1959
## 4 Living Characters 1316 1987, February 1987
## 5 Living Characters 1237 1940, April 1940
## 6 Living Characters 1231 1941, December 1941
head(marvel_comics)
## page_id name
## 1 1678 Spider-Man (Peter Parker)
## 2 7139 Captain America (Steven Rogers)
## 3 64786 Wolverine (James \\"Logan\\" Howlett)
## 4 1868 Iron Man (Anthony \\"Tony\\" Stark)
## 5 2460 Thor (Thor Odinson)
## 6 2458 Benjamin Grimm (Earth-616)
## urlslug ID ALIGN
## 1 \\/Spider-Man_(Peter_Parker) Secret Identity Good Characters
## 2 \\/Captain_America_(Steven_Rogers) Public Identity Good Characters
## 3 \\/Wolverine_(James_%22Logan%22_Howlett) Public Identity Neutral Characters
## 4 \\/Iron_Man_(Anthony_%22Tony%22_Stark) Public Identity Good Characters
## 5 \\/Thor_(Thor_Odinson) No Dual Identity Good Characters
## 6 \\/Benjamin_Grimm_(Earth-616) Public Identity Good Characters
## EYE HAIR SEX GSM ALIVE APPEARANCES
## 1 Hazel Eyes Brown Hair Male Characters <NA> Living Characters 4043
## 2 Blue Eyes White Hair Male Characters <NA> Living Characters 3360
## 3 Blue Eyes Black Hair Male Characters <NA> Living Characters 3061
## 4 Blue Eyes Black Hair Male Characters <NA> Living Characters 2961
## 5 Blue Eyes Blond Hair Male Characters <NA> Living Characters 2258
## 6 Blue Eyes No Hair Male Characters <NA> Living Characters 2255
## FIRST.APPEARANCE Year
## 1 Aug-62 1962
## 2 Mar-41 1941
## 3 Oct-74 1974
## 4 Mar-63 1963
## 5 Nov-50 1950
## 6 Nov-61 1961
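The `<NA>` entries in the gsm column above come from the na.strings argument used when reading the files. As a minimal sketch of that behavior, the snippet below reads a small inline CSV (the rows and values are made up for illustration, not taken from the source files) and shows that an empty field becomes NA instead of an empty factor level.

```r
# Toy CSV text standing in for the real files (hypothetical rows)
csv_text <- "name,EYE,APPEARANCES\nBatman,Blue Eyes,3093\nNobody,,\n"

toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

# The empty EYE field is read as NA rather than as an empty "" level
is.na(toy$EYE)
levels(toy$EYE)
```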
Both files have the same attribute columns, but the header case differs (for example, YEAR in the DC file and Year in the Marvel file), so it must be standardized before the two are combined. To distinguish the sources once they have been combined, we will add a publisher column.
#make the column headers all lowercase
names(dc_comics)[4:13] <- tolower(names(dc_comics[4:13]))
names(marvel_comics)[4:13] <- tolower(names(marvel_comics[4:13]))
#add publisher column
dc_comics$publisher <- "dc"
marvel_comics$publisher<-"marvel"
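A quick way to confirm that the headers really line up before combining is to compare the name vectors directly. The sketch below uses two tiny toy data frames (dc_toy and marvel_toy are hypothetical stand-ins, not the real data) whose only difference is the case of the year column.

```r
# Toy frames with hypothetical values; only the header case differs
dc_toy     <- data.frame(name = "A", EYE = "Blue Eyes", YEAR = 1939)
marvel_toy <- data.frame(name = "B", EYE = "Hazel Eyes", Year = 1962)

# Before standardizing, the year column does not match
setdiff(names(dc_toy), names(marvel_toy))

names(dc_toy)     <- tolower(names(dc_toy))
names(marvel_toy) <- tolower(names(marvel_toy))

# After lowercasing, the headers are identical and the rows can be stacked
identical(names(dc_toy), names(marvel_toy))
combined <- rbind(dc_toy, marvel_toy)
```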
We will perform a union of the datasets and take a glance at the structure.
comics <- union(dc_comics, marvel_comics)
str(comics)
## 'data.frame': 23272 obs. of 14 variables:
## $ page_id : int 1422 23387 1458 1659 1576 1448 1486 1451 71760 1380 ...
## $ name : Factor w/ 23272 levels "3g4 (New Earth)",..: 598 6007 2488 3002 5280 6771 378 6289 1695 2185 ...
## $ urlslug : Factor w/ 23272 levels "\\/wiki\\/3g4_(New_Earth)",..: 598 6006 2488 3003 5279 6771 378 6288 1695 2185 ...
## $ id : Factor w/ 5 levels "Identity Unknown",..: 3 3 3 2 3 2 2 3 2 3 ...
## $ align : Factor w/ 4 levels "Bad Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 4 4 5 5 4 4 4 4 4 4 ...
## $ hair : Factor w/ 28 levels "Black Hair","Blond Hair",..: 1 1 4 17 1 1 2 1 2 2 ...
## $ sex : Factor w/ 6 levels "Female Characters",..: 3 3 3 3 3 1 3 3 1 3 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
## $ first.appearance: Factor w/ 1606 levels "1935, October",..: 15 455 156 461 20 33 39 486 261 129 ...
## $ year : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
## $ publisher : chr "dc" "dc" "dc" "dc" ...
Once combined, there are 23,272 observations of 14 variables: 10 factor variables, 3 integer variables, and 1 character variable. Now that the dataset is combined, we will begin the pre-processing and data cleaning.
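The tally of column types can also be computed rather than read off the str() output. A minimal sketch on a toy data frame (the three columns here are illustrative, not the full dataset):

```r
# Toy frame with one column of each type (hypothetical values)
toy <- data.frame(
  name        = factor(c("A", "B")),
  appearances = c(10L, 20L),
  publisher   = c("dc", "marvel"),
  stringsAsFactors = FALSE
)

# Count how many columns fall into each class
class_counts <- table(sapply(toy, class))
class_counts
```

Running the same two lines on the combined data frame should reproduce the factor, integer, and character counts quoted above.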
For the purposes of this analysis, we will not be using “page_id” or “urlslug”, so we will drop both of these columns.
comics <- comics[-c(1,3)]
Gender is a more accurate term for the features in the sex column. We will rename that column now.
comics <- rename(comics, gender = sex)
Next, we will remove repetitive or superfluous words from each column. For example, the word “Characters” is redundant since every row is a character. Because str_remove_all() returns character strings, we wrap each call in as.factor() to restore the factor type after the unwanted text is removed.
comics = comics %>%
  mutate(align = as.factor(str_remove_all(align, " Characters"))) %>%
  mutate(id = as.factor(str_remove_all(id, " Identity"))) %>%
  mutate(eye = as.factor(str_remove_all(eye, " Eyes"))) %>%
  mutate(hair = as.factor(str_remove_all(hair, " Hair"))) %>%
  mutate(gender = as.factor(str_remove_all(gender, " Characters"))) %>%
  mutate(gsm = as.factor(str_remove_all(gsm, " Characters"))) %>%
  mutate(alive = as.factor(str_remove_all(alive, " Characters")))
Next, we will take a look at each of the columns’ levels and distributions within the levels to determine if any should be regrouped or re-assigned as is logically appropriate.
id
table(comics$id)
##
## Identity Unknown Known to Authorities No Dual
## 9 15 1788
## Public Secret
## 6994 8683
#convert to character so strings may be edited
comics$id <- as.character(comics$id)
#reassign id = "Identity Unknown" to "Unknown" & "Known to Authorities" to "Secret"
comics["id"][comics["id"] == "Identity Unknown"] <- "Unknown"
comics["id"][comics["id"] == "Known to Authorities"] <- "Secret"
#return to factor
comics$id <- as.factor(comics$id)
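As an aside, the same relabeling can be done without the round trip through character by editing the factor's levels directly. This is only a sketch of an alternative on a toy vector, not the approach used in this project; it relies on base R merging levels that end up with the same label.

```r
# Toy factor with the same labels as the id column (hypothetical values)
id <- factor(c("Secret", "Public", "Identity Unknown", "Known to Authorities", NA))

# Rename one level, then fold another into an existing level
levels(id)[levels(id) == "Identity Unknown"] <- "Unknown"
levels(id)[levels(id) == "Known to Authorities"] <- "Secret"

# NA values are left untouched
table(id, useNA = "ifany")
```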
align
table(comics$align)
##
## Bad Good Neutral Reformed Criminals
## 9615 7468 2773 3
#convert to character so string may be edited
comics$align<-as.character(comics$align)
#blank out "Reformed Criminals" since there are so few in the group
comics["align"][comics["align"] == "Reformed Criminals"] <- ""
#revert to factor
comics$align<-as.factor(comics$align)
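One caveat with blanking a value to an empty string: the result is an empty "" factor level, not a missing value, so is.na() does not count it. The sketch below illustrates the difference on a toy vector and shows assigning NA directly as an optional alternative. If that alternative were applied to the full dataset, the NA counts reported later would presumably come out slightly higher than the ones shown.

```r
# Toy vector mirroring the align column (hypothetical values)
align <- c("Bad", "Good", "Reformed Criminals", NA)
align[align %in% "Reformed Criminals"] <- ""

sum(is.na(align))            # the blank is not counted as missing
align_f <- as.factor(align)
levels(align_f)              # includes an empty "" level

# Optional alternative: assign NA so the value is treated as missing
align_na <- c("Bad", "Good", "Reformed Criminals", NA)
align_na[align_na %in% "Reformed Criminals"] <- NA
sum(is.na(align_na))
```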
eye
table(comics$eye)
##
## Amber Auburn Hair Black Black Eyeballs Blue
## 15 7 967 3 3064
## Brown Compound Gold Green Grey
## 2803 1 23 904 135
## Hazel Magenta Multiple No One Eye
## 99 2 7 7 21
## Orange Photocellular Pink Purple Red
## 35 48 27 45 716
## Silver Variable Violet White Yellow
## 12 49 23 516 342
## Yellow Eyeballs
## 6
comics$eye <- as.character(comics$eye)
comics["eye"][comics["eye"] == "Black Eyeballs"] <- "Black"
comics["eye"][comics["eye"] == "Compound"] <- "Multiple"
comics["eye"][comics["eye"] == "Magenta"] <- "Pink"
comics["eye"][comics["eye"] == "Yellow Eyeballs"] <- "Yellow"
comics$eye <-as.factor(comics$eye)
hair
table(comics$hair)
##
## Auburn Bald Black Blond
## 78 838 5329 2326
## Blue Bronze Brown Dyed
## 97 1 3487 1
## Gold Green Grey Light Brown
## 13 159 688 6
## Magenta No Orange Orange-brown
## 5 1176 64 3
## Pink Platinum Blond Purple Red
## 42 2 79 1081
## Reddish Blond Reddish Brown Silver Strawberry Blond
## 6 3 19 75
## Variable Violet White Yellow
## 32 4 1100 20
#recategorize some into the larger groupings
comics$hair <- as.character(comics$hair)
comics["hair"][comics["hair"] == "No"] <- "Bald"
comics["hair"][comics["hair"] == "Bronze"] <- "Brown"
comics["hair"][comics["hair"] == "Gold"] <- "Blond"
comics["hair"][comics["hair"] == "Light Brown"] <- "Brown"
comics["hair"][comics["hair"] == "Magenta"] <- "Pink"
comics["hair"][comics["hair"] == "Dyed"] <- ""
comics["hair"][comics["hair"] == "Orange-brown"] <- "Orange"
comics["hair"][comics["hair"] == "Platinum Blond"] <- "Blond"
comics["hair"][comics["hair"] == "Reddish Blond"] <- "Auburn"
comics["hair"][comics["hair"] == "Reddish Brown"] <- "Auburn"
comics["hair"][comics["hair"] == "Strawberry Blond"] <- "Auburn"
comics["hair"][comics["hair"] == "Violet"] <- "Purple"
comics["hair"][comics["hair"] == "Yellow"] <- "Blond"
comics$hair <- as.factor(comics$hair)
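When there are this many reassignments, a named lookup vector can keep the mapping in one place. The following is a sketch of that idea on a toy vector; recode_map is a hypothetical name and only a few of the mappings above are included.

```r
# Toy vector with a few of the hair labels (hypothetical values)
hair <- c("No", "Black", "Gold", "Strawberry Blond", NA, "Brown")

# old label = new label
recode_map <- c(
  "No"               = "Bald",
  "Gold"             = "Blond",
  "Strawberry Blond" = "Auburn"
)

# Replace only the values that appear in the map; NA and other labels are kept
hit <- hair %in% names(recode_map)
hair[hit] <- unname(recode_map[hair[hit]])
hair <- as.factor(hair)
levels(hair)
```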
gender
table(comics$gender)
##
## Agender Female Genderfluid Genderless Male Transgender
## 45 5804 2 20 16421 1
comics$gender <- as.character(comics$gender)
comics["gender"][comics["gender"] == "Agender"] <- "Other"
comics["gender"][comics["gender"] == "Transgender"] <- "Other"
comics["gender"][comics["gender"] == "Genderless"] <- "Other"
comics["gender"][comics["gender"] == "Genderfluid"] <- "Other"
comics$gender<-as.factor(comics$gender)
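The four reassignments above amount to "everything that is not Female or Male becomes Other". A compact sketch of that rule on a toy vector is shown below; the one subtlety is that missing values need an explicit guard, otherwise they would be relabeled as Other too.

```r
# Toy vector mirroring the gender column (hypothetical values)
gender <- c("Male", "Female", "Agender", "Genderfluid", NA)
keep <- c("Female", "Male")

# NA %in% keep is FALSE, so guard with !is.na() to keep NA as missing
gender[!is.na(gender) & !(gender %in% keep)] <- "Other"
gender <- as.factor(gender)
table(gender, useNA = "ifany")
```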
gsm
table(comics$gsm)
##
## Bisexual Genderfluid Homosexual Pansexual Transgender
## 29 1 120 1 2
## Transvestites
## 1
comics$gsm<-as.character(comics$gsm)
comics["gsm"][comics["gsm"] == "Genderfluid"] <- ""
comics["gsm"][comics["gsm"] == "Pansexual"] <- ""
comics["gsm"][comics["gsm"] == "Transgender"] <- ""
comics["gsm"][comics["gsm"] == "Transvestites"] <- ""
comics$gsm<-as.factor(comics$gsm)
year
#Convert to a factor
comics$year = as.factor(comics$year)
publisher
#convert to factor
comics$publisher <- as.factor(comics$publisher)
Due to the number of NA values in this dataset, we cannot simply drop every row that contains an NA and keep only the complete cases. That said, we should have a sense of where the large gaps are as we progress to the exploratory phase.
#count the NA values in each column
col_names <- colnames(comics)
for (i in col_names){
  response = c('Number of NAs for', i, sum(is.na(comics[i])))
  print(response)
}
## [1] "Number of NAs for" "name" "0"
## [1] "Number of NAs for" "id" "5783"
## [1] "Number of NAs for" "align" "3413"
## [1] "Number of NAs for" "eye" "13395"
## [1] "Number of NAs for" "hair" "6538"
## [1] "Number of NAs for" "gender" "979"
## [1] "Number of NAs for" "gsm" "23118"
## [1] "Number of NAs for" "alive" "6"
## [1] "Number of NAs for" "appearances" "1451"
## [1] "Number of NAs for" "first.appearance" "884"
## [1] "Number of NAs for" "year" "884"
## [1] "Number of NAs for" "publisher" "0"
gsm and eye have a large share of NA values (roughly 99% and 58% of rows, respectively). We will keep them in the dataset, but this should be kept in mind if we choose to investigate either attribute more closely.
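The same summary can be produced without a loop, with the share of missing rows alongside the counts. A minimal sketch on a toy data frame (the values are made up; na_summary is a hypothetical name):

```r
# Toy frame with different amounts of missing data (hypothetical values)
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  eye  = c("Blue", NA, NA, NA),
  gsm  = c(NA, NA, NA, NA_character_)
)

# Count and percentage of NA values per column
na_summary <- data.frame(
  n_missing   = colSums(is.na(toy)),
  pct_missing = round(100 * colMeans(is.na(toy)), 1)
)

# Show the columns with the largest gaps first
na_summary[order(-na_summary$n_missing), ]
```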
From our original pre-processing, we have a sense of the levels that are present in each of the factor variables. We will begin by visualizing the distribution of attributes.
#order the levels; the blanked "Reformed Criminals" values become NA here
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))
ggplot(comics, aes(x=align))+
  geom_bar(fill = "#005293") +
  theme(axis.text.x = element_text(angle=90))+
  labs(
    title = "Distribution of Alignment",
    x="Alignment",
    y="Number of Characters")
The largest proportion of characters in the dataset are classified as “Bad.” Let’s now look more into how this alignment is distributed across other attributes.
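To put a number on that, the shares can be computed from the counts in the table(comics$align) output shown earlier. This sketch hard-codes those three counts and leaves out the NA values and the three “Reformed Criminals”; on that basis “Bad” comes to just under half of the characters.

```r
# Counts copied from the table(comics$align) output earlier in this post
align_counts <- c(Bad = 9615, Good = 7468, Neutral = 2773)

# Percentage of characters in each of the three main alignments
align_share <- round(100 * align_counts / sum(align_counts), 1)
align_share
```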
2) Gender & Alignment
ggplot(data=subset(comics, !is.na(align)), aes(x=gender, fill=align)) +
  geom_bar(position="dodge") +
  theme(axis.text.x = element_text(angle=90)) +
  labs(
    title="DC & Marvel Character Alignment by Gender",
    x = "Gender",
    y = "Number of Characters"
  ) +
  scale_fill_brewer(palette="Set1")
comics %>%
  filter(!is.na(id) & !is.na(align))%>%
  ggplot(aes(x=id, fill=align))+
  geom_bar(position="fill")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Proportion of Character Alignment by Persona",
    x="Persona",
    y="Proportion Alignment",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
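With position="fill", each bar is rescaled so that its segments sum to 1, which is the same as taking row proportions of a cross-tabulation. A small sketch on toy data (the six rows are hypothetical) shows the numbers behind such a chart:

```r
# Toy data: persona and alignment for six hypothetical characters
toy <- data.frame(
  id    = c("Public", "Public", "Secret", "Secret", "Secret", "Secret"),
  align = c("Good", "Bad", "Bad", "Bad", "Bad", "Good")
)

# Cross-tabulate, then convert each row to proportions
counts <- table(toy$id, toy$align)
shares <- prop.table(counts, margin = 1)
shares
```

Looking at the table alongside the plot can help when two bars look similar but rest on very different group sizes.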
comics %>%
  filter(!is.na(eye) & !is.na(align) & eye != "")%>%
  ggplot(aes(x=eye, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Eye Color",
    x="Eye Color",
    y="Number of Characters",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
comics %>%
  filter(!is.na(hair) & !is.na(align) & hair != "")%>%
  ggplot(aes(x=hair, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Hair Color",
    x="Hair Color",
    y="Number of Characters",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
comics %>%
  filter(!is.na(gsm) & !is.na(align) & gsm != "")%>%
  ggplot(aes(x=gsm, fill=align))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=90))+
  labs(
    title="Character Alignment by Sexuality",
    x="Sexuality",
    y="Number of Characters",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
lvls <- levels(comics$year)
comics %>%
  filter(!is.na(year) & !is.na(align) & year != "")%>%
  ggplot(aes(x=year, fill=align))+
  geom_bar(position="fill")+
  labs(
    title="Character Alignment by Year",
    x="Year",
    y="Proportion Alignment",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1") +
  #label every fourth year level to keep the axis readable
  scale_x_discrete(breaks = lvls[seq(1, length(lvls), by=4)])+
  theme(axis.text.x = element_text(angle=90))
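Because year was converted to a factor, the axis breaks in this plot are chosen by position in the level list, not by calendar arithmetic. The toy sketch below (with a hypothetical handful of years) shows what the seq() expression selects: every fourth level, which is not necessarily every fourth year when some years are absent.

```r
# Toy year factor with gaps between some years (hypothetical values)
year <- factor(c(1939, 1940, 1941, 1959, 1962, 1963, 1974, 1986, 1987))
lvls <- levels(year)

# Keep the 1st, 5th, 9th, ... level as axis labels
tick_labels <- lvls[seq(1, length(lvls), by = 4)]
tick_labels
```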
comics %>%
  filter(!is.na(align))%>%
  ggplot(aes(x=publisher, fill=align))+
  geom_bar(position="fill")+
  labs(
    title="Character Alignment by Publisher",
    x="Publisher",
    y="Proportion Alignment",
    fill="Alignment"
  ) +
  scale_fill_brewer(palette = "Set1")
comics %>%
  filter(!is.na(align) & !is.na(appearances) & !is.na(gender)) %>%
  ggplot(aes(x=align, y=appearances, color=gender))+
  geom_jitter() +
  labs(
    title = "Character Alignment by Number of Appearances",
    x="Alignment",
    y="Number of Character Appearances",
    color="Gender"
  )
The next step I would take with this project is to dive deeper into the exploratory data analysis to identify other interesting trends among the attributes besides alignment.
This dataset could also be used to train a multinomial logistic regression model that predicts “Good”, “Bad”, or “Neutral” from a set of character attributes.
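As a rough sketch of what such a model might look like, the snippet below fits a multinomial logistic regression with multinom() from the nnet package. Everything here is an assumption for illustration: the data are randomly simulated rather than taken from the comics dataset, the two predictors are arbitrary, and no claim is made about how well a real model would perform.

```r
library(nnet)  # recommended package that ships with most R installations

# Simulated stand-in data (not the real comics dataset)
set.seed(42)
n <- 300
toy <- data.frame(
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  publisher = factor(sample(c("dc", "marvel"), n, replace = TRUE)),
  align     = factor(sample(c("Bad", "Good", "Neutral"), n, replace = TRUE,
                            prob = c(0.5, 0.35, 0.15)))
)

# Fit a three-class model; trace = FALSE silences the optimizer log
fit <- multinom(align ~ gender + publisher, data = toy, trace = FALSE)

# Predict the class and class probabilities for one hypothetical character
new_char <- data.frame(
  gender    = factor("Female", levels = levels(toy$gender)),
  publisher = factor("dc", levels = levels(toy$publisher))
)
pred_class <- predict(fit, newdata = new_char, type = "class")
pred_probs <- predict(fit, newdata = new_char, type = "probs")
```

On the real data, the many NA values noted earlier would need to be handled before fitting, since rows with missing predictors are typically dropped.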