#title: "Ozer-Week2-Assignment1"
# For Exploratory Data Analysis I chose a categorical data set. I have been seeing the comic books data set in my Google searches since I started my MSDS, and I thought it would be interesting and fun to revisit the characters from my childhood and maybe understand why I liked certain ones more than others. For practice, I reviewed the MSDS 600 material as well as the R tutorials book and watched a DataCamp tutorial.
#1. Provide a brief introduction to your data
#There are two major publishers of comics: Marvel and DC Comics. The data set contains the name of each superhero, his/her alignment (good or bad), physical attributes, whether the character is alive or deceased, the number of appearances, when the character was first introduced, and the publisher. I downloaded the data set from Kaggle.
#2. Discuss what you hope to discover about your data or certain questions that you will solve for
#I would like to concentrate on id, alignment and gender. How does alignment differ by gender? Which genders are typically bad and which are not? I would also like to know the association between id and gender, and perhaps add alignment to the mix. If you are a good superhero, would you reveal yourself? I used ggplot, counts and proportions for the analysis.
#The answers for these questions are below. My summary/findings and thoughts are at the end of each cell.
#3. Show a header or summary of your data and explain the columns where applicable
#4. Work in R to find your answers to the questions in part 2
#5. A summary of your findings, did you expect this? did you not?
#6. Include your code either embedded in your write-up or as a separate attachment.
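#Load the data from a local file. The path is written with plain spaces because "\ " is not a valid escape in an R string. stringsAsFactors = TRUE is set explicitly because R 4.0 and later no longer convert strings to factors by default.
comics <- read.csv(file = "/Users/jay/Desktop/CODE/Data Camp/Exploratory Data Analysis/DataSets/comics.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)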
head(comics)
## name id align eye
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes
## 2 Captain America (Steven Rogers) Public Good Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes
## hair gender gsm alive appearances first_appear
## 1 Brown Hair Male <NA> Living Characters 4043 Aug-62
## 2 White Hair Male <NA> Living Characters 3360 Mar-41
## 3 Black Hair Male <NA> Living Characters 3061 Oct-74
## 4 Black Hair Male <NA> Living Characters 2961 Mar-63
## 5 Blond Hair Male <NA> Living Characters 2258 Nov-50
## 6 No Hair Male <NA> Living Characters 2255 Nov-61
## publisher
## 1 marvel
## 2 marvel
## 3 marvel
## 4 marvel
## 5 marvel
## 6 marvel
#I see that most of the variables are factors, which is the way to represent categorical data. The only numerical variable is the aggregated number of appearances, which is an integer.
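#A quick way to confirm which columns are factors is to ask for the class of every column at once. This is only a sketch on a small made-up data frame (toy_chars is a hypothetical name, not part of the comics data), so it runs on its own.
toy_chars <- data.frame(
  name = c("A", "B", "C"),
  align = c("Good", "Bad", "Good"),
  appearances = c(10L, 3L, 7L),
  stringsAsFactors = TRUE
)
# Class of each column, returned as a named character vector
col_classes <- sapply(toy_chars, class)
col_classes
# Names of the columns stored as factors
names(col_classes)[col_classes == "factor"]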
#Check the class. It is a df.
class(comics)
## [1] "data.frame"
str(comics)
## 'data.frame': 23272 obs. of 11 variables:
## $ name : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
## $ id : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
## $ align : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
## $ hair : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
## $ first_appear: Factor w/ 1606 levels "1935, October",..: 874 1278 1513 1296 1424 1432 1432 1364 1570 1432 ...
## $ publisher : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...
#result summaries
summary(comics)
## name
## 'Spinner (Earth-616) : 1
## \\"Spider-Girl\\" (Mutant\\/Spider Clone) (Earth-616): 1
## \\"Thumper\\" Morgan (Earth-616) : 1
## \\u00c4kr\\u00e4s (Earth-616) : 1
## 107 (Earth-616) : 1
## 11-Ball (Earth-616) : 1
## (Other) :23266
## id align eye
## No Dual:1788 Bad :9615 Blue Eyes : 3064
## Public :6994 Good :7468 Brown Eyes: 2803
## Secret :8698 Neutral :2773 Black Eyes: 967
## Unknown: 9 Reformed Criminals: 3 Green Eyes: 904
## NA's :5783 NA's :3413 Red Eyes : 716
## (Other) : 1423
## NA's :13395
## hair gender gsm
## Black Hair:5329 Female: 5804 Bisexual Characters : 29
## Brown Hair:3487 Male :16421 Genderfluid Characters: 1
## Blond Hair:2326 Other : 68 Homosexual Characters : 120
## No Hair :1176 NA's : 979 Pansexual Characters : 1
## White Hair:1100 Transgender Characters: 2
## (Other) :3316 Transvestites : 1
## NA's :6538 NA's :23118
## alive appearances first_appear
## Deceased Characters: 5458 Min. : 1.00 2010, December: 78
## Living Characters :17808 1st Qu.: 1.00 Jun-92 : 72
## NA's : 6 Median : 4.00 May-93 : 69
## Mean : 19.01 Sep-06 : 67
## 3rd Qu.: 10.00 Jan-94 : 66
## Max. :4043.00 (Other) :22036
## NA's :1451 NA's : 884
## publisher
## dc : 6896
## marvel:16376
##
##
##
##
##
#Some exploratory analysis to find out more about the data
#names of the variables
names(comics)
## [1] "name" "id" "align" "eye"
## [5] "hair" "gender" "gsm" "alive"
## [9] "appearances" "first_appear" "publisher"
#number of columns
ncol(comics)
## [1] 11
#number of rows
nrow(comics)
## [1] 23272
#Here I use levels() to see the unique values for a few of the variables. These are some of the variables I am interested in, and I would like to know their distinct values to set the scope of the analysis.
levels(comics$id) # "Public" and "Secret" were expected, but I would not have thought of "Unknown" and "No Dual". After googling, "No Dual" means No Dual Identity (used for Marvel characters only).
## [1] "No Dual" "Public" "Secret" "Unknown"
levels(comics$gender) # there are three genders.
## [1] "Female" "Male" "Other"
levels(comics$publisher) # only DC and Marvel comics as publishers
## [1] "dc" "marvel"
levels(comics$align) # This is interesting. I expected only "Good" and "Bad", but "Neutral" and "Reformed Criminals" are also levels.
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
#I also see a variable called gsm. Googling it returns: whether the character is a gender or sexual minority (e.g. homosexual characters, bisexual characters). Here I also notice that the levels() function ignores NAs.
levels(comics$gsm)
## [1] "Bisexual Characters" "Genderfluid Characters"
## [3] "Homosexual Characters" "Pansexual Characters"
## [5] "Transgender Characters" "Transvestites"
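#The point about levels() and NAs can be checked on a tiny factor. This is a sketch with made-up values (toy_id_fct is a hypothetical name); addNA() is the base R function that turns NA into an explicit level when that is wanted.
toy_id_fct <- factor(c("Secret", "Public", NA, "Secret"))
# NA is stored as a missing value, not as a level
toy_levels <- levels(toy_id_fct)
toy_levels
# Count the missing values separately
toy_missing <- sum(is.na(toy_id_fct))
toy_missing
# addNA() adds NA as an extra level
toy_levels_na <- levels(addNA(toy_id_fct))
toy_levels_na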
#I don't understand gsm when I read it, so I am renaming the column
install.packages("reshape", repos = "http://cran.us.r-project.org")
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
library(reshape)
#new name is gendersex_minority
comics <- rename(comics, c(gsm="gendersex_minority"))
names(comics)
## [1] "name" "id" "align"
## [4] "eye" "hair" "gender"
## [7] "gendersex_minority" "alive" "appearances"
## [10] "first_appear" "publisher"
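#A column can also be renamed in base R, without the reshape package. This is a sketch on a made-up data frame (toy_rename is a hypothetical name). It may be handy here because dplyr, loaded later, masks reshape's rename() and expects a different argument form (new = old).
toy_rename <- data.frame(
  gsm = c(NA, "Bisexual Characters"),
  publisher = c("dc", "marvel")
)
# Replace the name of the column called "gsm" and leave the others alone
names(toy_rename)[names(toy_rename) == "gsm"] <- "gendersex_minority"
names(toy_rename)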
#Next I would like to know how identity is distributed across alignment.
table(comics$id, comics$align)
##
## Bad Good Neutral Reformed Criminals
## No Dual 474 647 390 0
## Public 2172 2930 965 1
## Secret 4493 2475 959 1
## Unknown 7 0 2 0
#There are 7 characters with an unknown identity that are bad and none that are good. No surprises there: my identity would be unknown too if I chose to be bad.
#It is also no surprise that most secret-identity characters are bad. Next, I would like to see the totals for each row and column.
addmargins(table(comics$id, comics$align))
##
## Bad Good Neutral Reformed Criminals Sum
## No Dual 474 647 390 0 1511
## Public 2172 2930 965 1 6068
## Secret 4493 2475 959 1 7928
## Unknown 7 0 2 0 9
## Sum 7146 6052 2316 2 15516
#There are more bad characters than good ones (7146 vs. 6052). This is a massive surprise to me. If all 2316 neutral ones decided to be good, then good would win, but what are the chances of that?
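#The grand total of 15516 is smaller than the 23272 rows in the data because table() leaves out NA values by default. The sketch below shows the difference on a made-up data frame (toy_na is a hypothetical name) using the useNA argument of table().
toy_na <- data.frame(
  id    = c("Secret", "Public", NA, "Secret", "Public"),
  align = c("Bad", "Good", "Good", NA, "Bad")
)
# Default: rows with NA in either variable are not counted
tab_default <- table(toy_na$id, toy_na$align)
# useNA = "ifany" adds an NA row/column whenever missing values exist
tab_with_na <- table(toy_na$id, toy_na$align, useNA = "ifany")
sum(tab_default)
sum(tab_with_na)
addmargins(tab_with_na)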
#Tables are great for insights, but I would like to see the distribution in graphs and will use ggplot2 for that. Here is the syntax so I can use it as a template: specify the data set, the variables of interest, and the layers that describe how the variables are plotted. aes stands for aesthetics.
#ggplot(data, aes(x = var1, fill = var2)) + layer_name()
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
library(ggplot2)
#Stacked bar charts - I think it looks nice to show the relationship between two categorical variables with stacked bar.
ggplot(comics, aes(x=id, fill=align)) +
geom_bar()

#A couple of things stand out. Although only a few characters have an explicit "Unknown" identity, many more have an identity recorded as NA, meaning there is no data. The graph also makes it easier to see that bad characters generally hide their identities. Bad is not always the largest category, though: there are quite a few good characters with a public identity. This suggests an association between a character's alignment and identity (that is, the distribution of alignment differs from one identity type to another).
#I used to read about male superheroes predominantly. Next I am going to create a contingency table of alignment and gender to explore this further, along with a stacked bar chart.
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
ggplot(comics, aes(x=align, fill=gender)) +
geom_bar()

#The chart shows that male characters dominate in number. From the contingency table I also see that female characters are typically good. There are very few records for the Other gender and for reformed criminals, so I can remove those records to get cleaner output.
#Before I start the next section, I need more info.
head(comics,4)
## name id align eye
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes
## 2 Captain America (Steven Rogers) Public Good Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes
## hair gender gendersex_minority alive appearances
## 1 Brown Hair Male <NA> Living Characters 4043
## 2 White Hair Male <NA> Living Characters 3360
## 3 Black Hair Male <NA> Living Characters 3061
## 4 Black Hair Male <NA> Living Characters 2961
## first_appear publisher
## 1 Aug-62 marvel
## 2 Mar-41 marvel
## 3 Oct-74 marvel
## 4 Mar-63 marvel
nrow(comics)
## [1] 23272
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:reshape':
##
## rename
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Remove the "Reformed Criminals" level from align and the "Other" level from gender. Trying out & inside dplyr's filter().
comics <- comics %>%
filter(align != "Reformed Criminals" & gender != "Other") %>%
droplevels()
nrow(comics)
## [1] 19068
comics %>%
filter(align == "Reformed Criminals" | gender == "Other")
## [1] name id align
## [4] eye hair gender
## [7] gendersex_minority alive appearances
## [10] first_appear publisher
## <0 rows> (or 0-length row.names)
#Works! I have now used both & and |.
#Same alignment-by-gender chart as before, but faceted by gender for a better look.
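#The row count fell from 23272 to 19068, which is far more than the 3 reformed criminals and 68 Other characters. A likely reason is that filter() keeps only rows where the condition is TRUE, so rows with NA in align or gender are dropped as well. The sketch below shows this on a made-up data frame (toy_filter is a hypothetical name) and one way to keep the NA rows if that is preferred.
library(dplyr)
toy_filter <- data.frame(
  align  = c("Good", "Bad", NA, "Reformed Criminals"),
  gender = c("Male", NA, "Female", "Male")
)
# Comparisons with NA return NA, and filter() drops those rows
kept_default <- toy_filter %>%
  filter(align != "Reformed Criminals" & gender != "Other")
nrow(kept_default)
# Keeping NA rows requires saying so explicitly with is.na()
kept_with_na <- toy_filter %>%
  filter((is.na(align) | align != "Reformed Criminals") &
           (is.na(gender) | gender != "Other"))
nrow(kept_with_na)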
ggplot(comics, aes(x=align)) +
geom_bar() +
facet_wrap(~gender)

#Reformed criminals and Other no longer appear. Male characters outnumber female characters in every alignment category. Males are most often bad and females are most often good.
#Without gender. Reordering the levels puts Neutral in the middle, which shows how factor level order affects ggplot.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
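#The reordering works because bar order follows the order of the factor levels, which is alphabetical by default. This sketch uses a made-up factor (toy_align is a hypothetical name) and also shows a pitfall: any value left out of levels silently becomes NA.
toy_align <- factor(c("Good", "Bad", "Neutral", "Bad"))
# Default order is alphabetical
levels(toy_align)
# Same values, new level order
toy_align_ordered <- factor(toy_align, levels = c("Bad", "Neutral", "Good"))
levels(toy_align_ordered)
# Counts are reported in the new order
table(toy_align_ordered)
# "Neutral" is missing from levels here, so it turns into NA
toy_align_dropped <- factor(toy_align, levels = c("Bad", "Good"))
sum(is.na(toy_align_dropped))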

#Same alignment chart, now filled by gender and side by side for ease of reading. Also trying out theme() to rotate the axis text 90 degrees.
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))

#Some of the things I can say: there are more male characters than female characters. Bad is the most common alignment for males, while good is the most common for females. Even so, among good characters males are still the most numerous.
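#geom_bar() takes a position argument that controls how the filled groups are arranged. The sketch below builds the three common variants on a made-up data frame (toy_plot and the p_ names are hypothetical) so they can be compared; print any of the objects to draw it.
library(ggplot2)
toy_plot <- data.frame(
  align  = c("Bad", "Bad", "Good", "Good", "Good", "Neutral"),
  gender = c("Male", "Female", "Male", "Female", "Female", "Male")
)
# Default position is "stack": counts piled on top of each other
p_stack <- ggplot(toy_plot, aes(x = align, fill = gender)) +
  geom_bar()
# "dodge": one bar per group, side by side, still showing counts
p_dodge <- ggplot(toy_plot, aes(x = align, fill = gender)) +
  geom_bar(position = "dodge")
# "fill": every bar rescaled to height 1, showing proportions
p_fill <- ggplot(toy_plot, aes(x = align, fill = gender)) +
  geom_bar(position = "fill") +
  ylab("proportion")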
#Next I would like to look at proportions instead of counts for id and alignment. First, a contingency table.
tab_cnt<- table(comics$id, comics$align)
prop.table(tab_cnt)
##
## Bad Neutral Good
## No Dual 0.0296478079 0.0248059959 0.0422497844
## Public 0.1390860251 0.0625455993 0.1924122836
## Secret 0.2878556742 0.0598925516 0.1609073423
## Unknown 0.0004642833 0.0001326524 0.0000000000
#The biggest category is Secret and Bad, at 28.8% of characters (a bit under a third). There are no good characters with an unknown identity. Reformed criminals do not appear because I removed them earlier.
#Proportions conditional on the row. Each row sums to 1.
prop.table(tab_cnt, 1)
##
## Bad Neutral Good
## No Dual 0.3065844 0.2565158 0.4368999
## Public 0.3529709 0.1587275 0.4883016
## Secret 0.5659147 0.1177468 0.3163385
## Unknown 0.7777778 0.2222222 0.0000000
# This means about 57% of all secret characters are bad and about 49% of all public characters are good. Not surprising, even superheroes need recognition!
#Proportions conditional on the column. Each column sums to 1.
prop.table(tab_cnt, 2)
##
## Bad Neutral Good
## No Dual 0.064867218 0.168316832 0.106807512
## Public 0.304309970 0.424392439 0.486418511
## Secret 0.629806995 0.406390639 0.406773977
## Unknown 0.001015818 0.000900090 0.000000000
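#About 63% of all bad characters have a secret identity, and about 49% of all good characters are public. That is almost the same as the row proportion for Public and Good (0.486 vs. 0.488), which is interesting because the two numbers answer different questions; the rest of the cells differ.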
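#Row and column proportions answer different questions: the row version gives the share of good characters within an identity type, the column version gives the share of an identity type within good characters. The sketch below uses a made-up 2 x 2 table (toy_tab and its counts are hypothetical) where the two clearly differ.
toy_tab <- matrix(c(2, 6, 8, 4), nrow = 2,
                  dimnames = list(id = c("Public", "Secret"),
                                  align = c("Bad", "Good")))
# Joint proportions: every cell divided by the grand total
joint_prop <- prop.table(toy_tab)
# Margin 1: each row sums to 1
row_prop <- prop.table(toy_tab, 1)
# Margin 2: each column sums to 1
col_prop <- prop.table(toy_tab, 2)
# Share of Public characters that are Good (8 of 10)
row_prop["Public", "Good"]
# Share of Good characters that are Public (8 of 12)
col_prop["Public", "Good"]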
#Now I want to see the same for alignment vs. gender, using column proportions.
tab_cnt2 <- table(comics$align, comics$gender)
prop.table(tab_cnt2, 2) #condition on gender
##
## Female Male
## Bad 0.3210859 0.5336298
## Neutral 0.1706471 0.1269673
## Good 0.5082670 0.3394029
#51% of all female characters are good, as opposed to only 34% of all male characters. We saw the same pattern in the ggplot charts.
#Stacked bar chart using position = "fill", with ylab() added to label the proportions. Here I conditioned on id.
ggplot(comics, aes(x=id, fill=align)) +
geom_bar(position = "fill") + ylab("proportion")

#The greatest proportion of bad alignment is in the Unknown id group. It is even larger than the bad share of the Secret group, although Unknown contains only 9 characters.
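#A proportion chart hides how many characters sit behind each bar, so it can help to look at group sizes next to the row proportions. This is a sketch on made-up data (toy_id and the other names are hypothetical), where a tiny group shows the largest share.
toy_id <- data.frame(
  id    = c(rep("Secret", 6), rep("Unknown", 2)),
  align = c("Bad", "Bad", "Bad", "Good", "Good", "Neutral", "Bad", "Bad")
)
toy_id_tab <- table(toy_id$id, toy_id$align)
# Share of each alignment within an id group
row_share <- prop.table(toy_id_tab, 1)
# Number of characters in each id group
group_n <- rowSums(toy_id_tab)
round(row_share, 2)
group_n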