Ozer_Week2_Assignment1.utf8.md

#title: "Ozer-Week2-Assignment1"

# For exploratory data analysis I chose a categorical data set. I have been seeing the comic books data set in my Google searches since I started my MSDS, and I thought it would be interesting and fun to revisit the characters from my childhood and maybe understand why I liked certain ones more than others. For practice, I reviewed the MSDS 600 material as well as an R tutorials book, and watched a DataCamp tutorial.

#1. Provide a brief introduction to your data
#There are two major publishers of comics: Marvel and DC Comics. The data set contains the name of each character, his/her alignment (good or bad), identity status, physical attributes, whether the character is alive or deceased, the number of appearances, the date the character was first introduced, and the publisher. I downloaded the data set from Kaggle.

#2. Discuss what you hope to discover about your data or certain questions that you will solve for
#I would like to concentrate on id, alignment and gender. How does alignment differ by gender: which genders are typically bad and which are not? I would also like to know the association between id and gender, and perhaps add alignment to the mix: if you are a good superhero, would you reveal yourself? I used ggplot, counts and proportions for the analysis.

#The answers for these questions are below. My summary/findings and thoughts are at the end of each cell.
#3. Show a header or summary of your data and explain the columns where applicable
#4. Work in R to find your answers to the questions in part 2
#5. A summary of your findings, did you expect this? did you not? 
#6. Include your code either embedded in your write-up or as a separate attachment.

#load data from local; stringsAsFactors = TRUE keeps text columns as factors in R >= 4.0
comics <- read.csv(file = "/Users/jay/Desktop/CODE/Data Camp/Exploratory Data Analysis/DataSets/comics.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(comics)

##                                    name      id   align        eye
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##         hair gender  gsm             alive appearances first_appear
## 1 Brown Hair   Male <NA> Living Characters        4043       Aug-62
## 2 White Hair   Male <NA> Living Characters        3360       Mar-41
## 3 Black Hair   Male <NA> Living Characters        3061       Oct-74
## 4 Black Hair   Male <NA> Living Characters        2961       Mar-63
## 5 Blond Hair   Male <NA> Living Characters        2258       Nov-50
## 6    No Hair   Male <NA> Living Characters        2255       Nov-61
##   publisher
## 1    marvel
## 2    marvel
## 3    marvel
## 4    marvel
## 5    marvel
## 6    marvel

#I see that most of the variables are factors, which is how R represents categorical data. The only numerical variable is the aggregated number of "appearances", which is an integer data type.
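
#A quick way to confirm which columns are factors is to ask for the class of every column at once. This is a minimal sketch on a toy data frame (the name toy and its values are made up for illustration, not taken from comics.csv).

```r
# Toy data frame standing in for comics (illustrative values only)
toy <- data.frame(
  name = c("A", "B", "C"),
  align = c("Good", "Bad", "Good"),
  appearances = c(10L, 3L, 7L),
  stringsAsFactors = TRUE
)

# Class of every column at a glance
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are not factors
non_factor_cols <- names(toy)[!sapply(toy, is.factor)]
print(non_factor_cols)
```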

#Check the class. It is a df.
class(comics)

## [1] "data.frame"

str(comics)

## 'data.frame':    23272 obs. of  11 variables:
##  $ name        : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
##  $ id          : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
##  $ align       : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ eye         : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
##  $ hair        : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
##  $ gender      : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gsm         : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive       : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: Factor w/ 1606 levels "1935, October",..: 874 1278 1513 1296 1424 1432 1432 1364 1570 1432 ...
##  $ publisher   : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...

#result summaries
summary(comics)

##                                                     name      
##  'Spinner (Earth-616)                                 :    1  
##  \\"Spider-Girl\\" (Mutant\\/Spider Clone) (Earth-616):    1  
##  \\"Thumper\\" Morgan (Earth-616)                     :    1  
##  \\u00c4kr\\u00e4s (Earth-616)                        :    1  
##  107 (Earth-616)                                      :    1  
##  11-Ball (Earth-616)                                  :    1  
##  (Other)                                              :23266  
##        id                      align              eye       
##  No Dual:1788   Bad               :9615   Blue Eyes : 3064  
##  Public :6994   Good              :7468   Brown Eyes: 2803  
##  Secret :8698   Neutral           :2773   Black Eyes:  967  
##  Unknown:   9   Reformed Criminals:   3   Green Eyes:  904  
##  NA's   :5783   NA's              :3413   Red Eyes  :  716  
##                                           (Other)   : 1423  
##                                           NA's      :13395  
##          hair         gender                          gsm       
##  Black Hair:5329   Female: 5804   Bisexual Characters   :   29  
##  Brown Hair:3487   Male  :16421   Genderfluid Characters:    1  
##  Blond Hair:2326   Other :   68   Homosexual Characters :  120  
##  No Hair   :1176   NA's  :  979   Pansexual Characters  :    1  
##  White Hair:1100                  Transgender Characters:    2  
##  (Other)   :3316                  Transvestites         :    1  
##  NA's      :6538                  NA's                  :23118  
##                  alive        appearances              first_appear  
##  Deceased Characters: 5458   Min.   :   1.00   2010, December:   78  
##  Living Characters  :17808   1st Qu.:   1.00   Jun-92        :   72  
##  NA's               :    6   Median :   4.00   May-93        :   69  
##                              Mean   :  19.01   Sep-06        :   67  
##                              3rd Qu.:  10.00   Jan-94        :   66  
##                              Max.   :4043.00   (Other)       :22036  
##                              NA's   :1451      NA's          :  884  
##   publisher    
##  dc    : 6896  
##  marvel:16376  
##                
##                
##                
##                
##

#Some exploratory analysis to find out more about the data
#names of the variables
names(comics)

##  [1] "name"         "id"           "align"        "eye"         
##  [5] "hair"         "gender"       "gsm"          "alive"       
##  [9] "appearances"  "first_appear" "publisher"

#number of columns
ncol(comics)

## [1] 11

#number of rows
nrow(comics)

## [1] 23272

#Here I use levels() to see the unique values for a few of the variables. These are the variables I am interested in, and I would like to know their distinct values to set the scope of the analysis.
levels(comics$id) # Public and Secret were expected, but I would not have thought of "Unknown" and "No Dual". After googling, "No Dual" means No Dual Identity (Marvel only).

## [1] "No Dual" "Public"  "Secret"  "Unknown"

levels(comics$gender) # there are three gender levels.

## [1] "Female" "Male"   "Other"

levels(comics$publisher) # only DC and Marvel comics as publishers

## [1] "dc"     "marvel"

levels(comics$align) # this is interesting. I would have thought there would only be "Good" and "Bad", but there are also "Neutral" and "Reformed Criminals" levels

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

#I also see a variable called gsm. Googling GSM returns: whether the character is a gender or sexual minority (e.g. homosexual characters, bisexual characters). Here I also notice that levels() does not list NA.
levels(comics$gsm)

## [1] "Bisexual Characters"    "Genderfluid Characters"
## [3] "Homosexual Characters"  "Pansexual Characters"  
## [5] "Transgender Characters" "Transvestites"
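
#Because levels() does not list NA, missing values have to be counted separately. A minimal sketch on a toy factor (gsm_toy is a made-up stand-in, not the real column):

```r
# Toy factor with missing values (illustrative, not from the real file)
gsm_toy <- factor(c("Bisexual Characters", NA, "Homosexual Characters", NA, NA))

# NA is not listed among the levels
print(levels(gsm_toy))

# Count the missing values directly
na_count <- sum(is.na(gsm_toy))
print(na_count)

# useNA = "ifany" adds an <NA> entry to the counts when missing values exist
counts_with_na <- table(gsm_toy, useNA = "ifany")
print(counts_with_na)
```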

#The name gsm is not self-explanatory, so I am renaming the column
install.packages("reshape", repos = "http://cran.us.r-project.org")

## Updating HTML index of packages in '.Library'

## Making 'packages.html' ... done

library(reshape)

#new name is gendersex_minority
comics <- rename(comics, c(gsm="gendersex_minority"))
names(comics)

##  [1] "name"               "id"                 "align"             
##  [4] "eye"                "hair"               "gender"            
##  [7] "gendersex_minority" "alive"              "appearances"       
## [10] "first_appear"       "publisher"
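
#The same rename can be done in base R without loading an extra package, which also sidesteps the rename() name clash that appears once dplyr is attached. A minimal sketch on a toy data frame (toy is a made-up stand-in for comics):

```r
# Toy data frame with a column called gsm (illustrative values only)
toy <- data.frame(
  name = c("A", "B"),
  gsm = c(NA, "Bisexual Characters")
)

# Replace the column name in place by matching on the old name
names(toy)[names(toy) == "gsm"] <- "gendersex_minority"
print(names(toy))
```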

#next I would like to know the distribution of identity to the alignment. 
table(comics$id, comics$align)

##          
##            Bad Good Neutral Reformed Criminals
##   No Dual  474  647     390                  0
##   Public  2172 2930     965                  1
##   Secret  4493 2475     959                  1
##   Unknown    7    0       2                  0

#There are 7 unknown-identity characters aligned Bad and no Good ones. No surprises there. My identity would be unknown too if I chose to be bad.

#It is also no surprise that Secret-identity characters are mostly bad. Now I would like to see the totals for each column and row.
addmargins(table(comics$id, comics$align))

##          
##             Bad  Good Neutral Reformed Criminals   Sum
##   No Dual   474   647     390                  0  1511
##   Public   2172  2930     965                  1  6068
##   Secret   4493  2475     959                  1  7928
##   Unknown     7     0       2                  0     9
##   Sum      7146  6052    2316                  2 15516

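#There are more bad characters than good ones (7146 vs 6052). This is a massive surprise to me. If all the neutral ones decided to be good then good would win, but what are the chances of that? Note that these totals cover only the 15516 of 23272 characters with both id and alignment recorded; table() leaves out rows with NA.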
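
#By default table() counts only complete pairs, so the grand total can be well below the number of rows. A minimal sketch on toy vectors (id_toy and align_toy are made up for illustration) shows how useNA = "ifany" brings the missing values back into the table:

```r
# Toy vectors with some missing values (illustrative only)
id_toy <- factor(c("Secret", "Public", NA, "Secret", NA))
align_toy <- factor(c("Bad", "Good", "Bad", NA, NA))

# Default: any pair containing an NA is left out
tab_default <- table(id_toy, align_toy)

# NA gets its own row and column
tab_with_na <- table(id_toy, align_toy, useNA = "ifany")

print(sum(tab_default))   # complete pairs only
print(sum(tab_with_na))   # all observations
print(addmargins(tab_with_na))
```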

#Tables are great for insights, but I would like to see the distribution in graphs and will use ggplot2 for it. Here is the syntax so I can use it as a template: specify the dataset, the variables of interest, and the layers that describe how the variables are plotted. aes stands for aesthetics.
#ggplot(data, aes(x = var1, fill = var2)) + layer_name()
install.packages("ggplot2", repos = "http://cran.us.r-project.org")

## Updating HTML index of packages in '.Library'

## Making 'packages.html' ... done

library(ggplot2)

#Stacked bar charts - I think a stacked bar is a nice way to show the relationship between two categorical variables.
ggplot(comics, aes(x=id, fill=align)) +
  geom_bar()

#A couple of things stand out. Although only a few characters have an explicit "Unknown" identity, there are also many characters whose identity is missing, shown as NA; for those there is simply no data. The graph also makes it easier to see that bad characters generally hide their identities. Bad is not always the largest category, though: there are quite a few good characters with a public identity. This shows there is an association between a character's alignment and identity, meaning the distribution of alignment differs across identity types.

#The comics I used to read were predominantly about male superheroes. Next I am going to create a contingency table for alignment and gender to explore this further, and also a stacked bar chart.
table(comics$align, comics$gender)

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

ggplot(comics, aes(x=align, fill=gender)) +
  geom_bar()

#The chart shows that male characters dominate in number. From the contingency table I also see that female characters are typically good. There are very few records for the Other gender and for Reformed Criminals. Maybe I can remove these records from my set to get cleaner output.

#Before I start next section, I need more info.
head(comics,4)

##                                    name     id   align        eye
## 1             Spider-Man (Peter Parker) Secret    Good Hazel Eyes
## 2       Captain America (Steven Rogers) Public    Good  Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral  Blue Eyes
## 4   Iron Man (Anthony \\"Tony\\" Stark) Public    Good  Blue Eyes
##         hair gender gendersex_minority             alive appearances
## 1 Brown Hair   Male               <NA> Living Characters        4043
## 2 White Hair   Male               <NA> Living Characters        3360
## 3 Black Hair   Male               <NA> Living Characters        3061
## 4 Black Hair   Male               <NA> Living Characters        2961
##   first_appear publisher
## 1       Aug-62    marvel
## 2       Mar-41    marvel
## 3       Oct-74    marvel
## 4       Mar-63    marvel

nrow(comics)

## [1] 23272

install.packages("dplyr", repos = "http://cran.us.r-project.org")

## Updating HTML index of packages in '.Library'

## Making 'packages.html' ... done

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:reshape':
## 
##     rename
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

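# Remove the "Reformed Criminals" level from align and "Other" from gender, trying out & in dplyr. Note that filter() also drops rows where align or gender is NA, which is why the row count falls from 23272 to 19068.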
comics <- comics %>%
  filter(align != "Reformed Criminals" & gender != "Other") %>%
  droplevels()
nrow(comics)

## [1] 19068

comics %>%
  filter(align == "Reformed Criminals" | gender == "Other")

##  [1] name               id                 align             
##  [4] eye                hair               gender            
##  [7] gendersex_minority alive              appearances       
## [10] first_appear       publisher         
## <0 rows> (or 0-length row.names)

#Works! Both "&" and "|" were used.
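
#A comparison such as != returns NA when the value is missing, and rows with an NA condition are not kept. A minimal base R sketch on a toy data frame (toy is made up for illustration; subset() is used here as a stand-in because it treats an NA condition the same way) shows the effect and one way to keep the NA rows if that is what is wanted:

```r
# Toy data frame with missing values (illustrative only)
toy <- data.frame(
  align  = c("Good", "Reformed Criminals", NA, "Bad"),
  gender = c("Male", "Female", "Female", NA)
)

# Comparisons against NA give NA, not TRUE or FALSE
cond <- toy$align != "Reformed Criminals" & toy$gender != "Other"
print(cond)

# Rows where the condition is NA are dropped along with the FALSE rows
kept_default <- subset(toy, cond)

# Explicitly treat NA as "keep" when only the named levels should go
keep_na <- (is.na(toy$align) | toy$align != "Reformed Criminals") &
  (is.na(toy$gender) | toy$gender != "Other")
kept_with_na <- toy[keep_na, ]

print(nrow(kept_default))
print(nrow(kept_with_na))
```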

#Same as above but with faceting according to gender to have a better look.
ggplot(comics, aes(x=align)) +
  geom_bar() +
  facet_wrap(~gender)

#Reformed Criminals and Other are gone. Male characters outnumber female characters in every alignment category. Males are most often bad and females are most often good.

#Without gender. Reordering the factor levels puts Neutral in the middle; this shows how factor level order affects ggplot.
# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()
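
#Level order is a property of the factor itself, so it can be checked without plotting. A minimal sketch on a toy factor (align_toy is made up for illustration), including the pitfall that a label left out of levels silently becomes NA:

```r
# Toy factor; levels are alphabetical by default
align_toy <- factor(c("Good", "Bad", "Neutral", "Bad"))
print(levels(align_toy))

# Re-create the factor with an explicit level order
align_ordered <- factor(align_toy, levels = c("Bad", "Neutral", "Good"))
print(levels(align_ordered))
print(table(align_ordered))

# Any label missing from `levels` is turned into NA without a warning
align_dropped <- factor(align_toy, levels = c("Bad", "Good"))
print(sum(is.na(align_dropped)))
```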

#Same as above but side by side for ease of reading. Also trying out theme() to rotate the axis text 90 degrees.
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

#Some things I can say: there are more male characters than female characters. Among male characters "Bad" is the most common alignment, while among female characters "Good" is. Even so, among Good characters males are the most common.

#Now looking at proportions instead of just counts for insights, starting with id and alignment. First a contingency table.
tab_cnt <- table(comics$id, comics$align)
prop.table(tab_cnt)

##          
##                    Bad      Neutral         Good
##   No Dual 0.0296478079 0.0248059959 0.0422497844
##   Public  0.1390860251 0.0625455993 0.1924122836
##   Secret  0.2878556742 0.0598925516 0.1609073423
##   Unknown 0.0004642833 0.0001326524 0.0000000000

#The biggest category is Secret and Bad, at 28.8% of characters with a recorded id and alignment (a little under 1/3). There are no Unknown Good characters. Reformed Criminals do not appear because I removed them earlier.

#Proportions conditional on row. Each row sums to 1
prop.table(tab_cnt, 1)

##          
##                 Bad   Neutral      Good
##   No Dual 0.3065844 0.2565158 0.4368999
##   Public  0.3529709 0.1587275 0.4883016
##   Secret  0.5659147 0.1177468 0.3163385
##   Unknown 0.7777778 0.2222222 0.0000000

# This means 57% of all Secret characters are bad and 49% of all Public characters are good. Not surprising, even superheroes need recognition!

#Proportions conditional on column. Each column sums to 1
prop.table(tab_cnt, 2)

##          
##                   Bad     Neutral        Good
##   No Dual 0.064867218 0.168316832 0.106807512
##   Public  0.304309970 0.424392439 0.486418511
##   Secret  0.629806995 0.406390639 0.406773977
##   Unknown 0.001015818 0.000900090 0.000000000

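#~63% of all bad characters have a secret identity. About 49% of all good characters are public, which is nearly the same as the row proportion (48.6% vs 48.8%). That closeness is a coincidence: the two numbers answer different questions, and the other cells differ.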
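
#To see why the row and column proportions answer different questions, here is a minimal sketch that rebuilds a small table by hand. The counts are copied from the first id by alignment table above (before filtering, without the Reformed Criminals column), so the proportions will differ slightly from the printed output; the names tab, joint, by_row and by_col are only for illustration.

```r
# Counts typed in from the earlier contingency table (Bad, Good, Neutral)
tab <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

joint  <- prop.table(tab)     # every cell divided by the grand total
by_row <- prop.table(tab, 1)  # each row sums to 1: share of each alignment within an id
by_col <- prop.table(tab, 2)  # each column sums to 1: share of each id within an alignment

# Share of Public characters that are Good
print(round(by_row["Public", "Good"], 3))
# Share of Good characters that are Public
print(round(by_col["Public", "Good"], 3))
```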

#Now I want to see the same for alignment vs gender, using column proportions.
tab_cnt2 <- table(comics$align, comics$gender)
prop.table(tab_cnt2, 2) #condition on the gender variable

##          
##              Female      Male
##   Bad     0.3210859 0.5336298
##   Neutral 0.1706471 0.1269673
##   Good    0.5082670 0.3394029

#51% of all female characters are good, as opposed to only 34% of all male characters. We saw the same via ggplot!

#Stacked bar chart using position = "fill" and ylab() to display proportions. Here I conditioned on id.
ggplot(comics, aes(x=id, fill=align)) +
  geom_bar(position = "fill") + ylab("proportion")

#The greatest proportion of Bad alignment is among Unknown ids. It is even larger than for Secret ids, although the Unknown group has only 9 characters.
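
#A filled bar chart hides group sizes, so it helps to print the row totals next to the proportions. A minimal sketch using the Secret and Unknown counts from the first id by alignment table above (the names tab, n_per_id and share_bad are only for illustration):

```r
# Counts typed in from the earlier contingency table
tab <- rbind(
  Secret  = c(Bad = 4493, Good = 2475, Neutral = 959),
  Unknown = c(Bad = 7, Good = 0, Neutral = 2)
)

# Group size and share of Bad characters for each id
n_per_id <- rowSums(tab)
share_bad <- tab[, "Bad"] / n_per_id

# Proportions shown together with the number of characters behind them
print(data.frame(n = n_per_id, share_bad = round(share_bad, 3)))
```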