Elina Azrilyan

Week 3 Final Project

I am going to explore a dataset that contains information about Titanic passengers. It includes class, age, and gender information as well as survival status. My hypothesis is that survival rate will be correlated with passenger class. I also expect the survival rate for women to be higher than for men.
  1. Data Exploration:
require(ggplot2)
## Loading required package: ggplot2
MyData <- read.csv(file="https://raw.githubusercontent.com/che10vek/R-HW2/master/titanic.csv", header=TRUE, sep=",")
summary(MyData)
##        X                class         age          sex      survived 
##  Min.   :   1.0   1st class:325   adults:1207   man  :869   no :817  
##  1st Qu.: 329.8   2nd class:285   child : 109   women:447   yes:499  
##  Median : 658.5   3rd class:706                                      
##  Mean   : 658.5                                                      
##  3rd Qu.: 987.2                                                      
##  Max.   :1316.0
table(MyData$sex, MyData$survived)
##        
##          no yes
##   man   694 175
##   women 123 324
table(MyData$class, MyData$survived)
##            
##              no yes
##   1st class 122 203
##   2nd class 167 118
##   3rd class 528 178
Most of my assumptions seem to be confirmed by the tables above. The survival rate among women appears to be much higher than among men: 324 of 447 women survived, compared with 175 of 869 men. However, my assumption about class seems to be not entirely correct. The 1st class and 3rd class survival rates are in line with my assumptions, but 2nd class shows a different pattern.
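Raw counts are hard to compare when the groups differ in size, so a survival rate within each group may be easier to read. The following is a minimal sketch, assuming the counts printed above are typed in by hand; the names `sex_tbl`, `class_tbl`, `sex_rates`, and `class_rates` are made up for this illustration and are not part of the original analysis.

```r
# Counts copied by hand from the table() output above (all passengers)
sex_tbl <- matrix(c(694, 175,
                    123, 324),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(sex = c("man", "women"),
                                  survived = c("no", "yes")))
class_tbl <- matrix(c(122, 203,
                      167, 118,
                      528, 178),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(class = c("1st class", "2nd class", "3rd class"),
                                    survived = c("no", "yes")))

# margin = 1 divides each row by its row total,
# so every row shows the percentages within that group
sex_rates <- round(prop.table(sex_tbl, margin = 1) * 100, 1)
class_rates <- round(prop.table(class_tbl, margin = 1) * 100, 1)
print(sex_rates)
print(class_rates)
```

With the data already loaded, the same idea should work directly on the tables, for example `prop.table(table(MyData$sex, MyData$survived), margin = 1)`.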
  2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together)
NewData <- data.frame(MyData$class, MyData$age, MyData$sex, MyData$survived)
colnames(NewData) <- c("Class","Age", "Gender", "Survival")
head(NewData)
##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes
AdultData <- NewData[which(NewData$Age=='adults'),] 
head(AdultData)
##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes
I have excluded children from my data because that data is making me sad.
AdultData$SurvivalNum <- ifelse(AdultData$Survival=='yes', 1, 0)
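One reason a 0/1 column like `SurvivalNum` can be handy is that its mean equals the share of survivors. The sketch below is only an illustration: `adults` is a hypothetical stand-in for `AdultData`, rebuilt from the adult gender counts reported in this section, and `rate_by_gender` is a name introduced here.

```r
# Hypothetical stand-in for AdultData: one row per adult passenger,
# rebuilt from the adult counts by gender reported in this section
group_sizes <- c(659, 106, 146, 296)
adults <- data.frame(
  Gender   = rep(c("man", "women", "man", "women"), times = group_sizes),
  Survival = rep(c("no", "no", "yes", "yes"), times = group_sizes)
)

# Same recoding as in the report: 1 = survived, 0 = did not survive
adults$SurvivalNum <- ifelse(adults$Survival == "yes", 1, 0)

# The mean of a 0/1 column is the fraction of 1s,
# so this gives the survival rate within each gender
rate_by_gender <- tapply(adults$SurvivalNum, adults$Gender, mean)
print(round(rate_by_gender, 3))
```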

myTable <- as.data.frame(table(AdultData$Class, AdultData$SurvivalNum))
myTable3 <- as.data.frame(table(AdultData$Gender, AdultData$Survival))
myTable3
##    Var1 Var2 Freq
## 1   man   no  659
## 2 women   no  106
## 3   man  yes  146
## 4 women  yes  296

The tables below show each count as a percentage of the 1,207 adult passengers in my dataset.

MyTable2 <- myTable[which(myTable$Var2=='1'),] 
MyTable3 <- myTable[which(myTable$Var2=='0'),] 
MyTable2Gen <- myTable3[which(myTable3$Var2=='yes'),] 
MyTable3Gen <- myTable3[which(myTable3$Var2=='no'),] 

PercMyTable2 <- myTable
# Divide by the number of adult passengers (1207) instead of hard-coding it
PercMyTable2$Percentage <- PercMyTable2$Freq / nrow(AdultData) * 100
PercMyTable2
##        Var1 Var2 Freq Percentage
## 1 1st class    0  122  10.107705
## 2 2nd class    0  167  13.835957
## 3 3rd class    0  476  39.436620
## 4 1st class    1  197  16.321458
## 5 2nd class    1   94   7.787904
## 6 3rd class    1  151  12.510356
PercMyTable3 <- myTable3
# Divide by the number of adult passengers (1207) instead of hard-coding it
PercMyTable3$Percentage <- PercMyTable3$Freq / nrow(AdultData) * 100
PercMyTable3
##    Var1 Var2 Freq Percentage
## 1   man   no  659  54.598177
## 2 women   no  106   8.782104
## 3   man  yes  146  12.096106
## 4 women  yes  296  24.523612
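The percentages above are shares of all 1,207 adults, so they mix group size with survival. A within-class percentage could complement them. This is a minimal sketch, assuming the adult class counts are copied by hand from the output above; `class_counts`, `ClassTotal`, and `PctWithinClass` are names introduced only for this example.

```r
# Adult counts by class, copied from the PercMyTable2 output above
class_counts <- data.frame(
  Var1 = rep(c("1st class", "2nd class", "3rd class"), times = 2),
  Var2 = rep(c(0, 1), each = 3),
  Freq = c(122, 167, 476, 197, 94, 151)
)

# ave() returns, for every row, the total Freq of that row's class
class_counts$ClassTotal <- ave(class_counts$Freq, class_counts$Var1, FUN = sum)

# Share of each outcome within its own class, in percent
class_counts$PctWithinClass <- round(class_counts$Freq / class_counts$ClassTotal * 100, 1)
print(class_counts)
```

Rows with `Var2` equal to 1 then give the adult survival rate for each class, which can be compared across classes without worrying about how many passengers each class had.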
  3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
tbl <- with(AdultData, table(Gender, SurvivalNum))
barplot(tbl, main="Survival by Gender", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

tbl2 <- with(AdultData, table(Class, SurvivalNum))
barplot(tbl2, main="Survival by Class", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

hist(MyTable2$Freq, col='red')

ggplot(MyTable2Gen, aes(x=Var1, y=Freq))+geom_point()+ xlab("Gender") + ylab("Survived Count") + ggtitle("Survivals by Gender")
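Since the scatter plot above has only one point per gender, a filled bar chart might show the same comparison more clearly. This is a sketch, assuming the adult counts from the `myTable3` output are entered by hand; `adult_counts` and `p` are names made up for this example.

```r
library(ggplot2)

# Adult counts by gender and survival, copied from the myTable3 output
adult_counts <- data.frame(
  Gender   = c("man", "women", "man", "women"),
  Survival = c("no", "no", "yes", "yes"),
  Freq     = c(659, 106, 146, 296)
)

# position = "fill" rescales each bar to a height of 1,
# so the colored segments show shares within each gender
p <- ggplot(adult_counts, aes(x = Gender, y = Freq, fill = Survival)) +
  geom_col(position = "fill") +
  xlab("Gender") + ylab("Share of adult passengers") +
  ggtitle("Adult Survival Share by Gender")
print(p)
```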