HW3

Elina Azrilyan

Week 3 Final Project

I am going to explore a dataset that includes information about Titanic passengers: class, age, and gender, as well as whether each passenger survived. My hypothesis is that there will be a correlation between the survival rate and the class of passenger. I am also going to assume that the survival rate for women will be higher than for men.

Data Exploration:

require(ggplot2)

## Loading required package: ggplot2

MyData <- read.csv(file="https://raw.githubusercontent.com/che10vek/R-HW2/master/titanic.csv", header=TRUE, sep=",")
summary(MyData)

##        X                class         age          sex      survived 
##  Min.   :   1.0   1st class:325   adults:1207   man  :869   no :817  
##  1st Qu.: 329.8   2nd class:285   child : 109   women:447   yes:499  
##  Median : 658.5   3rd class:706                                      
##  Mean   : 658.5                                                      
##  3rd Qu.: 987.2                                                      
##  Max.   :1316.0

table(MyData$sex, MyData$survived)

##        
##          no yes
##   man   694 175
##   women 123 324

table(MyData$class, MyData$survived)

##            
##              no yes
##   1st class 122 203
##   2nd class 167 118
##   3rd class 528 178

Most of my assumptions seem to be confirmed by the analysis above. The survival rate among women appears to be much higher than among men (324 of 447 women survived, versus 175 of 869 men). However, my assumption about classes seems to be not entirely correct. 2nd class seems to be displaying different data: 118 of its 285 passengers survived, which falls between 1st class (203 of 325) and 3rd class (178 of 706). 1st class and 3rd class survival rates are in line with my assumptions.
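The raw counts above are hard to compare because the classes differ in size. A minimal sketch of turning them into per-class survival rates with `prop.table` follows. The matrix `class_counts` is a hypothetical stand-in that I typed in from the table output above so the snippet runs without downloading the file.

```r
# Counts copied from the table(MyData$class, MyData$survived) output above.
# class_counts is an illustrative name, not an object from the analysis.
class_counts <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(class = c("1st class", "2nd class", "3rd class"),
                  survived = c("no", "yes"))
)

# margin = 1 divides each row by its row total, so each class sums to 100%
class_rates <- prop.table(class_counts, margin = 1) * 100
print(round(class_rates, 1))
```

With the data frame loaded, the same idea would be `prop.table(table(MyData$class, MyData$survived), margin = 1)`.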

Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

NewData <- data.frame(MyData$class, MyData$age, MyData$sex, MyData$survived)
colnames(NewData) <- c("Class","Age", "Gender", "Survival")
head(NewData)

##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes

AdultData <- NewData[which(NewData$Age=='adults'),] 
head(AdultData)

##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes

I have excluded children from my data because that data is making me sad.

AdultData$SurvivalNum <- ifelse(AdultData$Survival=='yes', 1, 0)

myTable <- as.data.frame(table(AdultData$Class, AdultData$SurvivalNum))
myTable3 <- as.data.frame(table(AdultData$Gender, AdultData$Survival))
myTable3

##    Var1 Var2 Freq
## 1   man   no  659
## 2 women   no  106
## 3   man  yes  146
## 4 women  yes  296

The tables below show each count as a percentage of the 1,207 adult passengers in my dataset.

MyTable2 <- myTable[which(myTable$Var2=='1'),] 
MyTable3 <- myTable[which(myTable$Var2=='0'),] 
MyTable2Gen <- myTable3[which(myTable3$Var2=='yes'),] 
MyTable3Gen <- myTable3[which(myTable3$Var2=='no'),] 

PercMyTable2<-myTable
PercMyTable2$Percentage<-PercMyTable2$Freq/nrow(AdultData)*100
PercMyTable2

##        Var1 Var2 Freq Percentage
## 1 1st class    0  122  10.107705
## 2 2nd class    0  167  13.835957
## 3 3rd class    0  476  39.436620
## 4 1st class    1  197  16.321458
## 5 2nd class    1   94   7.787904
## 6 3rd class    1  151  12.510356

PercMyTable3<-myTable3
PercMyTable3$Percentage<-PercMyTable3$Freq/nrow(AdultData)*100
PercMyTable3

##    Var1 Var2 Freq Percentage
## 1   man   no  659  54.598177
## 2 women   no  106   8.782104
## 3   man  yes  146  12.096106
## 4 women  yes  296  24.523612
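The percentages above are shares of all adults, so they mix group size with survival rate. A minimal sketch of adding a within-gender share next to them is shown here. The data frame `gender_counts` and the column `WithinGender` are hypothetical names, and the counts are typed in from the output above.

```r
# Counts copied from the myTable3 output above (adults only).
gender_counts <- data.frame(
  Var1 = c("man", "women", "man", "women"),
  Var2 = c("no", "no", "yes", "yes"),
  Freq = c(659, 106, 146, 296)
)

# Share of all adults: dividing by sum(Freq) avoids hard-coding the total
gender_counts$Percentage <- gender_counts$Freq / sum(gender_counts$Freq) * 100

# Share within each gender: ave() returns the per-gender total on every row
gender_totals <- ave(gender_counts$Freq, gender_counts$Var1, FUN = sum)
gender_counts$WithinGender <- gender_counts$Freq / gender_totals * 100

print(gender_counts)
```

The `WithinGender` column for the `yes` rows is the survival rate for each gender, which is the number the hypothesis is really about.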

Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

tbl <- with(AdultData, table(Gender, SurvivalNum))
barplot(tbl, main="Survival by Gender", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

tbl2 <- with(AdultData, table(Class, SurvivalNum))
barplot(tbl2, main="Survival by Class", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

hist(MyTable2$Freq, col='red')

ggplot(MyTable2Gen, aes(x=Var1, y=Freq))+geom_point()+ xlab("Gender") + ylab("Survived Count") + ggtitle("Survivals by Gender")

ggplot(MyTable3Gen, aes(x=Var1, y=Freq))+geom_point()+ xlab("Gender") + ylab("Casualty Count") + ggtitle("Casualties by Gender")

ggplot(MyTable2, aes(x=Var1, y=Freq))+geom_point()+ xlab("Class") + ylab("Survived Count") + ggtitle("Survivals by Class")

ggplot(MyTable3, aes(x=Var1, y=Freq))+geom_point()+ xlab("Class") + ylab("Casualty Count") + ggtitle("Casualties by Class")
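Since class and survival are both categorical, a mosaic plot is one possible alternative to the point plots above. This is only a sketch using base R's `mosaicplot`. The table `adult_counts` is a hypothetical stand-in built from the adult counts printed earlier, so it runs without the CSV.

```r
# Adult counts by class, copied from the PercMyTable2 output above.
# adult_counts is an illustrative name, not an object from the analysis.
adult_counts <- as.table(matrix(
  c(122, 197,
    167,  94,
    476, 151),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st class", "2nd class", "3rd class"),
                  Survival = c("Did Not Survive", "Survived"))
))

# Tile width reflects class size; tile height reflects the share surviving
mosaicplot(adult_counts, main = "Adult Survival by Class", color = TRUE)
```

With the data frame loaded, `mosaicplot(table(AdultData$Class, AdultData$Survival))` should draw the same picture.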

Most of my assumptions seem to be confirmed by the analysis above. The survival rate among women appears to be much higher than among men. It looks like my assumption that chivalry was not dead at that time was right. However, my assumption about classes seems to be not entirely correct. 2nd class seems to be displaying different results: its survivor count falls between the other two classes, but fewer 2nd class adults survived than died (94 versus 167), and 1st class is the only class where survivors outnumber casualties (197 versus 122). 1st class and 3rd class survival rates are in line with my assumptions. In retrospect, the dataset that I picked made it very challenging to include graphs in my analysis, since the data was largely not numeric.
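One possible way to check the class hypothesis more formally is a chi-squared test of independence. It is added here for illustration and is not part of the analysis above. The sketch assumes the adult counts printed earlier, and `class_by_survival` is a hypothetical name.

```r
# Adult counts by class, copied from the PercMyTable2 output above.
class_by_survival <- matrix(
  c(122, 197,
    167,  94,
    476, 151),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st class", "2nd class", "3rd class"),
                  Survival = c("no", "yes"))
)

# Null hypothesis: survival is independent of class
class_test <- chisq.test(class_by_survival)
print(class_test)
```

A small p-value would suggest that survival and class are related, which is what the hypothesis at the start predicts. It does not say which classes differ, so the per-class rates are still needed to read the direction.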

BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career. Done!