Purpose:

Analysis of passenger survival in the Titanic data set (DataCamp tutorial)

Steps

Step 1: Load training and testing data and check the data structure

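# stringsAsFactors = TRUE reproduces the Factor columns shown below
# (since R 4.0 the default is FALSE, which would give character columns)
train <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"), stringsAsFactors = TRUE)
test <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"), stringsAsFactors = TRUE)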
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
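
The `Factor` columns in the output above depend on how the CSV is read. As a minimal sketch, the snippet below reads a tiny inline CSV (invented rows held in the hypothetical variable `csv_text`, not the Titanic files) to show how `stringsAsFactors = TRUE` yields factors, and adds a per-column count of missing values as one more structure check. It assumes nothing beyond base R.

```r
# Tiny inline CSV with invented rows (not the Titanic data)
csv_text <- "Name,Sex,Age\nA,male,22\nB,female,NA\nC,female,38"

# With stringsAsFactors = TRUE, text columns are read as factors
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(as_fct)

# Factor levels are sorted alphabetically, as in the str() output above
sex_levels <- levels(as_fct$Sex)
print(sex_levels)

# Number of missing values in each column
na_counts <- colSums(is.na(as_fct))
print(na_counts)
```

Running the same `colSums(is.na(...))` call on `train` and `test` would show which columns need care before tabulating.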

Step 2: Check how many people survived, in what proportion, and how survival differed by sex

table(train$Survived)
## 
##   0   1 
## 549 342
prop.table(table(train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
Sex_tab <- table(train$Sex, train$Survived)
# margin 1: row proportions (outcome shares within each sex)
pSex_1 <- prop.table(Sex_tab, 1)
# margin 2: column proportions (sex shares within each outcome)
pSex_2 <- prop.table(Sex_tab, 2)
colnames(Sex_tab) <- c("Dead", "Survive")
colnames(pSex_1) <- c("Dead", "Survive")
colnames(pSex_2) <- c("Dead", "Survive")
Sex_tab
##         
##          Dead Survive
##   female   81     233
##   male    468     109
pSex_1
##         
##               Dead   Survive
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081
pSex_2
##         
##               Dead   Survive
##   female 0.1475410 0.6812865
##   male   0.8524590 0.3187135

Conclusion: For the training data,

(1) 549 passengers (62%) died and 342 (38%) survived. (2) Among females, 81 (26%) died and 233 (74%) survived; among males, 468 (81%) died and 109 (19%) survived. (3) Of those who died, 15% were female and 85% were male; of those who survived, 68% were female and 32% were male.
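
Points (2) and (3) come from the two margins of `prop.table`. As a minimal sketch, the snippet below rebuilds the sex-by-survival table from the four counts printed in this step, so it runs without downloading the data. The names `sex_counts`, `row_prop`, `col_prop` and `manual_row_prop` are introduced here for illustration only.

```r
# Rebuild the sex-by-survival counts printed in this step (no download needed)
sex_counts <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"),
                  Outcome = c("Dead", "Survive"))
)

# margin = 1: each row sums to 1 (outcome shares within each sex)
row_prop <- prop.table(sex_counts, margin = 1)

# margin = 2: each column sums to 1 (sex shares within each outcome)
col_prop <- prop.table(sex_counts, margin = 2)

# Margin 1 is the same as dividing every row by its row total
manual_row_prop <- sex_counts / rowSums(sex_counts)

print(round(row_prop, 3))
print(round(col_prop, 3))
```

Reading the row proportions answers "what happened to women versus men", while the column proportions answer "who made up the dead and the survivors".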

Step 3: Check whether age plays a role in survival

# Start from NA so passengers with a missing Age stay unclassified
train$child <- NA
train$child[train$Age < 18] <- 1
train$child[train$Age >= 18] <- 0
Chi_tab <- table(train$child, train$Survived)
pChi_1 <- prop.table(Chi_tab, 1)
pChi_2 <- prop.table(Chi_tab, 2)
rownames(Chi_tab) <- c("Adult", "Child")
rownames(pChi_1) <- c("Adult", "Child")
rownames(pChi_2) <- c("Adult", "Child")
colnames(Chi_tab) <- c("Dead", "Survive")
colnames(pChi_1) <- c("Dead", "Survive")
colnames(pChi_2) <- c("Dead", "Survive")
Chi_tab
##        
##         Dead Survive
##   Adult  372     229
##   Child   52      61
pChi_1
##        
##              Dead   Survive
##   Adult 0.6189684 0.3810316
##   Child 0.4601770 0.5398230
pChi_2
##        
##              Dead   Survive
##   Adult 0.8773585 0.7896552
##   Child 0.1226415 0.2103448

Conclusion: For the training data (the tables cover the 714 passengers with a known age; the other 177 of the 891 are excluded),

(1) 52 children died and 61 survived; 372 adults died and 229 survived. (2) Among adults, 62% died and 38% survived; among children, 46% died and 54% survived. (3) Of those who died, 88% were adults and 12% were children; of those who survived, 79% were adults and 21% were children.
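
The child indicator is only defined where `Age` is known, so passengers with a missing age drop out of the tables in this step. As a minimal sketch of the same derivation written with `ifelse`, the snippet below uses a small data frame of invented values (`toy`, not the Titanic data) to make the missing-value handling visible.

```r
# Toy data with invented values (not the Titanic data); one age is missing
toy <- data.frame(
  Age      = c(22, 2, NA, 35, 14, 54),
  Survived = c(0, 1, 1, 0, 1, 0)
)

# ifelse() propagates NA: a missing Age gives a missing indicator
toy$child <- ifelse(toy$Age < 18, 1, 0)

# By default table() drops rows where child is NA
tab_default <- table(toy$child, toy$Survived)

# useNA = "ifany" adds a row for passengers whose age is unknown
tab_with_na <- table(toy$child, toy$Survived, useNA = "ifany")

print(tab_default)
print(tab_with_na)
```

Passing `useNA = "ifany"` to the `table` call on `train` would show how the passengers with unknown age split between the two outcomes.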