Who survived the titanic?

On 15 April 1912, during the voyage of “unsinkable” RMS Titanic, the ship sank after colliding with an iceberg, causing death over 1500 of passangers and crews. while there are many factor that can contribute to surviving the incident, we are still able to group these people to get the general idea who is likely to survive by various attributes.

Data Input, and quick summary of the data

library(scales)
#Data Input
pob <- read.csv('train.csv')
#Assigning the right data type to certain attributes.
pob$Sex <- factor(pob$Sex)
pob$Embarked <- factor(pob$Embarked)
pob$Pclass <- factor(pob$Pclass)
pob$Survived <- factor(pob$Survived)
str(pob)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
summary(pob)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 

Data Dictionary

Dealing with missing values, and creating new attributes

We have missing values in Age and Cabin columns, but i want us to focus on Age, because we can group these people based on their age, so how we supposed to group these people with unknown Age values?, we also have SibSp and Parch columns, based on these we can estimate their age and assign them to the right age group. so lets create a new Column named ‘AgeGroup’ : Young (16 and under), Adult (17-64), Elder(65+)

pob$AgeGroup[pob$Age<18] <- 'Young'
pob$AgeGroup[pob$Age>=18 & pob$Age<65] <- 'Adult'
pob$AgeGroup[pob$Age>=65] <- 'Elder'
summary(pob$AgeGroup=='')
##    Mode   FALSE    NA's 
## logical     714     177

as expected, there are missing values on AgeGroup as well because we have missing values on Age. Lets subset the data based on other feature along with missing age or agegroup

Mrs. title

Assuming legal age of marriage is above 18, they can be grouped as an adult.

library(stringr)
pob[is.na(pob$AgeGroup) & str_detect(pob$Name, pattern = "Mrs[.]"), c("AgeGroup")] <- 'Adult'

Master Title

Master is a young man too young to hold a ‘Mr.’ Title.

pob[is.na(pob$AgeGroup) & str_detect(pob$Name, pattern = "Master[.]"), c("AgeGroup")] <- 'Young'

Ship Crew

Assuming all crews are at legal working age, they can be grouped as an adult.

pob[is.na(pob$AgeGroup) & pob$Fare==0, c("AgeGroup")] <- 'Adult'

Passengers who board the titanic alone

we can filter the passengers data based on ‘SibSp’ and ‘Parch’ Columns, if both column has 0 value, meaning they are traveling alone, most likely they are an adult, a minor and elder should be traveling alongside their guardian (hopefully).

pob[is.na(pob$AgeGroup) & (pob$SibSp==0 & pob$Parch==0), c('AgeGroup')] <- 'Adult'

Lets see who else has not assigned into AgeGroup.

Child traveling along their family

pob[is.na(pob$AgeGroup) & pob$SibSp==1 & pob$Parch==2,]
##     PassengerId Survived Pclass                                     Name    Sex
## 784         784        0      3                   Johnston, Mr. Andrew G   male
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
##     Age SibSp Parch     Ticket  Fare Cabin Embarked AgeGroup
## 784  NA     1     2 W./C. 6607 23.45              S     <NA>
## 889  NA     1     2 W./C. 6607 23.45              S     <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp==1 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup)& pob$SibSp>1 & pob$Parch==2,]
##     PassengerId Survived Pclass                              Name    Sex Age
## 181         181        0      3      Sage, Miss. Constance Gladys female  NA
## 202         202        0      3               Sage, Mr. Frederick   male  NA
## 325         325        0      3          Sage, Mr. George John Jr   male  NA
## 793         793        0      3           Sage, Miss. Stella Anna female  NA
## 847         847        0      3          Sage, Mr. Douglas Bullen   male  NA
## 864         864        0      3 Sage, Miss. Dorothy Edith "Dolly" female  NA
##     SibSp Parch   Ticket  Fare Cabin Embarked AgeGroup
## 181     8     2 CA. 2343 69.55              S     <NA>
## 202     8     2 CA. 2343 69.55              S     <NA>
## 325     8     2 CA. 2343 69.55              S     <NA>
## 793     8     2 CA. 2343 69.55              S     <NA>
## 847     8     2 CA. 2343 69.55              S     <NA>
## 864     8     2 CA. 2343 69.55              S     <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp>1 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup) & pob$SibSp==0 & pob$Parch==2,]
##     PassengerId Survived Pclass               Name    Sex Age SibSp Parch
## 594         594        0      3 Bourke, Miss. Mary female  NA     0     2
##     Ticket Fare Cabin Embarked AgeGroup
## 594 364848 7.75              Q     <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp==0 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup) & pob$SibSp>1, ]
##     PassengerId Survived Pclass                    Name    Sex Age SibSp Parch
## 49           49        0      3     Samaan, Mr. Youssef   male  NA     2     0
## 230         230        0      3 Lefebre, Miss. Mathilde female  NA     3     1
## 302         302        1      3      McCoy, Mr. Bernard   male  NA     2     0
## 331         331        1      3      McCoy, Miss. Agnes female  NA     2     0
## 410         410        0      3      Lefebre, Miss. Ida female  NA     3     1
## 486         486        0      3  Lefebre, Miss. Jeannie female  NA     3     1
##     Ticket    Fare Cabin Embarked AgeGroup
## 49    2662 21.6792              C     <NA>
## 230   4133 25.4667              S     <NA>
## 302 367226 23.2500              Q     <NA>
## 331 367226 23.2500              Q     <NA>
## 410   4133 25.4667              S     <NA>
## 486   4133 25.4667              S     <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp>1, c('AgeGroup')] <-'Young'

Rest of the passengers

we only have few missing value left on our data, lets go through it.

pob[is.na(pob$AgeGroup),]
##     PassengerId Survived Pclass                                 Name    Sex Age
## 47           47        0      3                    Lennon, Mr. Denis   male  NA
## 110         110        1      3                  Moran, Miss. Bertha female  NA
## 129         129        1      3                    Peter, Miss. Anna female  NA
## 215         215        0      3                  Kiernan, Mr. Philip   male  NA
## 241         241        0      3                Zabour, Miss. Thamine female  NA
## 242         242        1      3       Murphy, Miss. Katherine "Kate" female  NA
## 365         365        0      3                  O'Brien, Mr. Thomas   male  NA
## 452         452        0      3      Hagland, Mr. Ingvald Olai Olsen   male  NA
## 491         491        0      3 Hagland, Mr. Konrad Mathias Reiersen   male  NA
## 613         613        1      3          Murphy, Miss. Margaret Jane female  NA
## 640         640        0      3           Thorneycroft, Mr. Percival   male  NA
## 769         769        0      3                  Moran, Mr. Daniel J   male  NA
##     SibSp Parch Ticket    Fare Cabin Embarked AgeGroup
## 47      1     0 370371 15.5000              Q     <NA>
## 110     1     0 371110 24.1500              Q     <NA>
## 129     1     1   2668 22.3583 F E69        C     <NA>
## 215     1     0 367229  7.7500              Q     <NA>
## 241     1     0   2665 14.4542              C     <NA>
## 242     1     0 367230 15.5000              Q     <NA>
## 365     1     0 370365 15.5000              Q     <NA>
## 452     1     0  65303 19.9667              S     <NA>
## 491     1     0  65304 19.9667              S     <NA>
## 613     1     0 367230 15.5000              Q     <NA>
## 640     1     0 376564 16.1000              S     <NA>
## 769     1     0 371110 24.1500              Q     <NA>

Based on our data above, the rest of the passangers are traveling with 1 SibSp, meaning they are on titanic with their Sibling, or Spouse. All of them except Miss. Peter are not with her Parent(s) or children. Because she does not hold the Mrs Title. it mean that 1 value on Parch column is indicating Parent instead of children. We can guess that she is young.

pob[is.na(pob$AgeGroup) & pob$Parch == 1, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup),]
##     PassengerId Survived Pclass                                 Name    Sex Age
## 47           47        0      3                    Lennon, Mr. Denis   male  NA
## 110         110        1      3                  Moran, Miss. Bertha female  NA
## 215         215        0      3                  Kiernan, Mr. Philip   male  NA
## 241         241        0      3                Zabour, Miss. Thamine female  NA
## 242         242        1      3       Murphy, Miss. Katherine "Kate" female  NA
## 365         365        0      3                  O'Brien, Mr. Thomas   male  NA
## 452         452        0      3      Hagland, Mr. Ingvald Olai Olsen   male  NA
## 491         491        0      3 Hagland, Mr. Konrad Mathias Reiersen   male  NA
## 613         613        1      3          Murphy, Miss. Margaret Jane female  NA
## 640         640        0      3           Thorneycroft, Mr. Percival   male  NA
## 769         769        0      3                  Moran, Mr. Daniel J   male  NA
##     SibSp Parch Ticket    Fare Cabin Embarked AgeGroup
## 47      1     0 370371 15.5000              Q     <NA>
## 110     1     0 371110 24.1500              Q     <NA>
## 215     1     0 367229  7.7500              Q     <NA>
## 241     1     0   2665 14.4542              C     <NA>
## 242     1     0 367230 15.5000              Q     <NA>
## 365     1     0 370365 15.5000              Q     <NA>
## 452     1     0  65303 19.9667              S     <NA>
## 491     1     0  65304 19.9667              S     <NA>
## 613     1     0 367230 15.5000              Q     <NA>
## 640     1     0 376564 16.1000              S     <NA>
## 769     1     0 371110 24.1500              Q     <NA>

The rest of them are Adults because they are traveling without their Parents.

pob[is.na(pob$AgeGroup), c('AgeGroup')] <-'Adult'
unique(pob$AgeGroup)
## [1] "Adult" "Young" "Elder"

No more missing value, great!

now we can move on to analizing the data.

Who most likely to survive?

Does gender affect the rate of surviving?

lets see the surviving chance based on gender. my guess is female most likely to survive the sinking of titanic, but to confirm my guess, we can aggregate the data.

sexagg <- aggregate(Pclass~Sex+Survived, pob, length)
survivorbysex <- sexagg[sexagg$Survived==1,]
survivorbysex$Rate <- label_percent()(survivorbysex$Pclass/891)
survivorbysex
##      Sex Survived Pclass Rate
## 3 female        1    233  26%
## 4   male        1    109  12%

Out of 342 Passengers who survived, 233 of them are female, more than double compared to the male passengers.

Does Ticket Class help you survive the titanic?

My assumption is that being on the 1st class give you access to the better part of the titanic, maybe quick access to lifeboat? lets find out.

aggregate(Sex~Pclass, pob, length)
##   Pclass Sex
## 1      1 216
## 2      2 184
## 3      3 491
aggregate(Sex~Pclass+Survived, pob, length)
##   Pclass Survived Sex
## 1      1        0  80
## 2      2        0  97
## 3      3        0 372
## 4      1        1 136
## 5      2        1  87
## 6      3        1 119

Now, compare the survivor of each class within their own classes.

paste('First Class chance of surviving within their class:', label_percent()(136/216))
## [1] "First Class chance of surviving within their class: 63%"
paste('Second Class chance of surviving within their class:', label_percent()(87/184))
## [1] "Second Class chance of surviving within their class: 47%"
paste('Third Class chance of survivingr within their class:', label_percent()(119/491))
## [1] "Third Class chance of survivingr within their class: 24%"

Almost 63% of 1st class passenger survived. followed by 2nd class passenger with 47% surviving rate, and lastly, the 3rd class with only 24%, remember this is within their own classes, not all classes combined. We can say that being on better classes actually give you better chance on surviving.

Does the titanic crews prioritize their passengers?

It makes the most sense that the ship crew does not have to pay the ticket fare.

crew <- pob[pob$Fare==0,]
aggregate(Sex~Survived, crew, length)
##   Survived Sex
## 1        0  14
## 2        1   1

unfortunately, only 1 crew survived. while i can not be sure on the cause of low surviving rate, it is safe to say the crews did their job to prioritize their passengers.

Does young people more likely to survive?

ageagg <- aggregate(Sex~Survived+AgeGroup, pob, length)
ageagg
##   Survived AgeGroup Sex
## 1        0    Adult 472
## 2        1    Adult 275
## 3        0    Elder  10
## 4        1    Elder   1
## 5        0    Young  67
## 6        1    Young  66
# Rate of surviving by agegroup within their agegroup
paste('Adult chance of surviving within their AgeGroup:', label_percent()(275/702))
## [1] "Adult chance of surviving within their AgeGroup: 39%"
paste('Elder chance of surviving within their AgeGroup:', label_percent()(1/11))
## [1] "Elder chance of surviving within their AgeGroup: 9%"
paste('Young chance of surviving within their AgeGroup:', label_percent()(66/113))
## [1] "Young chance of surviving within their AgeGroup: 58%"

There we have it, young people are more likely to survive compared to other age group, more than 20% the chance of the adult surviving the titanic.

Conclusion

The key points we can make are :

  1. A female likely to survive compare to male counterpart.

  2. Being on a higher class also improve the chance of surviving (maybe higher classes are prioritzed?), especially if you look at the 3rd class, while having more survivor compared to 2nd class, their rate of surviving are lower because 3rd class consist of 491 passengers.

  3. Young passengers have more chance surviving the titanic.

we have to confirm if this points are true.

pob_agg <- aggregate(Embarked~Survived+Sex+Pclass+AgeGroup, pob, length)

Survivor Data Frame

survivor <- pob_agg[pob_agg$Survived==1,]
survivor$Rate <- survivor$Embarked/891
survivor <- survivor[order(survivor$Rate, decreasing = T),]
survivor <- subset(survivor, select = -c(Embarked, Survived))
survivor
##       Sex Pclass AgeGroup        Rate
## 2  female      1    Adult 0.094276094
## 6  female      2    Adult 0.065095398
## 10 female      3    Adult 0.057239057
## 4    male      1    Adult 0.044893378
## 12   male      3    Adult 0.038159371
## 24 female      3    Young 0.023569024
## 26   male      3    Young 0.014590348
## 20 female      2    Young 0.013468013
## 22   male      2    Young 0.010101010
## 8    male      2    Adult 0.008978676
## 18 female      1    Young 0.007856341
## 19   male      1    Young 0.004489338
## 14   male      1    Elder 0.001122334

Not Survivor Data Frame

notsurvivor <- pob_agg[pob_agg$Survived==0,]
notsurvivor$Rate <- notsurvivor$Embarked/891
notsurvivor <- notsurvivor[order(notsurvivor$Rate, decreasing = T),]
notsurvivor <- subset(notsurvivor, select = -c(Embarked, Survived))
notsurvivor
##       Sex Pclass AgeGroup        Rate
## 11   male      3    Adult 0.288439955
## 7    male      2    Adult 0.097643098
## 3    male      1    Adult 0.080808081
## 9  female      3    Adult 0.053872054
## 25   male      3    Young 0.044893378
## 23 female      3    Young 0.026936027
## 5  female      2    Adult 0.006734007
## 13   male      1    Elder 0.005611672
## 16   male      3    Elder 0.003367003
## 1  female      1    Adult 0.002244669
## 15   male      2    Elder 0.002244669
## 21   male      2    Young 0.002244669
## 17 female      1    Young 0.001122334

It checks out!, Top 3 Survivor are female, on the opposite side, Top 3 of Those who did not survived are male. It also held true with our Class claim, but didnt i said that young people have higher rate of surviving? this table didnt tell that information because it takes all the Passenger into account, so refer back to the AgeGroup segment above for that info.