On 15 April 1912, during the voyage of “unsinkable” RMS Titanic, the ship sank after colliding with an iceberg, causing death over 1500 of passangers and crews. while there are many factor that can contribute to surviving the incident, we are still able to group these people to get the general idea who is likely to survive by various attributes.
library(scales)
#Data Input
pob <- read.csv('train.csv')
#Assigning the right data type to certain attributes.
pob$Sex <- factor(pob$Sex)
pob$Embarked <- factor(pob$Embarked)
pob$Pclass <- factor(pob$Pclass)
pob$Survived <- factor(pob$Survived)
str(pob)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
summary(pob)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
We have missing values in Age and Cabin columns, but i want us to focus on Age, because we can group these people based on their age, so how we supposed to group these people with unknown Age values?, we also have SibSp and Parch columns, based on these we can estimate their age and assign them to the right age group. so lets create a new Column named ‘AgeGroup’ : Young (16 and under), Adult (17-64), Elder(65+)
pob$AgeGroup[pob$Age<18] <- 'Young'
pob$AgeGroup[pob$Age>=18 & pob$Age<65] <- 'Adult'
pob$AgeGroup[pob$Age>=65] <- 'Elder'
summary(pob$AgeGroup=='')
## Mode FALSE NA's
## logical 714 177
as expected, there are missing values on AgeGroup as well because we have missing values on Age. Lets subset the data based on other feature along with missing age or agegroup
Assuming legal age of marriage is above 18, they can be grouped as an adult.
library(stringr)
pob[is.na(pob$AgeGroup) & str_detect(pob$Name, pattern = "Mrs[.]"), c("AgeGroup")] <- 'Adult'
Master is a young man too young to hold a ‘Mr.’ Title.
pob[is.na(pob$AgeGroup) & str_detect(pob$Name, pattern = "Master[.]"), c("AgeGroup")] <- 'Young'
Assuming all crews are at legal working age, they can be grouped as an adult.
pob[is.na(pob$AgeGroup) & pob$Fare==0, c("AgeGroup")] <- 'Adult'
we can filter the passengers data based on ‘SibSp’ and ‘Parch’ Columns, if both column has 0 value, meaning they are traveling alone, most likely they are an adult, a minor and elder should be traveling alongside their guardian (hopefully).
pob[is.na(pob$AgeGroup) & (pob$SibSp==0 & pob$Parch==0), c('AgeGroup')] <- 'Adult'
Lets see who else has not assigned into AgeGroup.
pob[is.na(pob$AgeGroup) & pob$SibSp==1 & pob$Parch==2,]
## PassengerId Survived Pclass Name Sex
## 784 784 0 3 Johnston, Mr. Andrew G male
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## Age SibSp Parch Ticket Fare Cabin Embarked AgeGroup
## 784 NA 1 2 W./C. 6607 23.45 S <NA>
## 889 NA 1 2 W./C. 6607 23.45 S <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp==1 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup)& pob$SibSp>1 & pob$Parch==2,]
## PassengerId Survived Pclass Name Sex Age
## 181 181 0 3 Sage, Miss. Constance Gladys female NA
## 202 202 0 3 Sage, Mr. Frederick male NA
## 325 325 0 3 Sage, Mr. George John Jr male NA
## 793 793 0 3 Sage, Miss. Stella Anna female NA
## 847 847 0 3 Sage, Mr. Douglas Bullen male NA
## 864 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NA
## SibSp Parch Ticket Fare Cabin Embarked AgeGroup
## 181 8 2 CA. 2343 69.55 S <NA>
## 202 8 2 CA. 2343 69.55 S <NA>
## 325 8 2 CA. 2343 69.55 S <NA>
## 793 8 2 CA. 2343 69.55 S <NA>
## 847 8 2 CA. 2343 69.55 S <NA>
## 864 8 2 CA. 2343 69.55 S <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp>1 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup) & pob$SibSp==0 & pob$Parch==2,]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 594 594 0 3 Bourke, Miss. Mary female NA 0 2
## Ticket Fare Cabin Embarked AgeGroup
## 594 364848 7.75 Q <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp==0 & pob$Parch==2, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup) & pob$SibSp>1, ]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 49 49 0 3 Samaan, Mr. Youssef male NA 2 0
## 230 230 0 3 Lefebre, Miss. Mathilde female NA 3 1
## 302 302 1 3 McCoy, Mr. Bernard male NA 2 0
## 331 331 1 3 McCoy, Miss. Agnes female NA 2 0
## 410 410 0 3 Lefebre, Miss. Ida female NA 3 1
## 486 486 0 3 Lefebre, Miss. Jeannie female NA 3 1
## Ticket Fare Cabin Embarked AgeGroup
## 49 2662 21.6792 C <NA>
## 230 4133 25.4667 S <NA>
## 302 367226 23.2500 Q <NA>
## 331 367226 23.2500 Q <NA>
## 410 4133 25.4667 S <NA>
## 486 4133 25.4667 S <NA>
pob[is.na(pob$AgeGroup) & pob$SibSp>1, c('AgeGroup')] <-'Young'
we only have few missing value left on our data, lets go through it.
pob[is.na(pob$AgeGroup),]
## PassengerId Survived Pclass Name Sex Age
## 47 47 0 3 Lennon, Mr. Denis male NA
## 110 110 1 3 Moran, Miss. Bertha female NA
## 129 129 1 3 Peter, Miss. Anna female NA
## 215 215 0 3 Kiernan, Mr. Philip male NA
## 241 241 0 3 Zabour, Miss. Thamine female NA
## 242 242 1 3 Murphy, Miss. Katherine "Kate" female NA
## 365 365 0 3 O'Brien, Mr. Thomas male NA
## 452 452 0 3 Hagland, Mr. Ingvald Olai Olsen male NA
## 491 491 0 3 Hagland, Mr. Konrad Mathias Reiersen male NA
## 613 613 1 3 Murphy, Miss. Margaret Jane female NA
## 640 640 0 3 Thorneycroft, Mr. Percival male NA
## 769 769 0 3 Moran, Mr. Daniel J male NA
## SibSp Parch Ticket Fare Cabin Embarked AgeGroup
## 47 1 0 370371 15.5000 Q <NA>
## 110 1 0 371110 24.1500 Q <NA>
## 129 1 1 2668 22.3583 F E69 C <NA>
## 215 1 0 367229 7.7500 Q <NA>
## 241 1 0 2665 14.4542 C <NA>
## 242 1 0 367230 15.5000 Q <NA>
## 365 1 0 370365 15.5000 Q <NA>
## 452 1 0 65303 19.9667 S <NA>
## 491 1 0 65304 19.9667 S <NA>
## 613 1 0 367230 15.5000 Q <NA>
## 640 1 0 376564 16.1000 S <NA>
## 769 1 0 371110 24.1500 Q <NA>
Based on our data above, the rest of the passangers are traveling with 1 SibSp, meaning they are on titanic with their Sibling, or Spouse. All of them except Miss. Peter are not with her Parent(s) or children. Because she does not hold the Mrs Title. it mean that 1 value on Parch column is indicating Parent instead of children. We can guess that she is young.
pob[is.na(pob$AgeGroup) & pob$Parch == 1, c('AgeGroup')] <- 'Young'
pob[is.na(pob$AgeGroup),]
## PassengerId Survived Pclass Name Sex Age
## 47 47 0 3 Lennon, Mr. Denis male NA
## 110 110 1 3 Moran, Miss. Bertha female NA
## 215 215 0 3 Kiernan, Mr. Philip male NA
## 241 241 0 3 Zabour, Miss. Thamine female NA
## 242 242 1 3 Murphy, Miss. Katherine "Kate" female NA
## 365 365 0 3 O'Brien, Mr. Thomas male NA
## 452 452 0 3 Hagland, Mr. Ingvald Olai Olsen male NA
## 491 491 0 3 Hagland, Mr. Konrad Mathias Reiersen male NA
## 613 613 1 3 Murphy, Miss. Margaret Jane female NA
## 640 640 0 3 Thorneycroft, Mr. Percival male NA
## 769 769 0 3 Moran, Mr. Daniel J male NA
## SibSp Parch Ticket Fare Cabin Embarked AgeGroup
## 47 1 0 370371 15.5000 Q <NA>
## 110 1 0 371110 24.1500 Q <NA>
## 215 1 0 367229 7.7500 Q <NA>
## 241 1 0 2665 14.4542 C <NA>
## 242 1 0 367230 15.5000 Q <NA>
## 365 1 0 370365 15.5000 Q <NA>
## 452 1 0 65303 19.9667 S <NA>
## 491 1 0 65304 19.9667 S <NA>
## 613 1 0 367230 15.5000 Q <NA>
## 640 1 0 376564 16.1000 S <NA>
## 769 1 0 371110 24.1500 Q <NA>
The rest of them are Adults because they are traveling without their Parents.
pob[is.na(pob$AgeGroup), c('AgeGroup')] <-'Adult'
unique(pob$AgeGroup)
## [1] "Adult" "Young" "Elder"
No more missing value, great!
now we can move on to analizing the data.
lets see the surviving chance based on gender. my guess is female most likely to survive the sinking of titanic, but to confirm my guess, we can aggregate the data.
sexagg <- aggregate(Pclass~Sex+Survived, pob, length)
survivorbysex <- sexagg[sexagg$Survived==1,]
survivorbysex$Rate <- label_percent()(survivorbysex$Pclass/891)
survivorbysex
## Sex Survived Pclass Rate
## 3 female 1 233 26%
## 4 male 1 109 12%
Out of 342 Passengers who survived, 233 of them are female, more than double compared to the male passengers.
My assumption is that being on the 1st class give you access to the better part of the titanic, maybe quick access to lifeboat? lets find out.
aggregate(Sex~Pclass, pob, length)
## Pclass Sex
## 1 1 216
## 2 2 184
## 3 3 491
aggregate(Sex~Pclass+Survived, pob, length)
## Pclass Survived Sex
## 1 1 0 80
## 2 2 0 97
## 3 3 0 372
## 4 1 1 136
## 5 2 1 87
## 6 3 1 119
Now, compare the survivor of each class within their own classes.
paste('First Class chance of surviving within their class:', label_percent()(136/216))
## [1] "First Class chance of surviving within their class: 63%"
paste('Second Class chance of surviving within their class:', label_percent()(87/184))
## [1] "Second Class chance of surviving within their class: 47%"
paste('Third Class chance of survivingr within their class:', label_percent()(119/491))
## [1] "Third Class chance of survivingr within their class: 24%"
Almost 63% of 1st class passenger survived. followed by 2nd class passenger with 47% surviving rate, and lastly, the 3rd class with only 24%, remember this is within their own classes, not all classes combined. We can say that being on better classes actually give you better chance on surviving.
It makes the most sense that the ship crew does not have to pay the ticket fare.
crew <- pob[pob$Fare==0,]
aggregate(Sex~Survived, crew, length)
## Survived Sex
## 1 0 14
## 2 1 1
unfortunately, only 1 crew survived. while i can not be sure on the cause of low surviving rate, it is safe to say the crews did their job to prioritize their passengers.
ageagg <- aggregate(Sex~Survived+AgeGroup, pob, length)
ageagg
## Survived AgeGroup Sex
## 1 0 Adult 472
## 2 1 Adult 275
## 3 0 Elder 10
## 4 1 Elder 1
## 5 0 Young 67
## 6 1 Young 66
# Rate of surviving by agegroup within their agegroup
paste('Adult chance of surviving within their AgeGroup:', label_percent()(275/702))
## [1] "Adult chance of surviving within their AgeGroup: 39%"
paste('Elder chance of surviving within their AgeGroup:', label_percent()(1/11))
## [1] "Elder chance of surviving within their AgeGroup: 9%"
paste('Young chance of surviving within their AgeGroup:', label_percent()(66/113))
## [1] "Young chance of surviving within their AgeGroup: 58%"
There we have it, young people are more likely to survive compared to other age group, more than 20% the chance of the adult surviving the titanic.
The key points we can make are :
A female likely to survive compare to male counterpart.
Being on a higher class also improve the chance of surviving (maybe higher classes are prioritzed?), especially if you look at the 3rd class, while having more survivor compared to 2nd class, their rate of surviving are lower because 3rd class consist of 491 passengers.
Young passengers have more chance surviving the titanic.
we have to confirm if this points are true.
pob_agg <- aggregate(Embarked~Survived+Sex+Pclass+AgeGroup, pob, length)
Survivor Data Frame
survivor <- pob_agg[pob_agg$Survived==1,]
survivor$Rate <- survivor$Embarked/891
survivor <- survivor[order(survivor$Rate, decreasing = T),]
survivor <- subset(survivor, select = -c(Embarked, Survived))
survivor
## Sex Pclass AgeGroup Rate
## 2 female 1 Adult 0.094276094
## 6 female 2 Adult 0.065095398
## 10 female 3 Adult 0.057239057
## 4 male 1 Adult 0.044893378
## 12 male 3 Adult 0.038159371
## 24 female 3 Young 0.023569024
## 26 male 3 Young 0.014590348
## 20 female 2 Young 0.013468013
## 22 male 2 Young 0.010101010
## 8 male 2 Adult 0.008978676
## 18 female 1 Young 0.007856341
## 19 male 1 Young 0.004489338
## 14 male 1 Elder 0.001122334
Not Survivor Data Frame
notsurvivor <- pob_agg[pob_agg$Survived==0,]
notsurvivor$Rate <- notsurvivor$Embarked/891
notsurvivor <- notsurvivor[order(notsurvivor$Rate, decreasing = T),]
notsurvivor <- subset(notsurvivor, select = -c(Embarked, Survived))
notsurvivor
## Sex Pclass AgeGroup Rate
## 11 male 3 Adult 0.288439955
## 7 male 2 Adult 0.097643098
## 3 male 1 Adult 0.080808081
## 9 female 3 Adult 0.053872054
## 25 male 3 Young 0.044893378
## 23 female 3 Young 0.026936027
## 5 female 2 Adult 0.006734007
## 13 male 1 Elder 0.005611672
## 16 male 3 Elder 0.003367003
## 1 female 1 Adult 0.002244669
## 15 male 2 Elder 0.002244669
## 21 male 2 Young 0.002244669
## 17 female 1 Young 0.001122334
It checks out!, Top 3 Survivor are female, on the opposite side, Top 3 of Those who did not survived are male. It also held true with our Class claim, but didnt i said that young people have higher rate of surviving? this table didnt tell that information because it takes all the Passenger into account, so refer back to the AgeGroup segment above for that info.