On the night of April 14-15, 1912, the Royal Mail Ship (RMS) Titanic sank during its maiden voyage, killing roughly 1,500 passengers and crew. Here I explore who was most likely to survive the tragedy. Do sex and passenger class matter to a passenger's survivability?
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Some columns are still not the right type, so we need to change their data types.
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
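The three conversions above could also be written as one step. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for `across()`); the `toy` data frame is a hypothetical stand-in built from the first few values shown in the `str()` output, not the real file.

```r
library(dplyr)

# Hypothetical stand-in for the first rows of train.csv
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L),
  Sex      = c("male", "female", "female", "female", "male")
)

# Convert all three categorical columns to factors in a single call
toy <- toy %>%
  mutate(across(c(Survived, Pclass, Sex), as.factor))

str(toy)
```

Listing the columns once keeps the conversion in a single place, which may help if more categorical columns (for example Embarked) are added later.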
Now the columns are the right types, so we can proceed to look at their summary statistics.
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 Length:891
## 1st Qu.: 7.91 Class :character Class :character
## Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
So we have:
891 passengers
549 deaths and 342 survivors
314 female passengers and 577 male passengers
3 passenger classes (Pclass)
head(titanic$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
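# Split "Last, Title. First" into three columns:
# first on ". " (title/first name boundary), then on ", " (last name/title boundary)
titanic_name <- titanic %>%
  separate(col = Name, into = c("Last2", "First"),
           sep = "\\. ") %>%
  separate(col = Last2, into = c("Last", "Title"),
           sep = ", ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [514].
summary(titanic_name)
## PassengerId Survived Pclass Last Title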
## Min. : 1.0 0:549 1:216 Length:891 Length:891
## 1st Qu.:223.5 1:342 2:184 Class :character Class :character
## Median :446.0 3:491 Mode :character Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## First Sex Age SibSp
## Length:891 female:314 Min. : 0.42 Min. :0.000
## Class :character male :577 1st Qu.:20.12 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.000
## Mean :29.70 Mean :0.523
## 3rd Qu.:38.00 3rd Qu.:1.000
## Max. :80.00 Max. :8.000
## NA's :177
## Parch Ticket Fare Cabin
## Min. :0.0000 Length:891 Min. : 0.00 Length:891
## 1st Qu.:0.0000 Class :character 1st Qu.: 7.91 Class :character
## Median :0.0000 Mode :character Median : 14.45 Mode :character
## Mean :0.3816 Mean : 32.20
## 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :6.0000 Max. :512.33
##
## Embarked
## Length:891
## Class :character
## Mode :character
##
##
##
##
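The warning from `separate()` above says that one name contained more than one ". " and the extra piece was dropped. One possible way to keep the full first-name part is the `extra = "merge"` argument of `tidyr::separate()`. This is a minimal sketch, not the approach used in this analysis; the third name in `toy` is a hypothetical example with a middle initial, not a real passenger.

```r
library(dplyr)
library(tidyr)

# Hypothetical names; the last one contains a second ". " after the initial "A"
toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Heikkinen, Miss. Laina",
           "Doe, Mrs. John (Jane A. Roe)")
)

split_names <- toy %>%
  # extra = "merge" splits at most once and keeps the remainder in "First"
  separate(col = Name, into = c("Last2", "First"),
           sep = "\\. ", extra = "merge") %>%
  separate(col = Last2, into = c("Last", "Title"), sep = ", ")

print(split_names)
```

With the default `extra = "warn"`, the text after the second ". " would be discarded, which is what the warning for row 514 reports.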
In the famous film Titanic (1997), Jack Dawson does not survive and Rose DeWitt Bukater survives the sinking. But is it true?
library(stringr)
titanic_name %>%
  filter(str_detect(First, "Jack") | str_detect(Last, "Dawson"))
titanic_name %>%
  filter(str_detect(First, "Rose") | str_detect(Last, "Bukater"))
It seems Jack Dawson and Rose DeWitt Bukater are fictional characters created by James Cameron. To confirm this finding, here is an image from a website confirming that the two are fictional.
titanic_name %>%
  group_by(Sex, Pclass) %>%
  count(Survived, name = "Total")
As we can see, female passengers had a higher survival rate than male passengers, which is consistent with the captain's order to save women and children first. However, the response to that order varied, and as a result some female passengers did not survive. As for Pclass, more of the first-class passengers survived because their cabins were closer to the lifeboats, and many of the emigrants in third class died because their poor English meant they did not understand what was happening.
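Raw counts per group can be hard to compare when the groups differ in size, so a survival rate per Sex and Pclass may be easier to read. The following is a minimal sketch on a small hypothetical data frame (`toy`, with made-up values), not on the real data; `Rate` is simply the share of rows with `Survived == 1`.

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 1, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

rates <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(Total = n(),                  # passengers in the group
            Rate  = mean(Survived == 1),  # share of the group that survived
            .groups = "drop")             # return an ungrouped result

print(rates)
```

Setting `.groups = "drop"` also silences the informational message that `summarise()` prints when the output is still grouped.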
titanic_name <- titanic_name %>%
  select(-Last, -First, -Ticket) %>%
  mutate(Fsize = SibSp + Parch + 1, Price = Fare / Fsize) %>%
  select(-Fare)
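In the step above, Fsize counts the passenger plus any siblings or spouses (SibSp) and parents or children (Parch) aboard, and Price divides Fare by that family size. Reading Price as a per-person fare assumes that Fare covers the whole family, which is an assumption and not something the data confirms. A small worked sketch, using SibSp, Parch, and Fare values visible in the `str()` output for rows 1, 4, and 5:

```r
library(dplyr)

# Values taken from rows 1, 4 and 5 of the str() output
toy <- data.frame(
  SibSp = c(1L, 1L, 0L),
  Parch = c(0L, 0L, 0L),
  Fare  = c(7.25, 53.1, 8.05)
)

toy <- toy %>%
  mutate(Fsize = SibSp + Parch + 1,  # the passenger plus relatives aboard
         Price = Fare / Fsize)       # fare per family member (assumed shared)

print(toy)
```

A passenger traveling alone has Fsize equal to 1, so Price equals Fare in that case.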
titanic_name %>%
  group_by(Pclass, Fsize) %>%
  summarise(totnum = n(), rate = sum(Survived == 1) / totnum)
## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument.
Based on the data, I would say that whether traveling alone or with family, Pclass 1 passengers generally had a higher survival probability.