Brief History of the Titanic

On the night of April 14-15, 1912, the Royal Mail Ship (RMS) Titanic struck an iceberg and sank during its maiden voyage, killing more than 1,500 passengers and crew. In this analysis I explore who was most likely to survive the tragedy. Do sex and passenger class matter for a passenger's chance of survival?

Load the Packages and Read the Data

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Some columns are not stored in the right type yet: Survived, Pclass, and Sex are categorical variables, so we need to convert them to factors.

Change the Data Types

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)

str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
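
As an aside, the same conversion can be written in a single step with dplyr. This is only a minimal sketch, assuming dplyr 1.0.0 or later (for across()); the toy data frame holds the first five values of the three columns as printed by str() above and is not the full data set.

```r
library(dplyr)

# First five rows of the three categorical columns, copied from the str() output
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L),
  Sex      = c("male", "female", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Convert all three columns to factors at once
toy <- toy %>%
  mutate(across(c(Survived, Pclass, Sex), as.factor))

str(toy)
```

Both versions give the same result; the column-by-column assignments used in this post are easier to read when only a few columns are involved.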

Now the columns have the right types, so we can look at the summary statistics.

summary(titanic)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin             Embarked        
##  Min.   :  0.00   Length:891         Length:891        
##  1st Qu.:  7.91   Class :character   Class :character  
##  Median : 14.45   Mode  :character   Mode  :character  
##  Mean   : 32.20                                        
##  3rd Qu.: 31.00                                        
##  Max.   :512.33                                        
## 

So we have:

  1. 891 passengers in train.csv

  2. 549 deaths and 342 survivors

  3. 314 female passengers and 577 male passengers

  4. 3 passenger classes (Pclass): 216 in first, 184 in second, and 491 in third class
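
The counts are easier to compare as shares. A small sketch that only re-uses the numbers from the summary() output above (the vector names are my own labels):

```r
# Counts copied from the summary() output
survival <- c(died = 549, survived = 342)
sex      <- c(female = 314, male = 577)

# prop.table() divides each count by the total (891 in both cases)
surv_share <- prop.table(survival)
sex_share  <- prop.table(sex)

round(surv_share, 3)
round(sex_share, 3)
```

So roughly 38% of the passengers in this file survived, and about 35% of them were female.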

Check Name and Separate It into Last Name, Title, and First Name

head(titanic$Name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"
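# Split "Last, Title. First" in two steps.
# extra = "merge" keeps any further ". " or ", " inside the remainder
# instead of discarding it (the name in row 514 contains an extra ". ").
titanic_name <- titanic %>%
  separate(col = Name, into = c("Last2", "First"),
           sep = "\\. ", extra = "merge") %>%
  separate(col = Last2, into = c("Last", "Title"),
           sep = ", ", extra = "merge")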
summary(titanic_name)
##   PassengerId    Survived Pclass      Last              Title          
##  Min.   :  1.0   0:549    1:216   Length:891         Length:891        
##  1st Qu.:223.5   1:342    2:184   Class :character   Class :character  
##  Median :446.0            3:491   Mode  :character   Mode  :character  
##  Mean   :446.0                                                         
##  3rd Qu.:668.5                                                         
##  Max.   :891.0                                                         
##                                                                        
##     First               Sex           Age            SibSp      
##  Length:891         female:314   Min.   : 0.42   Min.   :0.000  
##  Class :character   male  :577   1st Qu.:20.12   1st Qu.:0.000  
##  Mode  :character                Median :28.00   Median :0.000  
##                                  Mean   :29.70   Mean   :0.523  
##                                  3rd Qu.:38.00   3rd Qu.:1.000  
##                                  Max.   :80.00   Max.   :8.000  
##                                  NA's   :177                    
##      Parch           Ticket               Fare           Cabin          
##  Min.   :0.0000   Length:891         Min.   :  0.00   Length:891        
##  1st Qu.:0.0000   Class :character   1st Qu.:  7.91   Class :character  
##  Median :0.0000   Mode  :character   Median : 14.45   Mode  :character  
##  Mean   :0.3816                      Mean   : 32.20                     
##  3rd Qu.:0.0000                      3rd Qu.: 31.00                     
##  Max.   :6.0000                      Max.   :512.33                     
##                                                                         
##    Embarked        
##  Length:891        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
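
Names that contain an extra ". " (for example an initial inside a parenthesised maiden name) can trip up a two-piece split. Here is a minimal sketch of how the extra = "merge" argument of tidyr's separate() keeps the remainder intact. The first two names come from the data; the third is a hypothetical name made up for illustration.

```r
library(dplyr)
library(tidyr)

toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Heikkinen, Miss. Laina",
           "Doe, Mrs. John (Jane A. Roe)"),  # hypothetical, has a second ". "
  stringsAsFactors = FALSE
)

toy_split <- toy %>%
  # Split off the last name first, keeping everything after the first ", "
  separate(col = Name, into = c("Last", "Rest"),
           sep = ", ", extra = "merge") %>%
  # Then split the title from the first name at the first ". " only
  separate(col = Rest, into = c("Title", "First"),
           sep = "\\. ", extra = "merge")

toy_split
```

With extra = "merge" the split happens at most once per call, so the hypothetical third name keeps "John (Jane A. Roe)" whole in First.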

The Famous Jack Dawson and Rose DeWitt Bukater

In the famous film Titanic (1997), Jack Dawson does not survive while Rose DeWitt Bukater survives the sinking. But were they real passengers?

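library(stringr)  # already attached by tidyverse; loaded explicitly for str_detect()
# Anyone whose first name contains "Jack" or whose last name contains "Dawson"
titanic_name %>%
  filter(str_detect(First, "Jack") | str_detect(Last, "Dawson"))
# Anyone whose first name contains "Rose" or whose last name contains "Bukater"
titanic_name %>%
  filter(str_detect(First, "Rose") | str_detect(Last, "Bukater"))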

It seems Jack Dawson and Rose DeWitt Bukater are fictional characters created by James Cameron. To confirm my findings, here is an image from a website confirming that the two are fictional.
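
Note that the filters use `|`, so they match anyone with "Jack" in the first name or "Dawson" in the last name, and str_detect() also matches substrings. A stricter check requires both conditions at once. This is a minimal sketch on hypothetical rows (not taken from the data), only to show the difference between `|` and `&`:

```r
library(dplyr)
library(stringr)

# Hypothetical rows, for illustration only
toy <- data.frame(
  Last  = c("Dawson", "Smith", "Dawson"),
  First = c("Jack", "Jackson", "Mary"),
  stringsAsFactors = FALSE
)

# Either condition: returns all three rows ("Jackson" contains "Jack")
either <- toy %>%
  filter(str_detect(First, "Jack") | str_detect(Last, "Dawson"))

# Both conditions: returns only the row that matches first and last name
both <- toy %>%
  filter(str_detect(First, "Jack") & str_detect(Last, "Dawson"))

nrow(either)
nrow(both)
```

The broad `|` version is useful for a first look; the `&` version is what answers the question "is this exact person in the list?".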

Survival Chance Based on Sex and Passenger Class

titanic_name %>%
  group_by(Sex, Pclass) %>%
  count(Survived, name = "Total")

As the counts show, female passengers survived at a higher rate than male passengers, which is consistent with the captain's order to save women and children first. The order was not carried out uniformly, however, and some female passengers still did not survive. As for Pclass, a larger share of first-class passengers survived, possibly because their cabins were closer to the lifeboats, while many of the emigrants in third class died, possibly in part because their limited English meant they did not understand what was happening.
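
Counts alone can hide differences in group size, so it can help to turn them into rates. This is a minimal sketch of that calculation with dplyr; the toy rows are made up (they are NOT Titanic records) and the column names Total and Rate are my own choice.

```r
library(dplyr)

# Made-up toy rows, only to show the mechanics of the calculation
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 1, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

rates <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(Total = n(),                  # passengers in the group
            Rate  = mean(Survived == 1),  # share of the group that survived
            .groups = "drop")

rates
```

The same pipeline should work on titanic_name unchanged, since comparing the factor Survived with 1 also gives TRUE for survivors.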

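# Drop columns we no longer need, then derive:
#   Fsize = family size on board (siblings/spouses + parents/children + the passenger)
#   Price = fare per family member
titanic_name <- titanic_name %>%
  select(-Last, -First, -Ticket) %>%
  mutate(Fsize = SibSp + Parch + 1, Price = Fare / Fsize) %>%
  select(-Fare)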
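
The two derived columns are simple arithmetic. Here is a worked sketch on the first five rows, using the SibSp, Parch, and Fare values as printed (rounded) by str() earlier. Dividing Fare by family size assumes the fare covers the whole family, which is this analysis's assumption rather than something the data states.

```r
# First five rows as printed by str(titanic); Fare values are rounded there
first5 <- data.frame(
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.28, 7.92, 53.1, 8.05)
)

# Family size: siblings/spouses + parents/children + the passenger
first5$Fsize <- first5$SibSp + first5$Parch + 1

# Fare split evenly across the family
first5$Price <- first5$Fare / first5$Fsize

first5
```

For the first passenger this gives Fsize = 1 + 0 + 1 = 2 and Price = 7.25 / 2 = 3.625.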

# Group size and survival rate for each class / family-size combination
titanic_name %>%
  group_by(Pclass, Fsize) %>%
  summarise(totnum = n(), rate = sum(Survived == 1) / totnum)
## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument.

Based on these rates, whether a passenger travelled alone or with family, first class (Pclass 1) generally shows a higher survival rate.