The RMS Titanic Shipwreck

Have you ever seen the movie Titanic? The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this report we will use visualizations and machine learning to predict which of the people on board survived.

Data Structure

## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 1309 observations of 12 variables. To make things a bit more explicit since a couple of the variable names aren’t 100% illuminating, here’s what we’ve got to deal with:

Variable Name   Description
Survived        Survived (1) or died (0)
Pclass          Passenger’s class
Name            Passenger’s name
Sex             Passenger’s sex
Age             Passenger’s age
SibSp           Number of siblings/spouses aboard
Parch           Number of parents/children aboard
Ticket          Ticket number
Fare            Fare
Cabin           Cabin
Embarked        Port of embarkation

Data Summary

##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                 NA's   :418                                        
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin          
##  Length:1309        Min.   :  0.000   Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character  
##                     Mean   : 33.295                     
##                     3rd Qu.: 31.275                     
##                     Max.   :512.329                     
##                     NA's   :1                           
##    Embarked        
##  Length:1309       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Our dataset lists 1309 passengers in 3 passenger classes (1st, 2nd and 3rd). Survived is missing for 418 passengers, Age for 263, and Fare for 1. The maximum number of siblings/spouses aboard is 8 and the maximum number of parents/children aboard is 9. The median fare was $14.45, while the maximum fare paid was $512.33.

Data Visualizations

## Warning: Removed 177 rows containing non-finite values (stat_bin).

Remember, ‘1’ represents Survived and ‘0’ represents Died.

##    female      male 
## 0.7420382 0.1889081

Our first plot shows that females had a 74% survival rate while males had a 19% survival rate.
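The proportions above are per-group means of the Survived column. The report's own R code is not shown, so the following is only a minimal Python sketch of that calculation. It uses the first ten passengers from the Data Structure listing, with Sex for rows 5 to 10 decoded from the numeric coding in the Correlogram listing (1 = male, 2 = female, an inferred mapping). The helper name `survival_rate_by` is hypothetical.

```python
from collections import defaultdict

# First ten rows of the data (Sex decoded under the assumption 1 = male, 2 = female).
sex = ["male", "female", "female", "female", "male",
       "male", "male", "male", "female", "female"]
survived = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]


def survival_rate_by(groups, outcomes):
    """Return the mean of the 0/1 outcomes within each group."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += outcome
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}


rates = survival_rate_by(sex, survived)
print(rates)
```

In this ten-row sample every female survived and every male died, which is more extreme than the 74% and 19% obtained from the full training data.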

## Warning: Removed 177 rows containing non-finite values (stat_bin).

Combining the variables age and sex, our plot still shows that females are more likely to survive than males.

## Warning: Removed 12 rows containing missing values (geom_point).

This plot has 3 categories: 1 for 1st class passengers, 2 for 2nd class passengers, and 3 for 3rd class passengers. The 1st category contains many blue dots, which suggests that 1st class passengers were more likely to survive than the others.

Data processing and exploratory analysis

##         
##          Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme
##   female    0   0   0    1   1        0    1     0      0  260    2   1
##   male      1   4   1    0   7        1    0     2     61    0    0   0
##         
##           Mr Mrs  Ms Rev Sir the Countess
##   female   0 197   2   0   0            1
##   male   757   0   0   8   1            0

Explicitly, there was a Captain, 2 Majors, 4 Colonels, a Don and a Dona, 8 doctors (7 male and 1 female), and 8 Reverends. The most common titles were Mr (757), Miss (260), Mrs (197), and Master (61).

##    Master      Miss        Mr       Mrs   Officer   Royalty 
## 0.5750000 0.7027027 0.1566731 0.7936508 0.2631579 0.7500000

Conditioning on title, a Master has a 58% chance of surviving, a Miss 70%, a Mr 16%, a Mrs 79%, an Officer 26%, and Royalty 75%.
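The titles come from the Name column, where each entry has the form "Surname, Title. Given names". A minimal Python sketch of one way to pull the title out is shown here; it is not the report's own (unshown) R code. The regular expression, the function name `extract_title`, and the small alias table that folds rare spellings into common titles are all assumptions. How the report composes its Officer and Royalty groups is not stated, so that step is left out.

```python
import re

# Matches the text between the comma after the surname and the first period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

# Hypothetical merges of rare spellings into common titles (an assumption,
# not taken from the report).
TITLE_ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}


def extract_title(name):
    """Return the honorific embedded in a 'Surname, Title. Given names' string."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return None
    title = match.group(1).strip()
    return TITLE_ALIASES.get(title, title)


# The first four names from the Data Structure listing.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]
titles = [extract_title(name) for name in names]
print(titles)
```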

##         1         2         3         4         5         6         7 
## 0.3035382 0.5527950 0.5784314 0.7241379 0.2000000 0.1363636 0.3333333 
##         8        11 
## 0.0000000 0.0000000

A family of 4 has the highest survival rate at 72%. Passengers travelling alone had a 30% survival rate, and those in big families (5 to 11 members) had the lowest survival rate at 16%.

##     Alone       Big     Small 
## 0.3035382 0.1612903 0.5787671
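The family size groups can be sketched in Python as follows. This is an illustration rather than the report's own code: it assumes family size is SibSp + Parch + 1 (the passenger included), that "Alone" means a size of 1, and that "Small" covers sizes 2 to 4, since the report places big families at 5 to 11. The function names are hypothetical.

```python
def family_size(sibsp, parch):
    """Family size including the passenger (assumed: SibSp + Parch + 1)."""
    return sibsp + parch + 1


def family_bucket(size):
    """Map a family size to the Alone / Small / Big labels (assumed cut-offs)."""
    if size == 1:
        return "Alone"
    if size <= 4:
        return "Small"
    return "Big"


# First ten SibSp and Parch values from the Data Structure listing.
sibsp = [1, 1, 0, 1, 0, 0, 0, 3, 0, 1]
parch = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

sizes = [family_size(s, p) for s, p in zip(sibsp, parch)]
buckets = [family_bucket(size) for size in sizes]
print(list(zip(sizes, buckets)))
```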

##      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
##      Ticket Fare Cabin Embarked Title Surname Fsize FsizeD
## 1044   3701   NA              S    Mr  Storey     1  Alone
## Warning: Removed 1 rows containing non-finite values (stat_density).

##       1       2       3 
## 60.0000 15.0458  8.0500

One passenger (PassengerId 1044, a 3rd class passenger) has a missing value for Fare. We will solve this by imputing the median fare. The median fares are $60, $15, and $8 for 1st, 2nd, and 3rd class respectively.
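Median imputation by group can be sketched in Python as below. This is a hedged illustration, not the report's own R code: it assumes the missing fare is filled with the median of the passenger's own class, and the function name `impute_fare_by_class` is hypothetical. The sample contains only the first five rows from the Data Structure listing plus passenger 1044, so its medians differ from the full-data values quoted above.

```python
from statistics import median

# First five rows from the Data Structure listing, plus passenger 1044 with no fare.
passengers = [
    {"PassengerId": 1, "Pclass": 3, "Fare": 7.25},
    {"PassengerId": 2, "Pclass": 1, "Fare": 71.28},
    {"PassengerId": 3, "Pclass": 3, "Fare": 7.92},
    {"PassengerId": 4, "Pclass": 1, "Fare": 53.1},
    {"PassengerId": 5, "Pclass": 3, "Fare": 8.05},
    {"PassengerId": 1044, "Pclass": 3, "Fare": None},
]


def impute_fare_by_class(rows):
    """Fill missing fares with the median of the known fares in the same class."""
    known = {}
    for row in rows:
        if row["Fare"] is not None:
            known.setdefault(row["Pclass"], []).append(row["Fare"])
    medians = {pclass: median(fares) for pclass, fares in known.items()}
    # Return new rows so the input list is left untouched.
    return [
        {**row, "Fare": medians[row["Pclass"]]} if row["Fare"] is None else row
        for row in rows
    ]


filled = impute_fare_by_class(passengers)
print(filled[-1])
```

The sketch raises a `KeyError` if a class has no known fares at all, which seems preferable to silently inventing a value.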

Age

##  1  2  3 
## 39 29 24
## [1] 0

The median age was 39, 29 and 24 in the 1st, 2nd, and 3rd class respectively.

Children

##        
##           0   1
##   Adult 495 279
##   Child  54  63

In the training data, 54 children died and 63 survived, a 54% survival rate compared with 36% for adults. Let’s explore further to see how well a model can predict survival.

Correlogram

## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : num  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : num  1 2 2 2 1 1 1 1 2 2 ...
##  $ FsizeD  : num  1 1 2 1 2 2 2 3 1 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: num  1 3 1 1 1 2 1 1 1 3 ...
##  $ Title   : num  1 4 6 4 1 1 1 2 4 4 ...
##  $ Child   : num  1 1 1 1 1 1 1 2 1 2 ...

Modeling + prediction

## ans_rf
##   0   1 
## 608 283
## 
## Call:
##  randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare +      Embarked + Title + FsizeD + Child, data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.72%
## Confusion matrix:
##     0   1 class.error
## 0 504  45  0.08196721
## 1 104 238  0.30409357
## [1] 0.8327722

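Using a random forest, our classification model predicts that 608 of the 891 training passengers died and 283 survived. The out-of-bag accuracy is 83.3%, which corresponds to the 16.72% error rate reported above. The model is much better at recognising deaths (8% class error) than survivals (30% class error).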
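The summary figures can be recomputed from the confusion matrix alone. The Python sketch below is an illustration only (the report's R code is not shown) and assumes the usual layout of this output, with rows as actual classes and columns as predicted classes. The variable names are hypothetical.

```python
# Confusion matrix from the model output: rows are actual, columns are predicted.
confusion = {
    0: {0: 504, 1: 45},   # actually died
    1: {0: 104, 1: 238},  # actually survived
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)

accuracy = correct / total
error_rate = 1 - accuracy

# Share of each actual class that the model got wrong.
class_error = {
    label: 1 - row[label] / sum(row.values())
    for label, row in confusion.items()
}

# Column sums give how many passengers the model assigned to each class.
predicted_totals = {
    label: sum(row[label] for row in confusion.values())
    for label in confusion
}

print(round(accuracy, 7), round(error_rate, 4))
print(class_error)
print(predicted_totals)
```

The column sums reproduce the 608 and 283 predicted counts, and the row-wise error shares reproduce the `class.error` column.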

The mean decrease in Gini impurity measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest: every time a variable is used to split a node, the child nodes are purer than the parent, and these decreases are summed per variable and averaged over all trees. A related but separate measure, the mean decrease in accuracy, records how much the accuracy of the forest drops when the values of a single variable are permuted. Under either measure, a larger decrease marks a variable as more important for classifying the data. Here Title comes out as the most important variable.
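To make the Gini measure concrete, the Python sketch below computes the impurity decrease for a single hypothetical split of all 891 training passengers into adults and children, using the counts from the Children table. It is an illustration under stated assumptions, not the random forest's actual computation: the forest accumulates such decreases over every split on a variable and averages them over all trees, and Child is used here only because its counts appear in the report. The function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((count / total) ** 2 for count in counts)


def gini_decrease(parent, children):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * gini(child) for child in children)
    return gini(parent) - weighted


# (died, survived) counts from the Children table.
adults = (495, 279)
children = (54, 63)
everyone = (adults[0] + children[0], adults[1] + children[1])

decrease = gini_decrease(everyone, [adults, children])
print(round(gini(everyone), 4), round(decrease, 4))
```

A pure node has impurity 0 and an evenly mixed two-class node has impurity 0.5, so a split that separates survivors from non-survivors well produces a large decrease.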

Important Variables

Title is the most important variable for this analysis, followed by Sex.