Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Problem Description

The Aim of this project is to predict the survival of passengers on titanic using Machine learning algorithms. The model should be able to predict whether the passenger survived the sinking of Titanic or not.

Data

The data has been downloaded from Kaggle.

Data Dictionary

Variable
1.survival(Survival) 0 = No, 1 = Yes

2.pclass (Ticket class) 1 = 1st, 2 = 2nd, 3 = 3rd

3.sex (Sex)

4.Age (Age in years)

5.sibsp (# of siblings / spouses aboard the Titanic)

6.parch (# of parents / children aboard the Titanic)

7.ticket (Ticket number)

8.fare (Passenger fare)

9.cabin (Cabin number)

10.embarked(Port of Embarkation) C = Cherbourg, Q = Queenstown, S = Southampton

Load and check data

library(dplyr)        #data manipulation
library(ggplot2)      #visualization
library(naniar)       #missing value visualization
library(mice)         #missing value imputation
library(randomForest) #classification algorithm
library(rpart)        #classification algorithm
library(rpart.plot)   #decision tree visualization
library(caTools)      #splitting data into training and test set
#Load data
titanic_data<-read.csv("titanic.csv",stringsAsFactors = FALSE)
#Checking structure of the data
str(titanic_data)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic_data)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
head(titanic_data)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

The data has 891 observations and 12 variables. The names of the variables are as follows:

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

Exploratory Data Analysis

Missing data (NA) visualization

vis_miss(titanic_data)

Age has 19.9% NA values, as shown from the above plot.

Survival as a function of Age

ggplot(titanic_data, aes(x=Age, fill=factor(Survived))) +
  geom_histogram(bins=30)+
  facet_grid(.~Sex)+
  ggtitle("Age vs Survived")+
  scale_fill_discrete(name="Survived")

As we can see, chanced of survival of females is greater than that of males. Males below age 18 had considerable survival rate due to tha fact that they were children. ##Survival as a function of Sex

ggplot(titanic_data, aes(Sex, fill = factor(Survived))) + 
  geom_bar(position = 'fill')+
  xlab("Sex") +
  ylab("Frequency") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Sex vs Survived")

This plot illustrated that females had survival rate of ~75%, whereas men had survival rate < 25%.

Survival as a function of Pclass

ggplot(titanic_data, aes(x=Pclass, fill=factor(Survived))) +
  geom_bar(position="fill")+
  ylab("Frequency")+
  scale_fill_discrete(name="Survived")+
  ggtitle("Pclass vs Survived")

From this plot, we can observe that passengers in 1st class have survival rate of more than 50%. 2nd class passengers have survival rate of close to 50%. For 3rd class passengers, survival rate is just 25%.

Survival as a function of Fare

ggplot(titanic_data, aes(x=Fare, fill=factor(Survived)))+
  geom_histogram()+
  ggtitle("Fare vs survived")

From this graph, it looks like there is a strong correlation between fare and survival rate. People who had paid < $50 have less than 50% survival rate. As you move towards right towards higher fare prices, we see a higher survival rate.

Feature Engineering (Deriving Title from Name and Family Size from SibSp and Parch)

Deriving Title

#exploring name variable
head(titanic_data$Name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"
#grab title from passenger's name
titanic_data$Title<-gsub('(.*,)|(\\..*)','',titanic_data$Name)
unique(titanic_data$Title)
##  [1] " Mr"           " Mrs"          " Miss"         " Master"      
##  [5] " Don"          " Rev"          " Dr"           " Mme"         
##  [9] " Ms"           " Major"        " Lady"         " Sir"         
## [13] " Mlle"         " Col"          " Capt"         " the Countess"
## [17] " Jonkheer"

As we can see, We have 17 unique Titles.

table(titanic_data$Sex,titanic_data$Title)
##         
##           Capt  Col  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle
##   female     0    0    0   1         0     1      0       0   182     2
##   male       1    2    1   6         1     0      2      40     0     0
##         
##           Mme  Mr  Mrs  Ms  Rev  Sir  the Countess
##   female    1   0  125   1    0    0             1
##   male      0 517    0   0    6    1             0

We see, there can be some typing mistakes in titles (Mlle , Mme, Ms).

titanic_data$Title[titanic_data$Title==' Mlle']<-' Miss'
titanic_data$Title[titanic_data$Title==' Ms']<-' Miss'
titanic_data$Title[titanic_data$Title==' Mme']<-' Mrs'
table(titanic_data$Sex,titanic_data$Title)
##         
##           Capt  Col  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mr  Mrs
##   female     0    0    0   1         0     1      0       0   185   0  126
##   male       1    2    1   6         1     0      2      40     0 517    0
##         
##           Rev  Sir  the Countess
##   female    0    0             1
##   male      6    1             0

Also, there are number of rare titles, that we see here. We can combine all such titles to “Rare title”

rare_titles<-c(" Capt"," Col"," Don"," Dr"," Rev", " Jonkheer"," Lady"," Major"," the Countess"," Sir")
titanic_data$Title[titanic_data$Title %in% rare_titles] <- " Rare title"
table(titanic_data$Sex,titanic_data$Title)
##         
##           Master  Miss  Mr  Mrs  Rare title
##   female       0   185   0  126           3
##   male        40     0 517    0          20

Visualizing the relationship between Title and survival

ggplot(titanic_data, aes(x=factor(Title), fill=factor(Survived)))+
  geom_bar()+
  xlab("Title")+
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Title vs survived")

Children (Titles Miss and Master) have high survival rate. Women (Title Mrs) also have great chances of survival. For men (Title Mr), survival rate is pretty low.

Deriving Family Size

titanic_data$Familysize<-titanic_data$SibSp+titanic_data$Parch+1

Visualizing the relationship between Family Size and Survival

# Visualize the relationship between Familysize & survival
ggplot(titanic_data, aes(x = Familysize, fill = factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  scale_x_continuous(breaks=c(1:12)) +
  labs(x = 'Family Size') 

We can see, there is a survival penalty to solo travellers and those with family size above 4.

Missing Value Imputation

Exploring Cabin variable

head(titanic_data$Cabin,30)
##  [1] ""            "C85"         ""            "C123"        ""           
##  [6] ""            "E46"         ""            ""            ""           
## [11] "G6"          "C103"        ""            ""            ""           
## [16] ""            ""            ""            ""            ""           
## [21] ""            "D56"         ""            "A6"          ""           
## [26] ""            ""            "C23 C25 C27" ""            ""
sum(titanic_data$Cabin=="")
## [1] 687

Due to huge number of missing values, cabin variable will not be included for model building.

Exploring Embarked variable

unique(titanic_data$Embarked)
## [1] "S" "C" "Q" ""
which(titanic_data$Embarked=="")
## [1]  62 830
titanic_data[c(62,830),]
##     PassengerId Survived Pclass                                      Name
## 62           62        1      1                       Icard, Miss. Amelie
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Familysize
## 62  female  38     0     0 113572   80   B28           Miss          1
## 830 female  62     0     0 113572   80   B28            Mrs          1

There are 2 missing values for Embarked variable, for passenger 62 and 830. Both are solo travellers.

#visualizing embarked , fare and Pclass
fare_embark_data<-titanic_data %>% filter(PassengerId !=62 & PassengerId!=830)
str(fare_embark_data)
## 'data.frame':    889 obs. of  14 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
##  $ Title      : chr  " Mr" " Mrs" " Miss" " Mrs" ...
##  $ Familysize : num  2 2 1 2 1 1 1 5 3 2 ...
ggplot(fare_embark_data, aes(x=Embarked,y=Fare,fill=factor(Pclass)))+ 
  geom_boxplot()+
  geom_hline(aes(yintercept=80))

As we can see from the graph, The median fare for 1st class passenger departing from Charbourg (‘C’) coincides nicely with our missing embarkment passengers. Hence, For Pclass 1 and fare $80, the closest option is Embarkment port ‘C’.

#imputing missing values for Embarked
titanic_data$Embarked[c(62,830)]<-'C'
titanic_data[c(62,830),]
##     PassengerId Survived Pclass                                      Name
## 62           62        1      1                       Icard, Miss. Amelie
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Familysize
## 62  female  38     0     0 113572   80   B28        C  Miss          1
## 830 female  62     0     0 113572   80   B28        C   Mrs          1

Now, Embarked column is complete.

Exploring Age variable

sum(is.na(titanic_data$Age))
## [1] 177

There are 177 missing Age values. For Age imputation, Multiple imputation using chained equations from Mice package.

#Factorizing variables


titanic_data$Pclass<-factor(titanic_data$Pclass)
titanic_data$Sex<-factor(titanic_data$Sex)
titanic_data$Embarked<-factor(titanic_data$Embarked)
titanic_data$Title<-factor(titanic_data$Title)
titanic_data$Familysize<-factor(titanic_data$Familysize)
titanic_data$Survived<-factor(titanic_data$Survived)

#Set a random seed

set.seed(128)
mice_model<-mice(titanic_data[,names(titanic_data) %in% c('Pclass','Sex','Embarked','Title','Age','SibSp','Parch','Fare')],method='rf')
## 
##  iter imp variable
##   1   1  Age
##   1   2  Age
##   1   3  Age
##   1   4  Age
##   1   5  Age
##   2   1  Age
##   2   2  Age
##   2   3  Age
##   2   4  Age
##   2   5  Age
##   3   1  Age
##   3   2  Age
##   3   3  Age
##   3   4  Age
##   3   5  Age
##   4   1  Age
##   4   2  Age
##   4   3  Age
##   4   4  Age
##   4   5  Age
##   5   1  Age
##   5   2  Age
##   5   3  Age
##   5   4  Age
##   5   5  Age
mice_output<-complete(mice_model)
par(mfrow=c(1,2))
hist(titanic_data$Age, freq=F,main="Original data : Age", col="dark blue", ylim=c(0,0.04))
hist(mice_output$Age, freq=F,main="MICE output : Age", col="blue", ylim=c(0,0.04))

We can see the original age distribution is somewhat similar to the mice age distribution. Things look good, Hence, we can replace the age with mice output.

titanic_data$Age<-mice_output$Age

#Checking for missing values
sum(is.na(titanic_data$Age))
## [1] 0

No missing values in Age

# Visualizing to check for any missing values
vis_miss(titanic_data)

All the missing values have been handled. Data is ready for model building for prediction of survival.

Splitting Data into Training and Test set

#Keeping only desired variables for model building
#Age,Survived,Pclass,Sex,Fare,Familysize,Embarked and Title
titanic_sub<-titanic_data[,c(2,3,5,6,10,12,13,14)]

#Set a random seed
set.seed(123)
split = sample.split(titanic_sub$Survived, SplitRatio = 0.75)
training_set = subset(titanic_sub, split == TRUE)
test_set = subset(titanic_sub, split == FALSE)
str(training_set)
## 'data.frame':    668 obs. of  8 variables:
##  $ Survived  : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 2 1 1 ...
##  $ Pclass    : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 3 1 3 3 ...
##  $ Sex       : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 1 1 2 2 ...
##  $ Age       : num  22 38 26 35 11 27 4 58 20 39 ...
##  $ Fare      : num  7.25 71.28 7.92 53.1 8.46 ...
##  $ Embarked  : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 3 3 3 ...
##  $ Title     : Factor w/ 5 levels " Master"," Miss",..: 3 4 2 4 3 4 2 2 3 3 ...
##  $ Familysize: Factor w/ 9 levels "1","2","3","4",..: 2 2 1 2 1 3 3 1 1 7 ...

Model Building

Decision tree Classification

dec_tree_model<-rpart(Survived~ Age+Sex+Pclass+Familysize+Embarked+Title+Fare, data=training_set,method = "class")
pred_survived<-predict(dec_tree_model,newdata=test_set[-1],type="class")
#confusion matrix
table(test_set$Survived,pred_survived)
##    pred_survived
##       0   1
##   0 121  16
##   1  33  53
#Evaluating accuracy
mean(test_set$Survived==pred_survived) 
## [1] 0.7802691

So, decision tree model gave us an accuracy of 78.03%

#Plotting decision tree
rpart.plot(dec_tree_model)

plotcp(dec_tree_model)

Random forest Classification

set.seed(123)
rf_model<-randomForest(Survived~Age+Sex+Pclass+Familysize+Embarked+Title+Fare,data = training_set)
#prediction
rf_pred<-predict(rf_model,test_set)
mean(test_set$Survived==rf_pred)
## [1] 0.8071749
print(rf_model)
## 
## Call:
##  randomForest(formula = Survived ~ Age + Sex + Pclass + Familysize +      Embarked + Title + Fare, data = training_set) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 15.42%
## Confusion matrix:
##     0   1 class.error
## 0 379  33  0.08009709
## 1  70 186  0.27343750

With this model, accuracy has improved to 80.72%. The summary shows an estimated error of 15.42% for the model.

Plot rf model

plot(rf_model)
legend('topright', colnames(rf_model$err.rate),col=1:3,fill=1:3)

The black line shows the overall error rate which falls below 20%. The red and green lines show the error rate for ‘died’ and ‘survived’ respectively.

Variable importance

varImpPlot(rf_model,main="RF_Model")

This shows the relative variable importance by plotting the mean decrease in Gini calclulated among all the trees.

importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'white') +
  labs(x = 'Variables') +
  coord_flip() 

Title, Fare and Sex are the topmost important variables in our prediction of survival.

Prediction

Making th final predictions for test set using Random Forest classifier

test_set$pred_survived<-rf_pred
head(test_set)
##    Survived Pclass    Sex Age    Fare Embarked   Title Familysize
## 5         0      3   male  35  8.0500        S      Mr          1
## 7         0      1   male  54 51.8625        S      Mr          1
## 8         0      3   male   2 21.0750        S  Master          5
## 10        1      2 female  14 30.0708        C     Mrs          2
## 15        0      3 female  14  7.8542        S    Miss          1
## 21        0      2   male  35 26.0000        S      Mr          1
##    pred_survived
## 5              0
## 7              0
## 8              0
## 10             1
## 15             1
## 21             0