On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew on board. The tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
The main objective is to predict who survived the sinking using machine learning algorithms.
Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = First Class; 2 = Second Class; 3 = Third Class)
Name - Name of the Passenger
Sex - Gender of the Passenger
Age - Age of Passenger
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Loading the data and checking the structure.
library(readxl)
titanic<-read_excel("Z:/Data Science data/Semester 2/Titanic.xls")
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
Summary:
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Converting the following variables to factors:
Survived: it is stored as numeric, but as the dependent variable we want it as a factor; for readability we also recode 0 as "No" and 1 as "Yes".
Pclass: as per the description, passenger class is categorical.
Sex: as per the description, sex is categorical but is stored as character data.
Embarked: it is the port where the passenger boarded the ship, so it should also be a factor.
titanic$Survived<-ifelse(titanic$Survived==0,"No","Yes")
titanic$Pclass <- factor(titanic$Pclass,
levels = c(1,2,3),
labels = c("First Class", "Second Class", "Third Class"))
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : chr [1:891] "No" "Yes" "Yes" "Yes" ...
## $ Pclass : Factor w/ 3 levels "First Class",..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
fact<-c("Survived","Pclass","Sex","Embarked")
titanic[,fact]<-lapply(titanic[,fact], as.factor)
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "First Class",..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Checking the summary once again:
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 No :549 First Class :216 Length:891 female:314
## 1st Qu.:223.5 Yes:342 Second Class:184 Class :character male :577
## Median :446.0 Third Class :491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
Checking whether the data contains any missing values:
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
As we can see, the Age, Cabin and Embarked columns have missing values. We drop Cabin, since the majority of its values (687 of 891) are missing. For Age and Embarked we impute the missing values with the median (28) and the mode ("S") respectively, and we then convert Age into categorical age groups.
We also delete Name, PassengerId and Ticket, as they have no bearing on the survival rate.
# Dropping the columns that carry no predictive value
del_col<-c("PassengerId","Name","Cabin","Ticket")
titanic[,del_col]<-list(NULL)
titanic1<-titanic
# Imputing the 177 missing ages with the median age (28) and binning into groups
titanic1$Age[is.na(titanic1$Age)]<-28
titanic1$Age<-cut(titanic1$Age,breaks = c(0,20,30,40,Inf),labels = c("Teen", "Young","Adult","Old"))
# Imputing the two missing Embarked values with the mode ("S")
titanic1$Embarked[is.na(titanic1$Embarked)]<-"S"
# Scaling the numeric columns so each carries equal weight
col_sca=c("SibSp","Parch","Fare")
titanic1[,col_sca]<-lapply(titanic1[,col_sca], scale)
colSums(is.na(titanic1))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
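The imputation values used above (28 for Age and "S" for Embarked) are hard-coded; as a small sketch, they could equally be derived from the data itself, which keeps the preprocessing robust if the data changes (the variable names below are just for illustration):
# Deriving the imputation values rather than hard-coding them
# (uses the original titanic data, before Age was binned)
age_median <- median(titanic$Age, na.rm = TRUE)            # 28
embark_mode <- names(which.max(table(titanic$Embarked)))   # "S"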
As we can see, there are no missing values left, so we can proceed with some EDA.
library(ggplot2)
ggplot(titanic1,aes(x=Age)) + geom_bar(aes(fill=Survived)) +labs(x = "Age Group",y="Frequency",
title = "Age Wise Distribution")
From the above graph, we can see that around 45% of the passengers fall in the 20-30 age group (note that this group is inflated because the 177 missing ages were imputed with the median of 28).
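As a quick numeric check of that reading, the share of each age group can be computed directly; a small sketch using the binned Age column:
# Percentage of passengers in each age group
round(prop.table(table(titanic1$Age)) * 100, 1)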
ggplot(titanic1,aes(x = Sex)) + geom_bar(aes(fill = Survived)) +labs(x="Gender",y="Frequency",
title = "Gender wise Distribution")
From the above graph, we can observe that female survivors outnumber male survivors roughly two to one, and that only about 20% of the males aboard survived.
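The same claim can be verified numerically; a small sketch of the survival rate within each gender (row proportions of the gender-by-survival table):
# Row-wise proportions: survival rate within each gender
round(prop.table(table(titanic1$Sex, titanic1$Survived), margin = 1), 2)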
ggplot(titanic1,aes(x=Pclass)) + geom_bar(aes(fill=Survived)) +labs(x="Passenger Class",y = "Frequency",
title = "Passenger Class wise Distribution")
From the graph, we can observe that first-class passengers survived at a higher rate than the other classes, and that most of the non-survivors were third-class passengers.
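Similarly, the survival rate per passenger class can be computed directly; a small sketch using tapply:
# Proportion of survivors within each passenger class
tapply(titanic1$Survived == "Yes", titanic1$Pclass, mean)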
library(caret)
## Loading required package: lattice
set.seed(100) # keeping the split constant in every iteration
index<-createDataPartition(titanic1$Survived,p=0.7,list = F)
train_data<-titanic1[index,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test_data<-titanic1[-index,]
dim(train_data) # dimension of training data
## [1] 625 8
dim(test_data) # dimension of testing data
## [1] 266 8
model_lr<-glm(Survived~.,data = train_data,family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1341 -0.6823 -0.4232 0.6457 2.4750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.27317 0.46197 7.085 1.39e-12 ***
## PclassSecond Class -0.81696 0.35699 -2.288 0.02211 *
## PclassThird Class -2.05281 0.36346 -5.648 1.62e-08 ***
## Sexmale -2.56343 0.23197 -11.051 < 2e-16 ***
## AgeYoung -0.92639 0.29500 -3.140 0.00169 **
## AgeAdult -0.78566 0.35989 -2.183 0.02903 *
## AgeOld -1.57372 0.37698 -4.175 2.99e-05 ***
## SibSp -0.36193 0.13850 -2.613 0.00897 **
## Parch -0.05578 0.12703 -0.439 0.66055
## Fare 0.13387 0.16021 0.836 0.40338
## EmbarkedQ 0.44357 0.44781 0.991 0.32191
## EmbarkedS -0.23071 0.28654 -0.805 0.42073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 569.89 on 613 degrees of freedom
## AIC: 593.89
##
## Number of Fisher Scoring iterations: 5
As we can see, Parch, Fare and Embarked are not significant; in particular, survival appears to have little or no relationship with the port where a passenger boarded.
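Before dropping them, this can be double-checked with a sequential analysis-of-deviance (likelihood-ratio) test on the fitted model; a quick sketch using the model_lr object above:
# Sequential likelihood-ratio tests: terms whose deviance reduction is not
# significant are candidates for removal
anova(model_lr, test = "Chisq")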
Therefore, we delete the insignificant variables and rebuild the model.
del_col1<-c("Fare","Embarked","Parch")
titanic1[,del_col1]<-list(NULL)
# Splitting into train and test again after deleting those columns
set.seed(100) # keeping the split constant
index<-createDataPartition(titanic1$Survived,p=0.7,list = F)
train_data<-titanic1[index,]
test_data<-titanic1[-index,]
model_lr<-glm(Survived~.,data = train_data,family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1667 -0.6678 -0.4561 0.6328 2.4503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2308 0.3876 8.335 < 2e-16 ***
## PclassSecond Class -1.0329 0.3055 -3.381 0.000721 ***
## PclassThird Class -2.1834 0.2886 -7.566 3.84e-14 ***
## Sexmale -2.6108 0.2271 -11.498 < 2e-16 ***
## AgeYoung -0.8235 0.2850 -2.890 0.003854 **
## AgeAdult -0.7375 0.3552 -2.076 0.037862 *
## AgeOld -1.5637 0.3736 -4.186 2.84e-05 ***
## SibSp -0.3712 0.1275 -2.911 0.003605 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 574.38 on 617 degrees of freedom
## AIC: 590.38
##
## Number of Fisher Scoring iterations: 5
As we can see, all the remaining variables are significant, so we can proceed to evaluate the model's accuracy.
# Predicting on the training data to choose a cutoff for the test data
pred_tr_lr<-fitted(model_lr)
# Choosing the cutoff with the help of the ROC curve (ROCR package)
library(ROCR)
pred<-prediction(pred_tr_lr,train_data$Survived)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
From the plot, the cutoffs 0.35 and 0.4 give sensitivity and specificity values that are close to each other, so let us check both with the help of a confusion matrix.
pred_tr_lr1<-ifelse(pred_tr_lr<0.35,"No","Yes")
pred_tr_lr1<-as.factor(pred_tr_lr1)
confusionMatrix(pred_tr_lr1,train_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 298 49
## Yes 87 191
##
## Accuracy : 0.7824
## 95% CI : (0.748, 0.8142)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5534
##
## Mcnemar's Test P-Value : 0.00151
##
## Sensitivity : 0.7740
## Specificity : 0.7958
## Pos Pred Value : 0.8588
## Neg Pred Value : 0.6871
## Prevalence : 0.6160
## Detection Rate : 0.4768
## Detection Prevalence : 0.5552
## Balanced Accuracy : 0.7849
##
## 'Positive' Class : No
##
pred_tr_lr2<-ifelse(pred_tr_lr<0.4,"No","Yes")
pred_tr_lr2<-as.factor(pred_tr_lr2)
confusionMatrix(pred_tr_lr2,train_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 301 55
## Yes 84 185
##
## Accuracy : 0.7776
## 95% CI : (0.7429, 0.8096)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5404
##
## Mcnemar's Test P-Value : 0.01755
##
## Sensitivity : 0.7818
## Specificity : 0.7708
## Pos Pred Value : 0.8455
## Neg Pred Value : 0.6877
## Prevalence : 0.6160
## Detection Rate : 0.4816
## Detection Prevalence : 0.5696
## Balanced Accuracy : 0.7763
##
## 'Positive' Class : No
##
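Rather than reading the crossing point off the coloured ROC curve, it can also be located programmatically from the ROCR objects created above; a minimal sketch, assuming the pred object from prediction() and ROCR's "sens" and "spec" performance measures:
# Find the cutoff where sensitivity and specificity are closest to each other
sens <- performance(pred, "sens")
spec <- performance(pred, "spec")
cutoffs <- sens@x.values[[1]]
cutoffs[which.min(abs(sens@y.values[[1]] - spec@y.values[[1]]))]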
Since we want the cutoff at which sensitivity and specificity are closest to each other, we choose 0.4 and use it to predict on the test data.
pred_lr<-predict(model_lr,test_data,type="response")
pred_lr1<-ifelse(pred_lr<0.4,"No","Yes")
pred_lr1<-as.factor(pred_lr1)
confusionMatrix(pred_lr1,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 129 21
## Yes 35 81
##
## Accuracy : 0.7895
## 95% CI : (0.7355, 0.8369)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.118e-09
##
## Kappa : 0.566
##
## Mcnemar's Test P-Value : 0.08235
##
## Sensitivity : 0.7866
## Specificity : 0.7941
## Pos Pred Value : 0.8600
## Neg Pred Value : 0.6983
## Prevalence : 0.6165
## Detection Rate : 0.4850
## Detection Prevalence : 0.5639
## Balanced Accuracy : 0.7904
##
## 'Positive' Class : No
##
The logistic regression model gives an accuracy of 0.7895, or 78.95%.
Building a decision tree model and plotting the tree.
library(rpart)
library(rpart.plot)
model_tr<-rpart(Survived~.,data = train_data)
rpart.plot(model_tr)
Checking the accuracy of the model:
pred_tr<-predict(model_tr,test_data,type="class")
confusionMatrix(pred_tr,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 154 36
## Yes 10 66
##
## Accuracy : 0.8271
## 95% CI : (0.7762, 0.8705)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 6.692e-14
##
## Kappa : 0.6158
##
## Mcnemar's Test P-Value : 0.0002278
##
## Sensitivity : 0.9390
## Specificity : 0.6471
## Pos Pred Value : 0.8105
## Neg Pred Value : 0.8684
## Prevalence : 0.6165
## Detection Rate : 0.5789
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.7930
##
## 'Positive' Class : No
##
The decision tree model gives an accuracy of 0.8271, or 82.71%.
Repeating the same steps for a random forest: building the model and predicting on the test data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
model_rf<-randomForest(Survived~.,data = train_data)
model_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = train_data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.16%
## Confusion matrix:
## No Yes class.error
## No 338 47 0.1220779
## Yes 79 161 0.3291667
plot(model_rf)
pred_rf<-predict(model_rf,test_data)
confusionMatrix(pred_rf,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 152 37
## Yes 12 65
##
## Accuracy : 0.8158
## 95% CI : (0.7639, 0.8605)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.592e-12
##
## Kappa : 0.5915
##
## Mcnemar's Test P-Value : 0.0006068
##
## Sensitivity : 0.9268
## Specificity : 0.6373
## Pos Pred Value : 0.8042
## Neg Pred Value : 0.8442
## Prevalence : 0.6165
## Detection Rate : 0.5714
## Detection Prevalence : 0.7105
## Balanced Accuracy : 0.7820
##
## 'Positive' Class : No
##
The random forest model gives an accuracy of 0.8158, or 81.58%.
Repeating the same steps for a support vector machine: building the model and predicting on the test data.
library(e1071)
model_svm<-svm(Survived~.,data = train_data)
pred_svm<-predict(model_svm,test_data)
confusionMatrix(pred_svm,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 147 32
## Yes 17 70
##
## Accuracy : 0.8158
## 95% CI : (0.7639, 0.8605)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.592e-12
##
## Kappa : 0.5993
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.8963
## Specificity : 0.6863
## Pos Pred Value : 0.8212
## Neg Pred Value : 0.8046
## Prevalence : 0.6165
## Detection Rate : 0.5526
## Detection Prevalence : 0.6729
## Balanced Accuracy : 0.7913
##
## 'Positive' Class : No
##
The support vector machine model gives an accuracy of 0.8158, or 81.58%.
Repeating the same steps for Naive Bayes: building the model and predicting on the test data.
model_nb<-naiveBayes(Survived~.,data = train_data)
pred_nb<-predict(model_nb,test_data)
confusionMatrix(pred_nb,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 146 29
## Yes 18 73
##
## Accuracy : 0.8233
## 95% CI : (0.7721, 0.8672)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.974e-13
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.1447
##
## Sensitivity : 0.8902
## Specificity : 0.7157
## Pos Pred Value : 0.8343
## Neg Pred Value : 0.8022
## Prevalence : 0.6165
## Detection Rate : 0.5489
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.8030
##
## 'Positive' Class : No
##
The Naive Bayes model gives an accuracy of 0.8233, or 82.33%.
After performing these classification techniques and comparing their accuracies, we can conclude that all the models achieved accuracies ranging from roughly 78% to 83%. The decision tree gave the highest accuracy at 82.71%, followed closely by Naive Bayes at 82.33%, while logistic regression trailed at 78.95%.
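To summarise, the test-set accuracies can be pulled into a single comparison table; a small sketch re-using the prediction objects created above (the acc helper is defined here just for this purpose):
# Collecting the test-set accuracies of all five models into one table
acc <- function(pred, ref) mean(pred == ref)
results <- data.frame(
  Model    = c("Logistic Regression", "Decision Tree", "Random Forest",
               "Support Vector Machine", "Naive Bayes"),
  Accuracy = c(acc(pred_lr1, test_data$Survived),
               acc(pred_tr,  test_data$Survived),
               acc(pred_rf,  test_data$Survived),
               acc(pred_svm, test_data$Survived),
               acc(pred_nb,  test_data$Survived)))
results[order(-results$Accuracy), ]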