Prediction of Chances of Survival of Titanic Passengers

Overview :

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy shocked the international community and lead to better safety regulations for ships. One of the reasons that the shipwreck lead to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Objective :

The main objective of the dataset is to predict Chances of Survival based on several explanatory factors such as Pclass, Sex, Age, SibSp, Parch, etc. using Machine Learning algorithms.

The methods we intend to use are:

Binary Logistic Regression
Naive Bayes
Decision Tree
Random Forest

Description of variables:

Survived - Survival (0 = Not Survived; 1 = Survived)
Pclass - Passenger Class (1 = 1st class; 2 = 2nd class; 3 = 3rd class)
Name - Name of the Passenger
Sex - Sex of the Passenger
Sibsp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Data Pre-processing :

Loading the data and checking the structure, first five rows and dimensions of the data

library(readxl)
Titanic<-read_excel("C:/Users/intel/Downloads/Titanic.xls")
str(Titanic)

## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : num  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

head(Titanic)

## # A tibble: 6 x 12
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##         <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1           1        0      3 Brau~ male     22     1     0 A/5 2~  7.25
## 2           2        1      1 Cumi~ fema~    38     1     0 PC 17~ 71.3 
## 3           3        1      3 Heik~ fema~    26     0     0 STON/~  7.92
## 4           4        1      1 Futr~ fema~    35     1     0 113803 53.1 
## 5           5        0      3 Alle~ male     35     0     0 373450  8.05
## 6           6        0      3 Mora~ male     NA     0     0 330877  8.46
## # ... with 2 more variables: Cabin <chr>, Embarked <chr>

dim(Titanic)

## [1] 891  12

Summary :

summary(Titanic)

##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
##

Converting following variables to factor :

Survived - It is in numeric character datatype but for our analysis we want it in factor form.
Pclass - It is in numeric character datatype and as per the given description, the Pclass must be categorical.
Sex - It is in character datatype and as per the given description, the Sex must be categorical.
Embarked - It is in character datatype and as per the given description it is a port, so it must be in factor.

names<-c("Survived","Pclass","Sex","Embarked")
Titanic[,names]<-lapply(Titanic[,names], as.factor)
str(Titanic)

## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Checking whether the data contains any null calues or not :

colSums(is.na(Titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

It is clearly seen that Age, Cabin and Embarked columns has null values. We must eliminate Cabin column as it has more no. of missing values and it does not have significance in the survival rate. For Age and Embarked we can replace the value with median and mode respectively. Converting the age variable to categorical to make it into groups.

And also we’ll eliminate Name, PassengerId and Ticket columns as it doesn’t have any significance in the survival rate.

col<-c("PassengerId","Name","Ticket","Cabin")
Titanic[,col]<-list(NULL)
Titanic1<-Titanic
Titanic1$Age[is.na(Titanic1$Age)]<-28
Titanic1$Age<-cut(Titanic1$Age,breaks = c(0,20,28,40,Inf),labels = c("c1","c2","c3","c4"))
Titanic1$Embarked[is.na(Titanic1$Embarked)]<-"S" 
#scaling the numeric data
col_scale=c("SibSp","Parch","Fare")
Titanic1[,col_scale]<-lapply(Titanic1[,col_scale], scale)
colSums(is.na(Titanic1))

## Survived   Pclass      Sex      Age                            Embarked 
##        0        0        0        0        0        0        0        0

Now we are left with no missing values in the data, so we can proceed further

EDA:

Age wise distribution :

library(ggplot2)
ggplot(Titanic1,aes(x=Age)) + geom_bar(aes(fill=Survived)) +labs(x = "Age Group",y="Frequency",title = "Age Wise Distribution")

We can conclude that 45% of passengers survived were from the age group of 20 to 30.

Sex wise distribution :

ggplot(Titanic1,aes(x=Sex)) + geom_bar(aes(fill=Survived)) +labs(x = "Sex Group",y="Frequency",title = "Sex Wise Distribution")

We can conclude that majority of passengers survived were female as compared to male.

ggplot(Titanic1,aes(x=Pclass)) + geom_bar(aes(fill=Survived)) +labs(x="Passenger Class",y = "Frequency",title = "Passenger Class wise Distribution")

We can conclude that most passengers survived were from 1st class followed by 3rd class and then 2nd class

Splitting Data

library(caret)

## Loading required package: lattice

set.seed(100) # keeping split constant in every iteration
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic) # dimension of training data

## [1] 625   8

dim(test_titanic) # dimension of testing data

## [1] 266   8

Applying Machine Learning Algorithms :

Binary Logistic Regression

LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1027  -0.6974  -0.4105   0.6058   2.4715  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.27716    0.46282   7.081 1.43e-12 ***
## Pclass2     -0.82501    0.35749  -2.308  0.02101 *  
## Pclass3     -2.03914    0.36186  -5.635 1.75e-08 ***
## Sexmale     -2.57536    0.23293 -11.057  < 2e-16 ***
## Agec2       -0.99272    0.30350  -3.271  0.00107 ** 
## Agec3       -0.71172    0.33401  -2.131  0.03310 *  
## Agec4       -1.56704    0.37676  -4.159 3.19e-05 ***
## SibSp       -0.36164    0.13925  -2.597  0.00940 ** 
## Parch       -0.06057    0.12712  -0.476  0.63374    
## Fare         0.13488    0.16149   0.835  0.40360    
## EmbarkedQ    0.46524    0.44861   1.037  0.29970    
## EmbarkedS   -0.23584    0.28594  -0.825  0.40949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.49  on 624  degrees of freedom
## Residual deviance: 569.16  on 613  degrees of freedom
## AIC: 593.16
## 
## Number of Fisher Scoring iterations: 5

As we can see Parch, Fare and Embarked has no significance in survival rate so we’ll eliminate these columns and re-build the model.

col1<-c("Parch","Fare","Embarked")
Titanic1[,col1]<-list(NULL)

set.seed(100) 
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic)

## [1] 625   5

dim(test_titanic)

## [1] 266   5

LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1435  -0.6686  -0.4458   0.6377   2.4455  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.2266     0.3874   8.329  < 2e-16 ***
## Pclass2      -1.0394     0.3058  -3.399 0.000676 ***
## Pclass3      -2.1670     0.2861  -7.575 3.60e-14 ***
## Sexmale      -2.6208     0.2278 -11.506  < 2e-16 ***
## Agec2        -0.8743     0.2918  -2.997 0.002731 ** 
## Agec3        -0.6637     0.3302  -2.010 0.044407 *  
## Agec4        -1.5542     0.3732  -4.164 3.12e-05 ***
## SibSp        -0.3725     0.1281  -2.907 0.003647 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.49  on 624  degrees of freedom
## Residual deviance: 573.91  on 617  degrees of freedom
## AIC: 589.91
## 
## Number of Fisher Scoring iterations: 5

train_titanic_LR<-fitted(LR_model)
library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

pred<-prediction(train_titanic_LR,train_titanic$Survived)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))

pred_LR<-predict(LR_model,test_titanic,type="response")
pred_LR1<-ifelse(pred_LR<0.35,0,1)
pred_LR1<-as.factor(pred_LR1)
confusionMatrix(pred_LR1,test_titanic$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 128  20
##          1  36  82
##                                           
##                Accuracy : 0.7895          
##                  95% CI : (0.7355, 0.8369)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 1.118e-09       
##                                           
##                   Kappa : 0.5676          
##                                           
##  Mcnemar's Test P-Value : 0.04502         
##                                           
##             Sensitivity : 0.7805          
##             Specificity : 0.8039          
##          Pos Pred Value : 0.8649          
##          Neg Pred Value : 0.6949          
##              Prevalence : 0.6165          
##          Detection Rate : 0.4812          
##    Detection Prevalence : 0.5564          
##       Balanced Accuracy : 0.7922          
##                                           
##        'Positive' Class : 0               
##

Binary Logistic Regression gives us an accuracy of 78.95%

Naive Bayes Algorithm

Building the model on train data i.e. Training the data and finding the accuracy on test data

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.2

NB_model<-naiveBayes(Survived~.,data = train_titanic)
NB_pred<-predict(NB_model,test_titanic)
confusionMatrix(NB_pred,test_titanic$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 144  29
##          1  20  73
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7639, 0.8605)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 1.592e-12       
##                                           
##                   Kappa : 0.6038          
##                                           
##  Mcnemar's Test P-Value : 0.2531          
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.7157          
##          Pos Pred Value : 0.8324          
##          Neg Pred Value : 0.7849          
##              Prevalence : 0.6165          
##          Detection Rate : 0.5414          
##    Detection Prevalence : 0.6504          
##       Balanced Accuracy : 0.7969          
##                                           
##        'Positive' Class : 0               
##

Naive Bayes algorithm gives accuracy of 81.58%

Decision Tree

Building the model on train data i.e. Training the data

library(rpart)

## Warning: package 'rpart' was built under R version 3.6.2

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.6.2

DT_model<-rpart(Survived~.,data = train_titanic)
rpart.plot(DT_model)

Checking the Accuracy of model on test data

DT_pred<-predict(DT_model,test_titanic,type="class")
confusionMatrix(DT_pred,test_titanic$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 156  36
##          1   8  66
##                                           
##                Accuracy : 0.8346          
##                  95% CI : (0.7844, 0.8772)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 7.118e-15       
##                                           
##                   Kappa : 0.631           
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.9512          
##             Specificity : 0.6471          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.8919          
##              Prevalence : 0.6165          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 0.7218          
##       Balanced Accuracy : 0.7991          
##                                           
##        'Positive' Class : 0               
##

Decision Tree algorithm gives accuracy of 83.46%

Random forest