Overview :

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy shocked the international community and lead to better safety regulations for ships. One of the reasons that the shipwreck lead to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Objective :

The main objective of the dataset is to predict Chances of Survival based on several explanatory factors such as Pclass, Sex, Age, SibSp, Parch, etc. using Machine Learning algorithms.

The methods we intend to use are:

Description of variables:

Data Pre-processing :

Loading the data and checking the structure, first five rows and dimensions of the data

library(readxl)
Titanic<-read_excel("C:/Users/intel/Downloads/Titanic.xls")
str(Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : num  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
head(Titanic)
## # A tibble: 6 x 12
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##         <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1           1        0      3 Brau~ male     22     1     0 A/5 2~  7.25
## 2           2        1      1 Cumi~ fema~    38     1     0 PC 17~ 71.3 
## 3           3        1      3 Heik~ fema~    26     0     0 STON/~  7.92
## 4           4        1      1 Futr~ fema~    35     1     0 113803 53.1 
## 5           5        0      3 Alle~ male     35     0     0 373450  8.05
## 6           6        0      3 Mora~ male     NA     0     0 330877  8.46
## # ... with 2 more variables: Cabin <chr>, Embarked <chr>
dim(Titanic)
## [1] 891  12

Summary :

summary(Titanic)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

Converting following variables to factor :

names<-c("Survived","Pclass","Sex","Embarked")
Titanic[,names]<-lapply(Titanic[,names], as.factor)
str(Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Checking whether the data contains any null calues or not :

colSums(is.na(Titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

It is clearly seen that Age, Cabin and Embarked columns has null values. We must eliminate Cabin column as it has more no. of missing values and it does not have significance in the survival rate. For Age and Embarked we can replace the value with median and mode respectively. Converting the age variable to categorical to make it into groups.

And also we’ll eliminate Name, PassengerId and Ticket columns as it doesn’t have any significance in the survival rate.

col<-c("PassengerId","Name","Ticket","Cabin")
Titanic[,col]<-list(NULL)
Titanic1<-Titanic
Titanic1$Age[is.na(Titanic1$Age)]<-28
Titanic1$Age<-cut(Titanic1$Age,breaks = c(0,20,28,40,Inf),labels = c("c1","c2","c3","c4"))
Titanic1$Embarked[is.na(Titanic1$Embarked)]<-"S" 
#scaling the numeric data
col_scale=c("SibSp","Parch","Fare")
Titanic1[,col_scale]<-lapply(Titanic1[,col_scale], scale)
colSums(is.na(Titanic1))
## Survived   Pclass      Sex      Age                            Embarked 
##        0        0        0        0        0        0        0        0

Now we are left with no missing values in the data, so we can proceed further

EDA:

Age wise distribution :

library(ggplot2)
ggplot(Titanic1,aes(x=Age)) + geom_bar(aes(fill=Survived)) +labs(x = "Age Group",y="Frequency",title = "Age Wise Distribution")

We can conclude that 45% of passengers survived were from the age group of 20 to 30.

Sex wise distribution :

ggplot(Titanic1,aes(x=Sex)) + geom_bar(aes(fill=Survived)) +labs(x = "Sex Group",y="Frequency",title = "Sex Wise Distribution")

We can conclude that majority of passengers survived were female as compared to male.

ggplot(Titanic1,aes(x=Pclass)) + geom_bar(aes(fill=Survived)) +labs(x="Passenger Class",y = "Frequency",title = "Passenger Class wise Distribution")

We can conclude that most passengers survived were from 1st class followed by 3rd class and then 2nd class

Splitting Data

library(caret)
## Loading required package: lattice
set.seed(100) # keeping split constant in every iteration
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic) # dimension of training data 
## [1] 625   8
dim(test_titanic) # dimension of testing data
## [1] 266   8

Applying Machine Learning Algorithms :

Binary Logistic Regression

LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1027  -0.6974  -0.4105   0.6058   2.4715  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.27716    0.46282   7.081 1.43e-12 ***
## Pclass2     -0.82501    0.35749  -2.308  0.02101 *  
## Pclass3     -2.03914    0.36186  -5.635 1.75e-08 ***
## Sexmale     -2.57536    0.23293 -11.057  < 2e-16 ***
## Agec2       -0.99272    0.30350  -3.271  0.00107 ** 
## Agec3       -0.71172    0.33401  -2.131  0.03310 *  
## Agec4       -1.56704    0.37676  -4.159 3.19e-05 ***
## SibSp       -0.36164    0.13925  -2.597  0.00940 ** 
## Parch       -0.06057    0.12712  -0.476  0.63374    
## Fare         0.13488    0.16149   0.835  0.40360    
## EmbarkedQ    0.46524    0.44861   1.037  0.29970    
## EmbarkedS   -0.23584    0.28594  -0.825  0.40949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.49  on 624  degrees of freedom
## Residual deviance: 569.16  on 613  degrees of freedom
## AIC: 593.16
## 
## Number of Fisher Scoring iterations: 5

As we can see Parch, Fare and Embarked has no significance in survival rate so we’ll eliminate these columns and re-build the model.

col1<-c("Parch","Fare","Embarked")
Titanic1[,col1]<-list(NULL)

set.seed(100) 
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic) 
## [1] 625   5
dim(test_titanic) 
## [1] 266   5
LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1435  -0.6686  -0.4458   0.6377   2.4455  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.2266     0.3874   8.329  < 2e-16 ***
## Pclass2      -1.0394     0.3058  -3.399 0.000676 ***
## Pclass3      -2.1670     0.2861  -7.575 3.60e-14 ***
## Sexmale      -2.6208     0.2278 -11.506  < 2e-16 ***
## Agec2        -0.8743     0.2918  -2.997 0.002731 ** 
## Agec3        -0.6637     0.3302  -2.010 0.044407 *  
## Agec4        -1.5542     0.3732  -4.164 3.12e-05 ***
## SibSp        -0.3725     0.1281  -2.907 0.003647 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.49  on 624  degrees of freedom
## Residual deviance: 573.91  on 617  degrees of freedom
## AIC: 589.91
## 
## Number of Fisher Scoring iterations: 5
train_titanic_LR<-fitted(LR_model)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred<-prediction(train_titanic_LR,train_titanic$Survived)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))

pred_LR<-predict(LR_model,test_titanic,type="response")
pred_LR1<-ifelse(pred_LR<0.35,0,1)
pred_LR1<-as.factor(pred_LR1)
confusionMatrix(pred_LR1,test_titanic$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 128  20
##          1  36  82
##                                           
##                Accuracy : 0.7895          
##                  95% CI : (0.7355, 0.8369)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 1.118e-09       
##                                           
##                   Kappa : 0.5676          
##                                           
##  Mcnemar's Test P-Value : 0.04502         
##                                           
##             Sensitivity : 0.7805          
##             Specificity : 0.8039          
##          Pos Pred Value : 0.8649          
##          Neg Pred Value : 0.6949          
##              Prevalence : 0.6165          
##          Detection Rate : 0.4812          
##    Detection Prevalence : 0.5564          
##       Balanced Accuracy : 0.7922          
##                                           
##        'Positive' Class : 0               
## 

Binary Logistic Regression gives us an accuracy of 78.95%

Naive Bayes Algorithm

Building the model on train data i.e. Training the data and finding the accuracy on test data

library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
NB_model<-naiveBayes(Survived~.,data = train_titanic)
NB_pred<-predict(NB_model,test_titanic)
confusionMatrix(NB_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 144  29
##          1  20  73
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7639, 0.8605)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 1.592e-12       
##                                           
##                   Kappa : 0.6038          
##                                           
##  Mcnemar's Test P-Value : 0.2531          
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.7157          
##          Pos Pred Value : 0.8324          
##          Neg Pred Value : 0.7849          
##              Prevalence : 0.6165          
##          Detection Rate : 0.5414          
##    Detection Prevalence : 0.6504          
##       Balanced Accuracy : 0.7969          
##                                           
##        'Positive' Class : 0               
## 

Naive Bayes algorithm gives accuracy of 81.58%

Decision Tree

Building the model on train data i.e. Training the data

library(rpart)
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
DT_model<-rpart(Survived~.,data = train_titanic)
rpart.plot(DT_model)

Checking the Accuracy of model on test data

DT_pred<-predict(DT_model,test_titanic,type="class")
confusionMatrix(DT_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 156  36
##          1   8  66
##                                           
##                Accuracy : 0.8346          
##                  95% CI : (0.7844, 0.8772)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 7.118e-15       
##                                           
##                   Kappa : 0.631           
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.9512          
##             Specificity : 0.6471          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.8919          
##              Prevalence : 0.6165          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 0.7218          
##       Balanced Accuracy : 0.7991          
##                                           
##        'Positive' Class : 0               
## 

Decision Tree algorithm gives accuracy of 83.46%

Random forest

Building the model on train data i.e. Training the data and finding the accuracy on test data

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
RF_model<-randomForest(Survived~.,data = train_titanic)
RF_model
## 
## Call:
##  randomForest(formula = Survived ~ ., data = train_titanic) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.32%
## Confusion matrix:
##     0   1 class.error
## 0 340  45   0.1168831
## 1  82 158   0.3416667
plot(RF_model)

RF_pred<-predict(RF_model,test_titanic)
confusionMatrix(RF_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 154  37
##          1  10  65
##                                           
##                Accuracy : 0.8233          
##                  95% CI : (0.7721, 0.8672)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : 1.974e-13       
##                                           
##                   Kappa : 0.6066          
##                                           
##  Mcnemar's Test P-Value : 0.0001491       
##                                           
##             Sensitivity : 0.9390          
##             Specificity : 0.6373          
##          Pos Pred Value : 0.8063          
##          Neg Pred Value : 0.8667          
##              Prevalence : 0.6165          
##          Detection Rate : 0.5789          
##    Detection Prevalence : 0.7180          
##       Balanced Accuracy : 0.7881          
##                                           
##        'Positive' Class : 0               
## 

Random Forest gives us an accuracy of 82.33%

Conclusion :

After performing various classification algorithms and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 78% to 84%. Out of which Decision Tree gave a slightly better accuracy of 83.46%