On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew on board. The tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
The main objective is to predict who survived the sinking using machine learning algorithms.
Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = First Class; 2 = Second Class; 3 = Third Class)
Name - Name of the Passenger
Sex - Gender of the Passenger
Age - Age of Passenger
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Loading the data and checking the structure.
library(readxl)
titanic<-read_excel("Z:/Data Science data/Semester 2/Titanic.xls")
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
Summary:
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Converting the following variables to factors:
Survived: it is stored as numeric, but as the dependent variable we want it as a factor; for readability we also recode 0 as "No" and 1 as "Yes".
Pclass: as per the description, passenger class is categorical.
Sex: as per the description, sex is categorical but is stored as character data.
Embarked: it is the port where the passenger boarded the ship, so it should also be a factor.
titanic$Survived<-ifelse(titanic$Survived==0,"No","Yes")
titanic$Pclass <- factor(titanic$Pclass,
levels = c(1,2,3),
labels = c("First Class", "Second Class", "Third Class"))
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : chr [1:891] "No" "Yes" "Yes" "Yes" ...
## $ Pclass : Factor w/ 3 levels "First Class",..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
fact<-c("Survived","Pclass","Sex","Embarked")
titanic[,fact]<-lapply(titanic[,fact], as.factor)
str(titanic)
## tibble [891 x 12] (S3: tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "First Class",..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Checking the summary once again:
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 No :549 First Class :216 Length:891 female:314
## 1st Qu.:223.5 Yes:342 Second Class:184 Class :character male :577
## Median :446.0 Third Class :491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
Checking whether the data contains any missing values:
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
As we can see, the Age, Cabin and Embarked columns have missing values. We drop Cabin, since the majority of its values (687 of 891) are missing. For Age and Embarked we impute the missing values with the median (28) and the mode ("S") respectively, and we then convert Age into categorical age groups.
We also delete Name, PassengerId and Ticket, as they have no bearing on the survival rate.
# Dropping the columns that carry no predictive value
del_col<-c("PassengerId","Name","Cabin","Ticket")
titanic[,del_col]<-list(NULL)
titanic1<-titanic
# Imputing the 177 missing ages with the median age (28) and binning into groups
titanic1$Age[is.na(titanic1$Age)]<-28
titanic1$Age<-cut(titanic1$Age,breaks = c(0,20,30,40,Inf),labels = c("Teen", "Young","Adult","Old"))
# Imputing the two missing Embarked values with the mode ("S")
titanic1$Embarked[is.na(titanic1$Embarked)]<-"S"
# Scaling the numeric columns so each carries equal weight
col_sca=c("SibSp","Parch","Fare")
titanic1[,col_sca]<-lapply(titanic1[,col_sca], scale)
colSums(is.na(titanic1))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
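The imputation values used above (28 for Age and "S" for Embarked) are hard-coded; as a small sketch, they could equally be derived from the data itself, which keeps the preprocessing robust if the data changes (the variable names below are just for illustration):
# Deriving the imputation values rather than hard-coding them
# (uses the original titanic data, before Age was binned)
age_median <- median(titanic$Age, na.rm = TRUE)            # 28
embark_mode <- names(which.max(table(titanic$Embarked)))   # "S"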
As we can see, there are no missing values left, so we can proceed with some EDA.
library(ggplot2)
ggplot(titanic1,aes(x=Age)) + geom_bar(aes(fill=Survived)) +labs(x = "Age Group",y="Frequency",
title = "Age Wise Distribution")
From the above graph, we can see that around 45% of the passengers fall in the 20-30 age group (note that this group is inflated because the 177 missing ages were imputed with the median of 28).
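As a quick numeric check of that reading, the share of each age group can be computed directly; a small sketch using the binned Age column:
# Percentage of passengers in each age group
round(prop.table(table(titanic1$Age)) * 100, 1)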
ggplot(titanic1,aes(x = Sex)) + geom_bar(aes(fill = Survived)) +labs(x="Gender",y="Frequency",
title = "Gender wise Distribution")
From the above graph, we can observe that female survivors outnumber male survivors roughly two to one, and that only about 20% of the males aboard survived.
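The same claim can be verified numerically; a small sketch of the survival rate within each gender (row proportions of the gender-by-survival table):
# Row-wise proportions: survival rate within each gender
round(prop.table(table(titanic1$Sex, titanic1$Survived), margin = 1), 2)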
ggplot(titanic1,aes(x=Pclass)) + geom_bar(aes(fill=Survived)) +labs(x="Passenger Class",y = "Frequency",
title = "Passenger Class wise Distribution")
From the graph, we can observe that first-class passengers survived at a higher rate than the other classes, and that most of the non-survivors were third-class passengers.
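Similarly, the survival rate per passenger class can be computed directly; a small sketch using tapply:
# Proportion of survivors within each passenger class
tapply(titanic1$Survived == "Yes", titanic1$Pclass, mean)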
library(caret)
## Loading required package: lattice
set.seed(100) # keeping the split constant in every iteration
index<-createDataPartition(titanic1$Survived,p=0.7,list = F)
train_data<-titanic1[index,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test_data<-titanic1[-index,]
dim(train_data) # dimension of training data
## [1] 625 8
dim(test_data) # dimension of testing data
## [1] 266 8
model_lr<-glm(Survived~.,data = train_data,family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1341 -0.6823 -0.4232 0.6457 2.4750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.27317 0.46197 7.085 1.39e-12 ***
## PclassSecond Class -0.81696 0.35699 -2.288 0.02211 *
## PclassThird Class -2.05281 0.36346 -5.648 1.62e-08 ***
## Sexmale -2.56343 0.23197 -11.051 < 2e-16 ***
## AgeYoung -0.92639 0.29500 -3.140 0.00169 **
## AgeAdult -0.78566 0.35989 -2.183 0.02903 *
## AgeOld -1.57372 0.37698 -4.175 2.99e-05 ***
## SibSp -0.36193 0.13850 -2.613 0.00897 **
## Parch -0.05578 0.12703 -0.439 0.66055
## Fare 0.13387 0.16021 0.836 0.40338
## EmbarkedQ 0.44357 0.44781 0.991 0.32191
## EmbarkedS -0.23071 0.28654 -0.805 0.42073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 569.89 on 613 degrees of freedom
## AIC: 593.89
##
## Number of Fisher Scoring iterations: 5
As we can see, Parch, Fare and Embarked are not significant; in particular, survival appears to have little or no relationship with the port where a passenger boarded.
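Before dropping them, this can be double-checked with a sequential analysis-of-deviance (likelihood-ratio) test on the fitted model; a quick sketch using the model_lr object above:
# Sequential likelihood-ratio tests: terms whose deviance reduction is not
# significant are candidates for removal
anova(model_lr, test = "Chisq")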
Therefore, we delete the insignificant variables and rebuild the model.
del_col1<-c("Fare","Embarked","Parch")
titanic1[,del_col1]<-list(NULL)
# Splitting into train and test again after deleting those columns
set.seed(100) # keeping the split constant
index<-createDataPartition(titanic1$Survived,p=0.7,list = F)
train_data<-titanic1[index,]
test_data<-titanic1[-index,]
model_lr<-glm(Survived~.,data = train_data,family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1667 -0.6678 -0.4561 0.6328 2.4503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2308 0.3876 8.335 < 2e-16 ***
## PclassSecond Class -1.0329 0.3055 -3.381 0.000721 ***
## PclassThird Class -2.1834 0.2886 -7.566 3.84e-14 ***
## Sexmale -2.6108 0.2271 -11.498 < 2e-16 ***
## AgeYoung -0.8235 0.2850 -2.890 0.003854 **
## AgeAdult -0.7375 0.3552 -2.076 0.037862 *
## AgeOld -1.5637 0.3736 -4.186 2.84e-05 ***
## SibSp -0.3712 0.1275 -2.911 0.003605 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 574.38 on 617 degrees of freedom
## AIC: 590.38
##
## Number of Fisher Scoring iterations: 5
As we can see, all the remaining variables are significant, so we can proceed to evaluate the model's accuracy.
# Predicting on the training data to choose a cutoff for the test data
pred_tr_lr<-fitted(model_lr)
# Choosing the cutoff with the help of the ROC curve (ROCR package)
library(ROCR)
pred<-prediction(pred_tr_lr,train_data$Survived)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
From the plot, the cutoffs 0.35 and 0.4 give sensitivity and specificity values that are close to each other, so let us check both with the help of a confusion matrix.
pred_tr_lr1<-ifelse(pred_tr_lr<0.35,"No","Yes")
pred_tr_lr1<-as.factor(pred_tr_lr1)
confusionMatrix(pred_tr_lr1,train_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 298 49
## Yes 87 191
##
## Accuracy : 0.7824
## 95% CI : (0.748, 0.8142)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5534
##
## Mcnemar's Test P-Value : 0.00151
##
## Sensitivity : 0.7740
## Specificity : 0.7958
## Pos Pred Value : 0.8588
## Neg Pred Value : 0.6871
## Prevalence : 0.6160
## Detection Rate : 0.4768
## Detection Prevalence : 0.5552
## Balanced Accuracy : 0.7849
##
## 'Positive' Class : No
##
pred_tr_lr2<-ifelse(pred_tr_lr<0.4,"No","Yes")
pred_tr_lr2<-as.factor(pred_tr_lr2)
confusionMatrix(pred_tr_lr2,train_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 301 55
## Yes 84 185
##
## Accuracy : 0.7776
## 95% CI : (0.7429, 0.8096)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5404
##
## Mcnemar's Test P-Value : 0.01755
##
## Sensitivity : 0.7818
## Specificity : 0.7708
## Pos Pred Value : 0.8455
## Neg Pred Value : 0.6877
## Prevalence : 0.6160
## Detection Rate : 0.4816
## Detection Prevalence : 0.5696
## Balanced Accuracy : 0.7763
##
## 'Positive' Class : No
##
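Rather than reading the crossing point off the coloured ROC curve, it can also be located programmatically from the ROCR objects created above; a minimal sketch, assuming the pred object from prediction() and ROCR's "sens" and "spec" performance measures:
# Find the cutoff where sensitivity and specificity are closest to each other
sens <- performance(pred, "sens")
spec <- performance(pred, "spec")
cutoffs <- sens@x.values[[1]]
cutoffs[which.min(abs(sens@y.values[[1]] - spec@y.values[[1]]))]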
Since we want the cutoff at which sensitivity and specificity are closest to each other, we choose 0.4 and use it to predict on the test data.
pred_lr<-predict(model_lr,test_data,type="response")
pred_lr1<-ifelse(pred_lr<0.4,"No","Yes")
pred_lr1<-as.factor(pred_lr1)
confusionMatrix(pred_lr1,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 129 21
## Yes 35 81
##
## Accuracy : 0.7895
## 95% CI : (0.7355, 0.8369)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.118e-09
##
## Kappa : 0.566
##
## Mcnemar's Test P-Value : 0.08235
##
## Sensitivity : 0.7866
## Specificity : 0.7941
## Pos Pred Value : 0.8600
## Neg Pred Value : 0.6983
## Prevalence : 0.6165
## Detection Rate : 0.4850
## Detection Prevalence : 0.5639
## Balanced Accuracy : 0.7904
##
## 'Positive' Class : No
##
The logistic regression model gives an accuracy of 0.7895, or 78.95%.
Building a decision tree model and plotting the tree.
library(rpart)
library(rpart.plot)
model_tr<-rpart(Survived~.,data = train_data)
rpart.plot(model_tr)
Checking the accuracy of the model:
pred_tr<-predict(model_tr,test_data,type="class")
confusionMatrix(pred_tr,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 154 36
## Yes 10 66
##
## Accuracy : 0.8271
## 95% CI : (0.7762, 0.8705)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 6.692e-14
##
## Kappa : 0.6158
##
## Mcnemar's Test P-Value : 0.0002278
##
## Sensitivity : 0.9390
## Specificity : 0.6471
## Pos Pred Value : 0.8105
## Neg Pred Value : 0.8684
## Prevalence : 0.6165
## Detection Rate : 0.5789
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.7930
##
## 'Positive' Class : No
##
The decision tree model gives an accuracy of 0.8271, or 82.71%.
Repeating the same steps for a random forest: building the model and predicting on the test data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
model_rf<-randomForest(Survived~.,data = train_data)
model_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = train_data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.16%
## Confusion matrix:
## No Yes class.error
## No 338 47 0.1220779
## Yes 79 161 0.3291667
plot(model_rf)
pred_rf<-predict(model_rf,test_data)
confusionMatrix(pred_rf,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 152 37
## Yes 12 65
##
## Accuracy : 0.8158
## 95% CI : (0.7639, 0.8605)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.592e-12
##
## Kappa : 0.5915
##
## Mcnemar's Test P-Value : 0.0006068
##
## Sensitivity : 0.9268
## Specificity : 0.6373
## Pos Pred Value : 0.8042
## Neg Pred Value : 0.8442
## Prevalence : 0.6165
## Detection Rate : 0.5714
## Detection Prevalence : 0.7105
## Balanced Accuracy : 0.7820
##
## 'Positive' Class : No
##
The random forest model gives an accuracy of 0.8158, or 81.58%.
Repeating the same steps for a support vector machine: building the model and predicting on the test data.
library(e1071)
model_svm<-svm(Survived~.,data = train_data)
pred_svm<-predict(model_svm,test_data)
confusionMatrix(pred_svm,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 147 32
## Yes 17 70
##
## Accuracy : 0.8158
## 95% CI : (0.7639, 0.8605)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.592e-12
##
## Kappa : 0.5993
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.8963
## Specificity : 0.6863
## Pos Pred Value : 0.8212
## Neg Pred Value : 0.8046
## Prevalence : 0.6165
## Detection Rate : 0.5526
## Detection Prevalence : 0.6729
## Balanced Accuracy : 0.7913
##
## 'Positive' Class : No
##
The support vector machine model gives an accuracy of 0.8158, or 81.58%.
Repeating the same steps for Naive Bayes: building the model and predicting on the test data.
model_nb<-naiveBayes(Survived~.,data = train_data)
pred_nb<-predict(model_nb,test_data)
confusionMatrix(pred_nb,test_data$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 146 29
## Yes 18 73
##
## Accuracy : 0.8233
## 95% CI : (0.7721, 0.8672)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.974e-13
##
## Kappa : 0.6185
##
## Mcnemar's Test P-Value : 0.1447
##
## Sensitivity : 0.8902
## Specificity : 0.7157
## Pos Pred Value : 0.8343
## Neg Pred Value : 0.8022
## Prevalence : 0.6165
## Detection Rate : 0.5489
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.8030
##
## 'Positive' Class : No
##
The Naive Bayes model gives an accuracy of 0.8233, or 82.33%.
After performing these classification techniques and comparing their accuracies, we can conclude that all the models achieved accuracies ranging from roughly 78% to 83%. The decision tree gave the highest accuracy at 82.71%, followed closely by Naive Bayes at 82.33%, while logistic regression trailed at 78.95%.
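To summarise, the test-set accuracies can be pulled into a single comparison table; a small sketch re-using the prediction objects created above (the acc helper is defined here just for this purpose):
# Collecting the test-set accuracies of all five models into one table
acc <- function(pred, ref) mean(pred == ref)
results <- data.frame(
  Model    = c("Logistic Regression", "Decision Tree", "Random Forest",
               "Support Vector Machine", "Naive Bayes"),
  Accuracy = c(acc(pred_lr1, test_data$Survived),
               acc(pred_tr,  test_data$Survived),
               acc(pred_rf,  test_data$Survived),
               acc(pred_svm, test_data$Survived),
               acc(pred_nb,  test_data$Survived)))
results[order(-results$Accuracy), ]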