On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy shocked the international community and lead to better safety regulations for ships. One of the reasons that the shipwreck lead to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
The main objective of the dataset is to predict Chances of Survival based on several explanatory factors such as Pclass, Sex, Age, SibSp, Parch, etc. using Machine Learning algorithms.
The methods we intend to use are:
Loading the data and checking the structure, first five rows and dimensions of the data
library(readxl)
Titanic<-read_excel("C:/Users/intel/Downloads/Titanic.xls")
str(Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(Titanic)
## # A tibble: 6 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 0 3 Brau~ male 22 1 0 A/5 2~ 7.25
## 2 2 1 1 Cumi~ fema~ 38 1 0 PC 17~ 71.3
## 3 3 1 3 Heik~ fema~ 26 0 0 STON/~ 7.92
## 4 4 1 1 Futr~ fema~ 35 1 0 113803 53.1
## 5 5 0 3 Alle~ male 35 0 0 373450 8.05
## 6 6 0 3 Mora~ male NA 0 0 330877 8.46
## # ... with 2 more variables: Cabin <chr>, Embarked <chr>
dim(Titanic)
## [1] 891 12
Summary :
summary(Titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Converting following variables to factor :
Survived - It is in numeric character datatype but for our analysis we want it in factor form.
Pclass - It is in numeric character datatype and as per the given description, the Pclass must be categorical.
Sex - It is in character datatype and as per the given description, the Sex must be categorical.
Embarked - It is in character datatype and as per the given description it is a port, so it must be in factor.
names<-c("Survived","Pclass","Sex","Embarked")
Titanic[,names]<-lapply(Titanic[,names], as.factor)
str(Titanic)
## Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Checking whether the data contains any null calues or not :
colSums(is.na(Titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
It is clearly seen that Age, Cabin and Embarked columns has null values. We must eliminate Cabin column as it has more no. of missing values and it does not have significance in the survival rate. For Age and Embarked we can replace the value with median and mode respectively. Converting the age variable to categorical to make it into groups.
And also we’ll eliminate Name, PassengerId and Ticket columns as it doesn’t have any significance in the survival rate.
col<-c("PassengerId","Name","Ticket","Cabin")
Titanic[,col]<-list(NULL)
Titanic1<-Titanic
Titanic1$Age[is.na(Titanic1$Age)]<-28
Titanic1$Age<-cut(Titanic1$Age,breaks = c(0,20,28,40,Inf),labels = c("c1","c2","c3","c4"))
Titanic1$Embarked[is.na(Titanic1$Embarked)]<-"S"
#scaling the numeric data
col_scale=c("SibSp","Parch","Fare")
Titanic1[,col_scale]<-lapply(Titanic1[,col_scale], scale)
colSums(is.na(Titanic1))
## Survived Pclass Sex Age Embarked
## 0 0 0 0 0 0 0 0
Now we are left with no missing values in the data, so we can proceed further
Age wise distribution :
library(ggplot2)
ggplot(Titanic1,aes(x=Age)) + geom_bar(aes(fill=Survived)) +labs(x = "Age Group",y="Frequency",title = "Age Wise Distribution")
We can conclude that 45% of passengers survived were from the age group of 20 to 30.
Sex wise distribution :
ggplot(Titanic1,aes(x=Sex)) + geom_bar(aes(fill=Survived)) +labs(x = "Sex Group",y="Frequency",title = "Sex Wise Distribution")
We can conclude that majority of passengers survived were female as compared to male.
ggplot(Titanic1,aes(x=Pclass)) + geom_bar(aes(fill=Survived)) +labs(x="Passenger Class",y = "Frequency",title = "Passenger Class wise Distribution")
We can conclude that most passengers survived were from 1st class followed by 3rd class and then 2nd class
library(caret)
## Loading required package: lattice
set.seed(100) # keeping split constant in every iteration
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic) # dimension of training data
## [1] 625 8
dim(test_titanic) # dimension of testing data
## [1] 266 8
LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1027 -0.6974 -0.4105 0.6058 2.4715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.27716 0.46282 7.081 1.43e-12 ***
## Pclass2 -0.82501 0.35749 -2.308 0.02101 *
## Pclass3 -2.03914 0.36186 -5.635 1.75e-08 ***
## Sexmale -2.57536 0.23293 -11.057 < 2e-16 ***
## Agec2 -0.99272 0.30350 -3.271 0.00107 **
## Agec3 -0.71172 0.33401 -2.131 0.03310 *
## Agec4 -1.56704 0.37676 -4.159 3.19e-05 ***
## SibSp -0.36164 0.13925 -2.597 0.00940 **
## Parch -0.06057 0.12712 -0.476 0.63374
## Fare 0.13488 0.16149 0.835 0.40360
## EmbarkedQ 0.46524 0.44861 1.037 0.29970
## EmbarkedS -0.23584 0.28594 -0.825 0.40949
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 569.16 on 613 degrees of freedom
## AIC: 593.16
##
## Number of Fisher Scoring iterations: 5
As we can see Parch, Fare and Embarked has no significance in survival rate so we’ll eliminate these columns and re-build the model.
col1<-c("Parch","Fare","Embarked")
Titanic1[,col1]<-list(NULL)
set.seed(100)
index<-createDataPartition(Titanic1$Survived,p=0.7,list = F)
train_titanic<-Titanic1[index,]
test_titanic<-Titanic1[-index,]
dim(train_titanic)
## [1] 625 5
dim(test_titanic)
## [1] 266 5
LR_model<-glm(Survived~.,data = train_titanic,family = "binomial")
summary(LR_model)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1435 -0.6686 -0.4458 0.6377 2.4455
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2266 0.3874 8.329 < 2e-16 ***
## Pclass2 -1.0394 0.3058 -3.399 0.000676 ***
## Pclass3 -2.1670 0.2861 -7.575 3.60e-14 ***
## Sexmale -2.6208 0.2278 -11.506 < 2e-16 ***
## Agec2 -0.8743 0.2918 -2.997 0.002731 **
## Agec3 -0.6637 0.3302 -2.010 0.044407 *
## Agec4 -1.5542 0.3732 -4.164 3.12e-05 ***
## SibSp -0.3725 0.1281 -2.907 0.003647 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 573.91 on 617 degrees of freedom
## AIC: 589.91
##
## Number of Fisher Scoring iterations: 5
train_titanic_LR<-fitted(LR_model)
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train_titanic_LR,train_titanic$Survived)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
pred_LR<-predict(LR_model,test_titanic,type="response")
pred_LR1<-ifelse(pred_LR<0.35,0,1)
pred_LR1<-as.factor(pred_LR1)
confusionMatrix(pred_LR1,test_titanic$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 128 20
## 1 36 82
##
## Accuracy : 0.7895
## 95% CI : (0.7355, 0.8369)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.118e-09
##
## Kappa : 0.5676
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.7805
## Specificity : 0.8039
## Pos Pred Value : 0.8649
## Neg Pred Value : 0.6949
## Prevalence : 0.6165
## Detection Rate : 0.4812
## Detection Prevalence : 0.5564
## Balanced Accuracy : 0.7922
##
## 'Positive' Class : 0
##
Binary Logistic Regression gives us an accuracy of 78.95%
Building the model on train data i.e. Training the data and finding the accuracy on test data
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
NB_model<-naiveBayes(Survived~.,data = train_titanic)
NB_pred<-predict(NB_model,test_titanic)
confusionMatrix(NB_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 144 29
## 1 20 73
##
## Accuracy : 0.8158
## 95% CI : (0.7639, 0.8605)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.592e-12
##
## Kappa : 0.6038
##
## Mcnemar's Test P-Value : 0.2531
##
## Sensitivity : 0.8780
## Specificity : 0.7157
## Pos Pred Value : 0.8324
## Neg Pred Value : 0.7849
## Prevalence : 0.6165
## Detection Rate : 0.5414
## Detection Prevalence : 0.6504
## Balanced Accuracy : 0.7969
##
## 'Positive' Class : 0
##
Naive Bayes algorithm gives accuracy of 81.58%
Building the model on train data i.e. Training the data
library(rpart)
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
DT_model<-rpart(Survived~.,data = train_titanic)
rpart.plot(DT_model)
Checking the Accuracy of model on test data
DT_pred<-predict(DT_model,test_titanic,type="class")
confusionMatrix(DT_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 156 36
## 1 8 66
##
## Accuracy : 0.8346
## 95% CI : (0.7844, 0.8772)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 7.118e-15
##
## Kappa : 0.631
##
## Mcnemar's Test P-Value : 4.693e-05
##
## Sensitivity : 0.9512
## Specificity : 0.6471
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.8919
## Prevalence : 0.6165
## Detection Rate : 0.5865
## Detection Prevalence : 0.7218
## Balanced Accuracy : 0.7991
##
## 'Positive' Class : 0
##
Decision Tree algorithm gives accuracy of 83.46%
Building the model on train data i.e. Training the data and finding the accuracy on test data
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
RF_model<-randomForest(Survived~.,data = train_titanic)
RF_model
##
## Call:
## randomForest(formula = Survived ~ ., data = train_titanic)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.32%
## Confusion matrix:
## 0 1 class.error
## 0 340 45 0.1168831
## 1 82 158 0.3416667
plot(RF_model)
RF_pred<-predict(RF_model,test_titanic)
confusionMatrix(RF_pred,test_titanic$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 154 37
## 1 10 65
##
## Accuracy : 0.8233
## 95% CI : (0.7721, 0.8672)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.974e-13
##
## Kappa : 0.6066
##
## Mcnemar's Test P-Value : 0.0001491
##
## Sensitivity : 0.9390
## Specificity : 0.6373
## Pos Pred Value : 0.8063
## Neg Pred Value : 0.8667
## Prevalence : 0.6165
## Detection Rate : 0.5789
## Detection Prevalence : 0.7180
## Balanced Accuracy : 0.7881
##
## 'Positive' Class : 0
##
Random Forest gives us an accuracy of 82.33%
After performing various classification algorithms and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 78% to 84%. Out of which Decision Tree gave a slightly better accuracy of 83.46%