The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
This analysis attempts to predict the probability of survival for the Titanic passengers. To do this, I will use the features available about each passenger, train an algorithm on a subset of the data, and then run that algorithm on the rest of the data set to generate predictions.
library(ggplot2)
library(dplyr)
library(GGally)
library(randomForest)
library(rpart)
library(rpart.plot)
train <- read.csv('train.csv',stringsAsFactors = FALSE)
test <- read.csv('test.csv',stringsAsFactors = FALSE)
full <- bind_rows(train,test)
#Number of rows in Training dataset
train_length = dim(train)[1]
train_length
## [1] 891
# Let's see the structure of the data
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
# Check for the missing values
colSums(is.na(full))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
colSums(full == "")
## PassengerId Survived Pclass Name Sex Age
## 0 NA 0 0 0 NA
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 NA 1014 2
# There are many missing values in the Age column, most Cabin values are empty,
# Fare has one NA, and Embarked has 2 blank values
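# (Optional sketch) Before filling them in, we could inspect the two rows with a
# blank Embarked value, e.g. their class and fare, to judge which port is most plausible:
full[full$Embarked == "", c("PassengerId", "Pclass", "Fare")]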
# Let's fill the two blank Embarked values with "C"
full$Embarked[full$Embarked == ""] <- "C"
# Let's see how many unique values each column has, to decide which can be treated as factors
apply(full,2,function(x)length(unique(x)))
## PassengerId Survived Pclass Name Sex Age
## 1309 3 3 1307 2 99
## SibSp Parch Ticket Fare Cabin Embarked
## 7 8 929 282 187 3
# Let's convert Survived, Pclass, Sex and Embarked to factors
cols <- c("Survived","Pclass","Sex","Embarked")
for(feature in cols){
full[,feature] <- as.factor(full[,feature])
}
# Updated structure of the full dataset
str(full)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Let’s look at the relationships between the different features and survival.
# Sex vs Survival Count
ggplot(data = full[1:train_length,] , aes(x=Sex , fill=Survived)) + geom_bar()
It looks like the survival rate for females is much higher than for males.
# Embarked vs Survival Count
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar()
#C = Cherbourg, Q = Queenstown, S = Southampton
The largest number of survivors boarded at S (Southampton), followed by C (Cherbourg) and Q (Queenstown).
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar(position="fill")
But in terms of the survival rate, passengers who boarded at C (Cherbourg) fared better than those who boarded at Q or S.
# Let's calculate the survival percentages for passengers who boarded at ports C, Q and S
t <- table(full[1:train_length, c("Embarked", "Survived")])
prob_t = t
for(i in 1:dim(t)[1]){
prob_t[i,] <- (prob_t[i,]/rowSums(t)[i])*100
}
prob_t
## Survived
## Embarked 0 1
## C 44.11765 55.88235
## Q 61.03896 38.96104
## S 66.30435 33.69565
A passenger who embarked at C had a better chance of survival (about 56%) than one who embarked at Q (about 39%) or S (about 34%).
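The same percentages can be computed more concisely with prop.table, which normalises each row of the table (an equivalent sketch to the loop above):
# Row-wise survival percentages by port of embarkation
round(prop.table(table(Embarked = full$Embarked[1:train_length],
                       Survived = full$Survived[1:train_length]), margin = 1) * 100, 2)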
#Pclass vs Survival
ggplot(data = full[1:train_length,] , aes(x=Pclass, fill=Survived)) + geom_bar()
More 1st-class passengers survived than 3rd-class, and more 3rd-class than 2nd-class. But if we look at the survival rate within each class:
ggplot(data = full[1:train_length,] , aes(x=Pclass, fill=Survived)) + geom_bar(position = "fill")
The survival rate is highest in 1st class, followed by 2nd and then 3rd.
# Embarked vs Pclass vs Survived vs Sex
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar(position = "fill") + facet_grid(Sex~Pclass)
We can see that almost all female passengers in 1st class survived, regardless of the port of embarkation, followed closely by 2nd class, with a noticeably lower rate in 3rd class.
#SibSp vs survival
ggplot(data = full[1:train_length,] , aes(x=SibSp, fill=Survived)) + geom_bar()
ggplot(data = full[1:train_length,] , aes(x=Parch, fill=Survived)) + geom_bar()
The two plots look similar, so let’s combine SibSp and Parch into a single feature, family size.
# Let's create a familySize column: SibSp + Parch + 1 (the passenger themselves)
full$familySize <- full$SibSp+full$Parch+1
ggplot(data=full[1:train_length,] , aes(x=familySize , fill=Survived))+geom_histogram(binwidth = 1)
The plot shows that passengers with a family size of 4 or less survived at a higher rate than those in larger families.
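To check this numerically, the survival rate of the two family-size groups can be computed with dplyr, which is already loaded (a quick sketch; the 4-or-less split simply mirrors the observation above):
# Survival rate for small (<= 4) vs large (> 4) families, training rows only
full[1:train_length, ] %>%
  mutate(sizeGroup = ifelse(familySize <= 4, "small (1-4)", "large (5+)")) %>%
  group_by(sizeGroup) %>%
  summarise(survival_rate = mean(Survived == "1"), n = n())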
#Age vs Survival
ggplot(data=full[1:train_length,] , aes(x=Age , fill=Survived))+geom_histogram(binwidth = 5)
## Warning: Removed 177 rows containing non-finite values (stat_bin).
ggplot(data=full[1:train_length,] , aes(x=Age , fill=Survived))+geom_histogram(binwidth = 3, position = "fill")
## Warning: Removed 177 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Passengers younger than about 10, and the very oldest passengers (around 80), survived at higher rates than the other age groups.
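A rough way to quantify this is the survival rate within 10-year age bands on the training rows (a sketch; passengers with a missing Age are dropped here, since Age has not been imputed yet):
# Survival rate by age band (training rows only)
round(tapply(full$Survived[1:train_length] == "1",
             cut(full$Age[1:train_length], breaks = seq(0, 90, by = 10)),
             mean), 2)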
#Fare vs Survival
# For continuous variables we use a histogram; for categorical/factor variables a bar plot is easier to read
ggplot(data = full[1:train_length,] , aes(x=Fare, fill=Survived)) + geom_histogram(binwidth = 20 , position = "fill")
## Warning: Removed 28 rows containing missing values (geom_bar).
Passengers who paid the highest fares almost all survived.
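As a quick check, we can compare the survival rate above and below an arbitrary fare cutoff, say 100 (an illustrative sketch; the cutoff is arbitrary):
# Survival rate for fares above vs below 100, training rows only
tapply(full$Survived[1:train_length] == "1", full$Fare[1:train_length] > 100, mean)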
# Let's impute the missing Age values with the mean age
full$Age[is.na(full$Age)] <- mean(full$Age, na.rm = TRUE)
sum(is.na(full$Age))
## [1] 0
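A common alternative to a single global mean, shown here only as a sketch, is to impute Age with the median age within each passenger class; this would replace the mean imputation above rather than follow it:
# Alternative sketch: class-wise median imputation (replaces the mean imputation above)
full$Age <- ifelse(is.na(full$Age),
                   ave(full$Age, full$Pclass, FUN = function(x) median(x, na.rm = TRUE)),
                   full$Age)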
# Title of the passenger vs survival
# Extract the title from Name: drop everything up to ", " and everything from "." onward
full$Title <- gsub('(.*, )|(\\..*)','',full$Name)
unique(full$Title)
## [1] "Mr" "Mrs" "Miss" "Master" "Don"
## [6] "Rev" "Dr" "Mme" "Ms" "Major"
## [11] "Lady" "Sir" "Mlle" "Col" "Capt"
## [16] "the Countess" "Jonkheer" "Dona"
# Let's collapse the 18 titles into 5 categories
full$Title[full$Title=='Mlle']<- 'Miss'
full$Title[full$Title == 'Ms']<- 'Miss'
full$Title[full$Title == 'Mme']<- 'Mrs'
full$Title[full$Title == 'Lady']<- 'Miss'
full$Title[full$Title == 'Dona']<- 'Miss'
officer<- c('Capt' ,'Col','Don','Dr','Jonkheer','Major','Rev','Sir','the Countess')
full$Title[full$Title %in% officer]<-'Officer'
full$Title <- as.factor(full$Title)
ggplot(data = full[1:train_length,] , aes(x=Title, fill=Survived)) + geom_bar(position = "fill")
The survival rate is highest for passengers with the titles Miss and Mrs.
Here we will use only the features that help us predict passenger survival: Pclass, Sex, Age, Fare, SibSp, Parch and Title, with Survived as the outcome. We then split the training set into training_data and validation_data to estimate the error rate. Let’s select these features first.
Let’s train a model using logistic regression first:
train_important_features <- full[1:train_length,c("Survived","Pclass","Sex","Age","Fare","SibSp","Parch","Title")]
library(caret)
## Loading required package: lattice
inTrain <- createDataPartition(y=train_important_features$Survived , p=0.7,list=FALSE)
training_data <- train_important_features[inTrain,]
validation_data <- train_important_features[-inTrain,]
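Note that createDataPartition (and later randomForest) involve randomness, so the exact figures below will differ from run to run. For reproducible results one could fix the seed just before this partitioning step, for example:
# Optional: set a seed before createDataPartition for reproducibility
# (42 is an arbitrary choice; no seed was set in the run shown here)
set.seed(42)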
model_logistic <- glm(Survived ~ . , data = training_data , family = binomial(link="logit"))
summary(model_logistic)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"),
## data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4446 -0.4842 -0.3532 0.4757 2.5061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.808143 535.412099 0.033 0.973467
## Pclass2 -1.304577 0.395957 -3.295 0.000985 ***
## Pclass3 -2.416807 0.383278 -6.306 2.87e-10 ***
## Sexmale -13.325481 535.411670 -0.025 0.980144
## Age -0.034848 0.011556 -3.015 0.002566 **
## Fare 0.004809 0.003204 1.501 0.133340
## SibSp -0.631258 0.152413 -4.142 3.45e-05 ***
## Parch -0.462827 0.175906 -2.631 0.008511 **
## TitleMiss -13.471964 535.412063 -0.025 0.979926
## TitleMr -3.696676 0.688192 -5.372 7.81e-08 ***
## TitleMrs -13.172408 535.412207 -0.025 0.980372
## TitleOfficer -3.508043 1.011738 -3.467 0.000526 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 470.69 on 613 degrees of freedom
## AIC: 494.69
##
## Number of Fisher Scoring iterations: 12
# Let's see the predictions on validation_data
# type = "response" returns predicted probabilities (the default is the log-odds/link scale)
pred_validation_logistic <- predict(model_logistic, validation_data, type = "response")
# Convert the probabilities to class labels: 1 for survived, 0 for not survived
pred_validation_logistic <- ifelse(pred_validation_logistic > 0.5, 1, 0)
cf <- confusionMatrix(as.factor(pred_validation_logistic), validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.857143e-01 5.266313e-01 7.314662e-01 8.334733e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 2.602819e-09 3.568628e-03
The accuracy of the logistic model on the validation data is about 78.6%.
This accuracy is reasonable, so let’s run the model on the test set.
testing_data <- full[892 : 1309 , c("Pclass","Sex","Age","SibSp","Parch","Fare","Title")]
pred_testing_logistic <- predict(model_logistic, testing_data, type = "response")
pred_testing_logistic <- ifelse(pred_testing_logistic > 0.5, 1, 0)
prediction_logistic <- data.frame(test$PassengerId , pred_testing_logistic)
gender_submission <- read.csv('gender_submission.csv',stringsAsFactors = FALSE)
names(prediction_logistic) <- names(gender_submission)
# Comparing our predictions with Kaggle's gender_submission baseline
# (it assumes every female survives; it is a sample submission, not the true labels)
table(ifelse(gender_submission$Survived == prediction_logistic$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 34 383
# Check for NA predictions (caused by the one missing Fare in the test set) and replace them with 0
prediction_logistic$Survived[is.na(prediction_logistic$Survived)]<-0
write.csv(prediction_logistic,file="prediction_logistic.csv",row.names = F)
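The NA predictions come from the single missing Fare in the test portion of the data. An alternative to zero-filling the predictions, sketched here, is to impute that Fare (for example with the overall median) before testing_data is built:
# Alternative sketch: impute the one missing Fare with the overall median
# (this would need to run before constructing testing_data above)
full$Fare[is.na(full$Fare)] <- median(full$Fare, na.rm = TRUE)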
Prediction using Decision Tree model:
model_decisiontree <- rpart(Survived~., data= training_data , method = "class")
rpart.plot(model_decisiontree)
pred.validation_decisiontree <- predict(model_decisiontree, validation_data, type = "prob")  # predict.rpart uses 'type', not 'method'; "prob" returns class probabilities
pred.validation_decisiontree_Survival = ifelse(pred.validation_decisiontree[,2]>0.5,1,0)
cf <- confusionMatrix(as.factor(pred.validation_decisiontree_Survival), validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.969925e-01 5.625533e-01 7.435748e-01 8.436587e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 1.927747e-10 2.206714e-01
The accuracy of the decision tree model is about 79.7%, slightly better than the logistic regression model on the validation data.
# Let's run this model on the test set
pred_testing_decisiontree <- predict(model_decisiontree, testing_data)
pred.testing_decisiontree_Survival = ifelse(pred_testing_decisiontree[,2]>0.5,1,0)
prediction_decisiontree <- data.frame(test$PassengerId , pred.testing_decisiontree_Survival)
names(prediction_decisiontree) <- names(gender_submission)
# Comparing our predictions with the gender_submission baseline
table(ifelse(gender_submission$Survived == prediction_decisiontree$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 22 396
prediction_decisiontree$Survived[is.na(prediction_decisiontree$Survived)]<-0
write.csv(prediction_decisiontree,file="prediction_decisiontree.csv",row.names = F)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
fancyRpartPlot(model_decisiontree)
Let’s predict survival using the random forest method:
model_randomforest <- randomForest(Survived~. , data= training_data)
# Error rate as trees are added to the forest
plot(model_randomforest)
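It is also worth looking at which features the forest relies on most, using randomForest's built-in variable importance plot:
# Variable importance of the fitted forest (mean decrease in Gini)
varImpPlot(model_randomforest)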
pred.validation_randomforest = predict(model_randomforest,validation_data)
cf = confusionMatrix(pred.validation_randomforest , validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.233083e-01 6.185391e-01 7.720582e-01 8.671900e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 1.973669e-13 1.446615e-01
The accuracy is about 82.3% on the validation data, higher than both the logistic regression and decision tree models in this run.
# Let's predict on the test data with the random forest model
pred.testing_randomforest <- predict(model_randomforest , testing_data)
prediction_randomforest <- data.frame(test$PassengerId , ifelse(pred.testing_randomforest==0,0,1))
names(prediction_randomforest) <- names(gender_submission)
prediction_randomforest$Survived[is.na(prediction_randomforest$Survived)]<-0
write.csv(prediction_randomforest,file="prediction_randomforest.csv",row.names = F)
table(ifelse(gender_submission$Survived == prediction_randomforest$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 48 370
Based on the validation confusion matrices, the random forest gives the highest accuracy in this run, followed by the decision tree and then logistic regression (the exact figures vary between runs because no random seed was set). Let’s upload the test-set predictions from all three models to Kaggle to find out which is best suited to the Titanic data.
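For a compact side-by-side comparison, the three validation accuracies can be recomputed directly from the predictions created above (a small summary sketch):
# Validation accuracy of the three models, from the objects above
data.frame(
  model = c("logistic regression", "decision tree", "random forest"),
  validation_accuracy = c(
    mean(pred_validation_logistic == validation_data$Survived),
    mean(pred.validation_decisiontree_Survival == validation_data$Survived),
    mean(pred.validation_randomforest == validation_data$Survived)
  )
)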