The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
This analysis attempts to predict the probability of survival for the Titanic passengers. To do this, I will use the features available about each passenger, train an algorithm on a subset of the data, and then run that algorithm on the rest of the data set to generate predictions.
library(ggplot2)
library(dplyr)
library(GGally)
library(randomForest)
library(rpart)
library(rpart.plot)
train <- read.csv('train.csv',stringsAsFactors = FALSE)
test <- read.csv('test.csv',stringsAsFactors = FALSE)
full <- bind_rows(train,test)
#Number of rows in Training dataset
train_length = dim(train)[1]
train_length
## [1] 891
# Let's see the structure of the data
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
# Check for the missing values
colSums(is.na(full))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
colSums(full == "")
## PassengerId Survived Pclass Name Sex Age
## 0 NA 0 0 0 NA
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 NA 1014 2
# There are many missing values in the Age column, most Cabin values are empty,
# Fare has one NA, and Embarked has 2 blank values
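# (Optional sketch) Before filling them in, we could inspect the two rows with a
# blank Embarked value, e.g. their class and fare, to judge which port is most plausible:
full[full$Embarked == "", c("PassengerId", "Pclass", "Fare")]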
# Let's fill the two blank Embarked values with "C"
full$Embarked[full$Embarked == ""] <- "C"
# Let's see how many unique values each column has, to decide which can be treated as factors
apply(full,2,function(x)length(unique(x)))
## PassengerId Survived Pclass Name Sex Age
## 1309 3 3 1307 2 99
## SibSp Parch Ticket Fare Cabin Embarked
## 7 8 929 282 187 3
# Let's convert Survived, Pclass, Sex and Embarked to factors
cols <- c("Survived","Pclass","Sex","Embarked")
for(feature in cols){
full[,feature] <- as.factor(full[,feature])
}
# Updated structure of the full dataset
str(full)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Let’s look at the relationships between the different features and survival.
# Sex vs Survival Count
ggplot(data = full[1:train_length,] , aes(x=Sex , fill=Survived)) + geom_bar()
It looks like the survival rate for females is much higher than for males.
# Embarked vs Survival Count
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar()
#C = Cherbourg, Q = Queenstown, S = Southampton
The largest number of survivors boarded at S (Southampton), followed by C (Cherbourg) and Q (Queenstown).
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar(position="fill")
But in terms of the survival rate, passengers who boarded at C (Cherbourg) fared better than those who boarded at Q or S.
# Let's calculate the survival percentages for passengers who boarded at ports C, Q and S
t <- table(full[1:train_length, c("Embarked", "Survived")])
prob_t = t
for(i in 1:dim(t)[1]){
prob_t[i,] <- (prob_t[i,]/rowSums(t)[i])*100
}
prob_t
## Survived
## Embarked 0 1
## C 44.11765 55.88235
## Q 61.03896 38.96104
## S 66.30435 33.69565
A passenger who embarked at C had a better chance of survival (about 56%) than one who embarked at Q (about 39%) or S (about 34%).
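The same percentages can be computed more concisely with prop.table, which normalises each row of the table (an equivalent sketch to the loop above):
# Row-wise survival percentages by port of embarkation
round(prop.table(table(Embarked = full$Embarked[1:train_length],
                       Survived = full$Survived[1:train_length]), margin = 1) * 100, 2)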
#Pclass vs Survival
ggplot(data = full[1:train_length,] , aes(x=Pclass, fill=Survived)) + geom_bar()
More 1st-class passengers survived than 3rd-class, and more 3rd-class than 2nd-class. But if we look at the survival rate within each class:
ggplot(data = full[1:train_length,] , aes(x=Pclass, fill=Survived)) + geom_bar(position = "fill")
The survival rate is highest in 1st class, followed by 2nd and then 3rd.
# Embarked vs Pclass vs Survived vs Sex
ggplot(data = full[1:train_length,] , aes(x=Embarked, fill=Survived)) + geom_bar(position = "fill") + facet_grid(Sex~Pclass)
We can see that almost all female passengers in 1st class survived, regardless of the port of embarkation, followed closely by 2nd class, with a noticeably lower rate in 3rd class.
#SibSp vs survival
ggplot(data = full[1:train_length,] , aes(x=SibSp, fill=Survived)) + geom_bar()
ggplot(data = full[1:train_length,] , aes(x=Parch, fill=Survived)) + geom_bar()
The two plots look similar, so let’s combine SibSp and Parch into a single feature, family size.
# Let's create a familySize column: SibSp + Parch + 1 (the passenger themselves)
full$familySize <- full$SibSp+full$Parch+1
ggplot(data=full[1:train_length,] , aes(x=familySize , fill=Survived))+geom_histogram(binwidth = 1)
The plot shows that passengers with a family size of 4 or less survived at a higher rate than those in larger families.
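To check this numerically, the survival rate of the two family-size groups can be computed with dplyr, which is already loaded (a quick sketch; the 4-or-less split simply mirrors the observation above):
# Survival rate for small (<= 4) vs large (> 4) families, training rows only
full[1:train_length, ] %>%
  mutate(sizeGroup = ifelse(familySize <= 4, "small (1-4)", "large (5+)")) %>%
  group_by(sizeGroup) %>%
  summarise(survival_rate = mean(Survived == "1"), n = n())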
#Age vs Survival
ggplot(data=full[1:train_length,] , aes(x=Age , fill=Survived))+geom_histogram(binwidth = 5)
## Warning: Removed 177 rows containing non-finite values (stat_bin).
ggplot(data=full[1:train_length,] , aes(x=Age , fill=Survived))+geom_histogram(binwidth = 3, position = "fill")
## Warning: Removed 177 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Passengers younger than about 10, and the very oldest passengers (around 80), survived at higher rates than the other age groups.
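A rough way to quantify this is the survival rate within 10-year age bands on the training rows (a sketch; passengers with a missing Age are dropped here, since Age has not been imputed yet):
# Survival rate by age band (training rows only)
round(tapply(full$Survived[1:train_length] == "1",
             cut(full$Age[1:train_length], breaks = seq(0, 90, by = 10)),
             mean), 2)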
#Fare vs Survival
# For continuous variables we use a histogram; for categorical/factor variables a bar plot is easier to read
ggplot(data = full[1:train_length,] , aes(x=Fare, fill=Survived)) + geom_histogram(binwidth = 20 , position = "fill")
## Warning: Removed 28 rows containing missing values (geom_bar).
Passengers who paid the highest fares almost all survived.
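As a quick check, we can compare the survival rate above and below an arbitrary fare cutoff, say 100 (an illustrative sketch; the cutoff is arbitrary):
# Survival rate for fares above vs below 100, training rows only
tapply(full$Survived[1:train_length] == "1", full$Fare[1:train_length] > 100, mean)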
# Let's impute the missing Age values with the mean age
full$Age[is.na(full$Age)] <- mean(full$Age, na.rm = TRUE)
sum(is.na(full$Age))
## [1] 0
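A common alternative to a single global mean, shown here only as a sketch, is to impute Age with the median age within each passenger class; this would replace the mean imputation above rather than follow it:
# Alternative sketch: class-wise median imputation (replaces the mean imputation above)
full$Age <- ifelse(is.na(full$Age),
                   ave(full$Age, full$Pclass, FUN = function(x) median(x, na.rm = TRUE)),
                   full$Age)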
# Title of the passenger vs survival
# Extract the title from Name: drop everything up to ", " and everything from "." onward
full$Title <- gsub('(.*, )|(\\..*)','',full$Name)
unique(full$Title)
## [1] "Mr" "Mrs" "Miss" "Master" "Don"
## [6] "Rev" "Dr" "Mme" "Ms" "Major"
## [11] "Lady" "Sir" "Mlle" "Col" "Capt"
## [16] "the Countess" "Jonkheer" "Dona"
# Let's collapse the 18 titles into 5 categories
full$Title[full$Title=='Mlle']<- 'Miss'
full$Title[full$Title == 'Ms']<- 'Miss'
full$Title[full$Title == 'Mme']<- 'Mrs'
full$Title[full$Title == 'Lady']<- 'Miss'
full$Title[full$Title == 'Dona']<- 'Miss'
officer<- c('Capt' ,'Col','Don','Dr','Jonkheer','Major','Rev','Sir','the Countess')
full$Title[full$Title %in% officer]<-'Officer'
full$Title <- as.factor(full$Title)
ggplot(data = full[1:train_length,] , aes(x=Title, fill=Survived)) + geom_bar(position = "fill")
The survival rate is highest for passengers with the titles Miss and Mrs.
Here we will use only the features that help us predict passenger survival: Pclass, Sex, Age, Fare, SibSp, Parch and Title, with Survived as the outcome. We then split the training set into training_data and validation_data to estimate the error rate. Let’s select these features first.
Let’s train a model using logistic regression first:
train_important_features <- full[1:train_length,c("Survived","Pclass","Sex","Age","Fare","SibSp","Parch","Title")]
library(caret)
## Loading required package: lattice
inTrain <- createDataPartition(y=train_important_features$Survived , p=0.7,list=FALSE)
training_data <- train_important_features[inTrain,]
validation_data <- train_important_features[-inTrain,]
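Note that createDataPartition (and later randomForest) involve randomness, so the exact figures below will differ from run to run. For reproducible results one could fix the seed just before this partitioning step, for example:
# Optional: set a seed before createDataPartition for reproducibility
# (42 is an arbitrary choice; no seed was set in the run shown here)
set.seed(42)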
model_logistic <- glm(Survived ~ . , data = training_data , family = binomial(link="logit"))
summary(model_logistic)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"),
## data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4446 -0.4842 -0.3532 0.4757 2.5061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.808143 535.412099 0.033 0.973467
## Pclass2 -1.304577 0.395957 -3.295 0.000985 ***
## Pclass3 -2.416807 0.383278 -6.306 2.87e-10 ***
## Sexmale -13.325481 535.411670 -0.025 0.980144
## Age -0.034848 0.011556 -3.015 0.002566 **
## Fare 0.004809 0.003204 1.501 0.133340
## SibSp -0.631258 0.152413 -4.142 3.45e-05 ***
## Parch -0.462827 0.175906 -2.631 0.008511 **
## TitleMiss -13.471964 535.412063 -0.025 0.979926
## TitleMr -3.696676 0.688192 -5.372 7.81e-08 ***
## TitleMrs -13.172408 535.412207 -0.025 0.980372
## TitleOfficer -3.508043 1.011738 -3.467 0.000526 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.49 on 624 degrees of freedom
## Residual deviance: 470.69 on 613 degrees of freedom
## AIC: 494.69
##
## Number of Fisher Scoring iterations: 12
# Let's see the predictions on validation_data
# type = "response" returns predicted probabilities (the default is the log-odds/link scale)
pred_validation_logistic <- predict(model_logistic, validation_data, type = "response")
# Convert the probabilities to class labels: 1 for survived, 0 for not survived
pred_validation_logistic <- ifelse(pred_validation_logistic > 0.5, 1, 0)
cf <- confusionMatrix(as.factor(pred_validation_logistic), validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.857143e-01 5.266313e-01 7.314662e-01 8.334733e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 2.602819e-09 3.568628e-03
The accuracy of the logistic model on the validation data is about 78.6%.
This accuracy is reasonable, so let’s run the model on the test set.
testing_data <- full[892 : 1309 , c("Pclass","Sex","Age","SibSp","Parch","Fare","Title")]
pred_testing_logistic <- predict(model_logistic, testing_data, type = "response")
pred_testing_logistic <- ifelse(pred_testing_logistic > 0.5, 1, 0)
prediction_logistic <- data.frame(test$PassengerId , pred_testing_logistic)
gender_submission <- read.csv('gender_submission.csv',stringsAsFactors = FALSE)
names(prediction_logistic) <- names(gender_submission)
# Comparing our predictions with Kaggle's gender_submission baseline
# (it assumes every female survives; it is a sample submission, not the true labels)
table(ifelse(gender_submission$Survived == prediction_logistic$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 34 383
# Check for NA predictions (caused by the one missing Fare in the test set) and replace them with 0
prediction_logistic$Survived[is.na(prediction_logistic$Survived)]<-0
write.csv(prediction_logistic,file="prediction_logistic.csv",row.names = F)
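The NA predictions come from the single missing Fare in the test portion of the data. An alternative to zero-filling the predictions, sketched here, is to impute that Fare (for example with the overall median) before testing_data is built:
# Alternative sketch: impute the one missing Fare with the overall median
# (this would need to run before constructing testing_data above)
full$Fare[is.na(full$Fare)] <- median(full$Fare, na.rm = TRUE)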
Prediction using Decision Tree model:
model_decisiontree <- rpart(Survived~., data= training_data , method = "class")
rpart.plot(model_decisiontree)
pred.validation_decisiontree <- predict(model_decisiontree, validation_data, type = "prob")  # predict.rpart uses 'type', not 'method'; "prob" returns class probabilities
pred.validation_decisiontree_Survival = ifelse(pred.validation_decisiontree[,2]>0.5,1,0)
cf <- confusionMatrix(as.factor(pred.validation_decisiontree_Survival), validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.969925e-01 5.625533e-01 7.435748e-01 8.436587e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 1.927747e-10 2.206714e-01
The accuracy of the decision tree model is about 79.7%, slightly better than the logistic regression model on the validation data.
# Let's run this model on the test set
pred_testing_decisiontree <- predict(model_decisiontree, testing_data)
pred.testing_decisiontree_Survival = ifelse(pred_testing_decisiontree[,2]>0.5,1,0)
prediction_decisiontree <- data.frame(test$PassengerId , pred.testing_decisiontree_Survival)
names(prediction_decisiontree) <- names(gender_submission)
# Comparing our predictions with the gender_submission baseline
table(ifelse(gender_submission$Survived == prediction_decisiontree$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 22 396
prediction_decisiontree$Survived[is.na(prediction_decisiontree$Survived)]<-0
write.csv(prediction_decisiontree,file="prediction_decisiontree.csv",row.names = F)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
fancyRpartPlot(model_decisiontree)
Let’s predict survival using the random forest method:
model_randomforest <- randomForest(Survived~. , data= training_data)
# Error rate as trees are added to the forest
plot(model_randomforest)
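It is also worth looking at which features the forest relies on most, using randomForest's built-in variable importance plot:
# Variable importance of the fitted forest (mean decrease in Gini)
varImpPlot(model_randomforest)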
pred.validation_randomforest = predict(model_randomforest,validation_data)
cf = confusionMatrix(pred.validation_randomforest , validation_data$Survived)
cf$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.233083e-01 6.185391e-01 7.720582e-01 8.671900e-01 6.165414e-01
## AccuracyPValue McnemarPValue
## 1.973669e-13 1.446615e-01
The accuracy is about 82.3% on the validation data, higher than both the logistic regression and decision tree models in this run.
# Let's predict on the test data with the random forest model
pred.testing_randomforest <- predict(model_randomforest , testing_data)
prediction_randomforest <- data.frame(test$PassengerId , ifelse(pred.testing_randomforest==0,0,1))
names(prediction_randomforest) <- names(gender_submission)
prediction_randomforest$Survived[is.na(prediction_randomforest$Survived)]<-0
write.csv(prediction_randomforest,file="prediction_randomforest.csv",row.names = F)
table(ifelse(gender_submission$Survived == prediction_randomforest$Survived , "TRUE","FALSE"))
##
## FALSE TRUE
## 48 370
Based on the validation confusion matrices, the random forest gives the highest accuracy in this run, followed by the decision tree and then logistic regression (the exact figures vary between runs because no random seed was set). Let’s upload the test-set predictions from all three models to Kaggle to find out which is best suited to the Titanic data.
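For a compact side-by-side comparison, the three validation accuracies can be recomputed directly from the predictions created above (a small summary sketch):
# Validation accuracy of the three models, from the objects above
data.frame(
  model = c("logistic regression", "decision tree", "random forest"),
  validation_accuracy = c(
    mean(pred_validation_logistic == validation_data$Survived),
    mean(pred.validation_decisiontree_Survival == validation_data$Survived),
    mean(pred.validation_randomforest == validation_data$Survived)
  )
)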