Synopsis

In this paper, 10 different models are used to predict which Titanic passengers survived, based on different pieces of information about them. The Titanic dataset is provided by the Kaggle competition website. The models covered are:

  1. The Decision Tree
  2. The Naive Bayes
  3. The Logistic Regression
  4. The Support Vector Machine (SVM)
  5. The Kernel Support Vector Machine
  6. The Random Forest
  7. The K Nearest Neighbours (KNN)
  8. The Kernel PCA
  9. The Artificial Neural Network
  10. The Extreme Gradient (XG) Boost

The purpose of the paper is to discuss the performance of the different models while providing the code for the reader to reproduce the research. This paper does not cover the theory behind the different models, which can easily be found on the web.

The best performers are the Random Forest and the Kernel Support Vector Machine. Nearly all of the models achieved a success rate between 76.40% and 82.58% (with the exception of the Kernel PCA), which suggests that the quality of the data plays a much bigger part in prediction success than the choice of model. It is also interesting to note that all models performed better at predicting non-survival (up to 89% success) than survival (up to 75%).

The differences between the models are more pronounced in terms of computation time. The Artificial Neural Network was the slowest at 8.9 seconds and the KNN the fastest at 0.01 seconds; moreover, the KNN achieved nearly the same success rate as the Artificial Neural Network (79.78% versus 80.34%).

The datasets

The creation of the 4 datasets used in this paper is explained in the paper “Cleaning the Titanic Data” (http://rpubs.com/charlydethibault/348566).

  1. The training set, which is 80% of the training set provided by Kaggle (713 observations with 8 variables)
  2. The validation set, which is the remaining 20% of the training set provided by Kaggle (178 observations with 8 variables)
  3. The full training set from Kaggle, which we use to train our final model before submitting our predictions to Kaggle (891 observations with 8 variables)
  4. The test set, which is provided by Kaggle and which will be used to submit our best model (418 observations with 8 variables)

The original dataset is available on Kaggle (https://www.kaggle.com/c/titanic).

library(readr) # to read  csv files
training_set_clean <- read_csv("training_set_clean.csv") 
validation_set_clean <- read_csv("validation_set_clean.csv")
final_training_set_clean <- read_csv("final_training_set_clean.csv")
test_clean <- read_csv("test_clean.csv")
head(training_set_clean,1)
## # A tibble: 1 x 8
##   Survived Pclass   Sex        Age SibSp Parch       Fare Embarked
##      <int>  <int> <chr>      <dbl> <int> <int>      <dbl>    <chr>
## 1        0      3  male -0.5972816     1     0 -0.5368636        S

As shown above, 8 variables are kept for the analysis:

  1. Survived: binary outcome; 1 = survived, 0 = did not survive
  2. Pclass: class level; 1 = First, 2 = Second, 3 = Third
  3. Sex: binary outcome, male or female
  4. Age: age of the passenger, feature scaled to normalise the variable
  5. SibSp: number of siblings / spouses aboard the Titanic
  6. Parch: number of parents / children aboard the Titanic
  7. Fare: ticket fare, feature scaled to normalise the variable
  8. Embarked: port of embarkation of the passenger (C, Q, or S)

The “test_clean” dataset contains the same information, except that the “Survived” column is replaced by a “PassengerId” column.

Some of the variables need to be encoded as factors: “Survived”, “Pclass”, “Sex”, and “Embarked”.

str(training_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    713 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 1 1 1 0 ...
##  $ Pclass  : int  3 1 3 1 3 3 3 3 1 3 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  -0.597 0.671 -0.28 0.433 0.433 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch   : int  0 0 0 0 0 0 2 1 0 0 ...
##  $ Fare    : num  -0.537 0.852 -0.522 0.457 -0.52 ...
##  $ Embarked: chr  "S" "C" "S" "S" ...
str(validation_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    178 obs. of  8 variables:
##  $ Survived: int  0 0 1 0 0 1 0 0 0 1 ...
##  $ Pclass  : int  1 3 2 3 2 1 1 3 3 3 ...
##  $ Sex     : chr  "male" "male" "female" "female" ...
##  $ Age     : num  1.731 -1.849 -1.023 -1.023 0.423 ...
##  $ SibSp   : int  0 3 1 0 0 0 0 0 2 1 ...
##  $ Parch   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Fare    : num  0.3034 -0.1919 -0.0472 -0.4046 -0.1126 ...
##  $ Embarked: chr  "S" "S" "C" "S" ...
str(final_training_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  -0.565 0.663 -0.258 0.433 0.433 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  -0.502 0.786 -0.489 0.42 -0.486 ...
##  $ Embarked: chr  "S" "C" "S" "S" ...
str(test_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    418 obs. of  8 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  0.386 1.37 2.55 -0.205 -0.598 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Fare       : num  -0.497 -0.512 -0.464 -0.482 -0.417 ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
#Encode as factor
training_set_clean$Survived <- as.factor(training_set_clean$Survived)
training_set_clean$Pclass <- as.factor(training_set_clean$Pclass)
training_set_clean$Sex <- as.factor(training_set_clean$Sex)
training_set_clean$Embarked <- as.factor(training_set_clean$Embarked)

validation_set_clean$Survived <- as.factor(validation_set_clean$Survived)
validation_set_clean$Pclass <- as.factor(validation_set_clean$Pclass)
validation_set_clean$Sex <- as.factor(validation_set_clean$Sex)
validation_set_clean$Embarked <- as.factor(validation_set_clean$Embarked)

final_training_set_clean$Survived <- as.factor(final_training_set_clean$Survived)
final_training_set_clean$Pclass <- as.factor(final_training_set_clean$Pclass)
final_training_set_clean$Sex <- as.factor(final_training_set_clean$Sex)
final_training_set_clean$Embarked <- as.factor(final_training_set_clean$Embarked)

test_clean$Pclass <- as.factor(test_clean$Pclass)
test_clean$Sex <- as.factor(test_clean$Sex)
test_clean$Embarked <- as.factor(test_clean$Embarked)
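
The same encoding can be written more compactly. Below is an equivalent sketch that loops over the column names with lapply; the explicit version above is kept for readability.

# More compact equivalent of the factor encoding above
factor_cols <- c("Survived", "Pclass", "Sex", "Embarked")
training_set_clean[factor_cols] <- lapply(training_set_clean[factor_cols], as.factor)
validation_set_clean[factor_cols] <- lapply(validation_set_clean[factor_cols], as.factor)
final_training_set_clean[factor_cols] <- lapply(final_training_set_clean[factor_cols], as.factor)
test_clean[setdiff(factor_cols, "Survived")] <- lapply(test_clean[setdiff(factor_cols, "Survived")], as.factor)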

1. The Decision Tree

In the part below we create our decision tree and plot the results in two different ways. The plot on the left is not as good looking as the one on the right, but it provides very useful information: the branches of the “Sex” split are much longer than the other branches, which means that this variable is by far the most important one for predicting survival.

The Decision Tree Model Creation

library(rpart)
start.time <- Sys.time()
Dtree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data=training_set_clean,
             method="class")
Dtree_pred = predict(Dtree, newdata = validation_set_clean[-1], type = 'class') # [-1] removes the "Survived" column, which is what we attempt to predict
end.time <- Sys.time()
time.takenDT <- end.time - start.time
par(mfrow=c(1,2)) # print 2 charts horizontally
plot(Dtree)
text(Dtree)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(Dtree)

Now we test our model on our validation set and create the confusion matrix to assess the reliability of our model. The confusion matrix lets us analyse the success rate per class (survival or non-survival): the rows show the actual classes and the columns show the predicted classes.

The Decision Tree Confusion Matrix

# creation of confusion matrix
Dtree_cm = table(t(validation_set_clean[, 1]), Dtree_pred)
# compute the overall success rate
Dtree_success = sum(diag(Dtree_cm))/sum(Dtree_cm)*100 
# compute the success rate to predict survival
Dtree_success_survived = Dtree_cm[2,2]/(Dtree_cm[2,1]+Dtree_cm[2,2])*100
# compute the success rate to predict casualty
Dtree_success_notsurvived = Dtree_cm[1,1]/(Dtree_cm[1,2]+Dtree_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
Dtree_cm
##    Dtree_pred
##      0  1
##   0 95 15
##   1 25 43

The prediction success rate is 77.53%, the prediction success for survival is 63.24%, while the prediction success for non-survival is 86.36%. The model took 0.03 seconds to perform the prediction.
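
As a sanity check, the same metrics can be computed with the caret package (an optional cross-check, assuming caret is installed; it is not used elsewhere in this paper):

# Optional cross-check of the confusion-matrix metrics with caret
library(caret)
confusionMatrix(data = Dtree_pred,
                reference = validation_set_clean$Survived,
                positive = "1") # "Sensitivity" is then the survival success rate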

2. The Naive Bayes

The Naive Bayes Model Creation

library(e1071)
start.time <- Sys.time()
NBayes = naiveBayes(x = training_set_clean[-1],
                        y = training_set_clean$Survived)
# -1 removes the "Survived" column as it is what we plan to predict
NBayes_pred = predict(NBayes, newdata = validation_set_clean[-1], type = 'class')
end.time <- Sys.time()
time.takenNB <- end.time - start.time

The Naive Bayes Confusion Matrix

# creation of confusion matrix
NBayes_cm = table(t(validation_set_clean[, 1]), NBayes_pred)
# compute the overall success rate
NBayes_success = sum(diag(NBayes_cm))/sum(NBayes_cm)*100 
# compute the success rate to predict survival
NBayes_success_survived = NBayes_cm[2,2]/(NBayes_cm[2,1]+NBayes_cm[2,2])*100 
# compute the success rate to predict casualty
NBayes_success_notsurvived = NBayes_cm[1,1]/(NBayes_cm[1,2]+NBayes_cm[1,1])*100
NBayes_cm
##    NBayes_pred
##      0  1
##   0 97 13
##   1 23 45

The prediction success rate is 79.78%, the prediction success for survival is 66.18%, while the prediction success for non-survival is 88.18%. The model took 0.042 seconds to perform the prediction.

3. The Logistic Regression

The Logistic Regression Model Creation

start.time <- Sys.time()
LRegression = glm(formula = Survived ~ .,
                 family = binomial,
                 data = training_set_clean)
LRegression_pred = predict(LRegression, type = 'response', validation_set_clean[-1])
LRegression_pred = ifelse(LRegression_pred > 0.5, 1, 0)
end.time <- Sys.time()
time.takenLR <- end.time - start.time

The Logistic Regression Model Analysis

LRegression
## 
## Call:  glm(formula = Survived ~ ., family = binomial, data = training_set_clean)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age  
##     3.08676     -0.90927     -2.11209     -2.74126     -0.49880  
##       SibSp        Parch         Fare    EmbarkedQ    EmbarkedS  
##    -0.27739     -0.05257     -0.01061     -0.01883     -0.59336  
## 
## Degrees of Freedom: 712 Total (i.e. Null);  703 Residual
## Null Deviance:       949.9 
## Residual Deviance: 629.3     AIC: 649.3

The most negative coefficient is “Sexmale”, which means, as the decision tree showed earlier, that being a man on the Titanic was the factor that most reduced the chances of survival. The next two most negative coefficients are Pclass3 and Pclass2, which means that people had a much higher chance of survival in first class.
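
Since logistic regression coefficients are log-odds, exponentiating them gives odds ratios, which are easier to interpret:

# Odds ratios: for example exp(-2.74126) is roughly 0.06, i.e. being male
# multiplies the odds of survival by about 0.06 compared with being female
exp(coef(LRegression))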

The Logistic Regression Confusion Matrix

# creation of confusion matrix
LRegression_cm = table(t(validation_set_clean[, 1]), LRegression_pred)
# compute the overall success rate
LRegression_success = sum(diag(LRegression_cm))/sum(LRegression_cm)*100
# compute the success rate to predict survival
LRegression_success_survived = LRegression_cm[2,2]/(LRegression_cm[2,1]+LRegression_cm[2,2])*100
# compute the success rate to predict casualty
LRegression_success_notsurvived = LRegression_cm[1,1]/(LRegression_cm[1,2]+LRegression_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
LRegression_cm
##    LRegression_pred
##      0  1
##   0 91 19
##   1 21 47

The prediction success rate is 77.53%, the prediction success for survival is 69.12%, while the prediction success for non-survival is 82.73%. The model took 0.02 seconds to perform the prediction.

4. The Support Vector Machine (SVM)

The Support Vector Machine (SVM) Model Creation

library(e1071)
start.time <- Sys.time()
SVM = svm(formula = Survived ~ .,
                 data = training_set_clean,
                 type = 'C-classification',
                 kernel = 'linear') 
SVM_pred = predict(SVM, newdata = validation_set_clean[-1])
end.time <- Sys.time()
time.takenSVM <- end.time - start.time

The Support Vector Machine (SVM) Confusion Matrix

# creation of confusion matrix
SVM_cm = table(t(validation_set_clean[, 1]), SVM_pred)
# compute the overall success rate
SVM_success = sum(diag(SVM_cm))/sum(SVM_cm)*100
# compute the success rate to predict survival
SVM_success_survived = SVM_cm[2,2]/(SVM_cm[2,1]+SVM_cm[2,2])*100
# compute the success rate to predict casualty
SVM_success_notsurvived = SVM_cm[1,1]/(SVM_cm[1,2]+SVM_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
SVM_cm
##    SVM_pred
##      0  1
##   0 91 19
##   1 23 45

The prediction success rate is 76.4%, the prediction success for survival is 66.18%, while the prediction success for non-survival is 82.73%. The model took 0.1 seconds to perform the prediction.

5. The Kernel Support Vector Machine (SVM)

The Kernel Support Vector Machine (SVM) Model Creation

The model is very similar to the SVM: only the “kernel” argument has to be changed from linear to radial.

library(e1071)
start.time <- Sys.time()
KSVM = svm(formula = Survived ~ .,
                 data = training_set_clean,
                 type = 'C-classification',
                 kernel = 'radial')# linear for SVM , radial for kernel
KSVM_pred = predict(KSVM, newdata = validation_set_clean[-1])
end.time <- Sys.time()
time.takenKSVM <- end.time - start.time

The Kernel Support Vector Machine (SVM) Confusion Matrix

# creation of confusion matrix
KSVM_cm = table(t(validation_set_clean[, 1]), KSVM_pred)
# compute the overall success rate
KSVM_success = sum(diag(KSVM_cm))/sum(KSVM_cm)*100
# compute the success rate to predict survival
KSVM_success_survived = KSVM_cm[2,2]/(KSVM_cm[2,1]+KSVM_cm[2,2])*100
# compute the success rate to predict casualty
KSVM_success_notsurvived = KSVM_cm[1,1]/(KSVM_cm[1,2]+KSVM_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
KSVM_cm
##    KSVM_pred
##      0  1
##   0 96 14
##   1 17 51

The prediction success rate is 82.58%, the prediction success for survival is 75%, while the prediction success for non-survival is 87.27%. The model took 0.22 seconds to perform the prediction.
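
The radial kernel has two hyperparameters, gamma and cost, which were left at their defaults above. Below is a hedged sketch of how they could be tuned by cross-validation with e1071's tune function; the grid values are only illustrative and this search was not run for this paper.

# Illustrative hyperparameter search for the kernel SVM (grid values are assumptions)
set.seed(123)
KSVM_tune = tune(svm, Survived ~ ., data = training_set_clean,
                 kernel = 'radial',
                 ranges = list(gamma = c(0.01, 0.1, 1), cost = c(1, 10, 100)))
summary(KSVM_tune) # best gamma/cost pair by 10-fold cross-validation error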

6. The Random Forest

The Random Forest Model Creation

library(randomForest)
start.time <- Sys.time()
set.seed(123)
RForest = randomForest(x = training_set_clean[-1],
                          y = training_set_clean$Survived,
                          ntree = 500) # number of trees you want in the forest
# Predicting the Test set results
RForest_pred = predict(RForest, newdata = validation_set_clean)
end.time <- Sys.time()
time.takenRF <- end.time - start.time

Inside the Model

RForest
## 
## Call:
##  randomForest(x = training_set_clean[-1], y = training_set_clean$Survived,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.97%
## Confusion matrix:
##     0   1 class.error
## 0 401  38  0.08656036
## 1  83 191  0.30291971

The model details above show an expected (out-of-bag) error rate of 16.97% and confirm that the model is much better at predicting non-survival than survival.
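
The forest can also report which variables matter most, echoing what the decision tree showed. A short sketch using randomForest's built-in importance measures:

# Mean decrease in Gini impurity per variable; higher means more important
importance(RForest)
varImpPlot(RForest)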

The Random Forest Confusion Matrix

# creation of confusion matrix 
RF_cm = table(t(validation_set_clean[, 1]), RForest_pred)
# compute the overall success rate
RF_success = sum(RF_cm[1,1]+RF_cm[2,2])/sum(RF_cm)*100
# compute the success rate to predict survival
RF_success_survived = RF_cm[2,2]/(RF_cm[2,1]+RF_cm[2,2])*100
# compute the success rate to predict casualty
RF_success_notsurvived = RF_cm[1,1]/(RF_cm[1,2]+RF_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
RF_cm
##    RForest_pred
##      0  1
##   0 98 12
##   1 22 46

The prediction success rate is 80.9%, the prediction success for survival is 67.65%, while the prediction success for non-survival is 89.09%. The model took 0.38 seconds to perform the prediction.

7. The K Nearest Neighbours (KNN)

The Data Preparation

To run the model we need to recode the character variables as integers and then set them as factors.

#create new dataframe for knn
training_set_cleanKNN <- as.data.frame(training_set_clean)
validation_set_cleanKNN <- as.data.frame(validation_set_clean)
# change characters into integers
training_set_cleanKNN$Sex <- sapply(as.character(training_set_cleanKNN$Sex), switch, 'male' = 0, 'female' = 1)
validation_set_cleanKNN$Sex <- sapply(as.character(validation_set_cleanKNN$Sex), switch, 'male' = 0, 'female' = 1)
training_set_cleanKNN$Embarked <- sapply(as.character(training_set_cleanKNN$Embarked), switch, 'C' = 0, 'Q' = 1, 'S' = 2)
validation_set_cleanKNN$Embarked <- sapply(as.character(validation_set_cleanKNN$Embarked), switch, 'C' = 0, 'Q' = 1, 'S' = 2)
# set integers as factors
training_set_cleanKNN$Sex <- as.factor(training_set_cleanKNN$Sex)
validation_set_cleanKNN$Sex <- as.factor(validation_set_cleanKNN$Sex)
training_set_cleanKNN$Embarked <- as.factor(training_set_cleanKNN$Embarked)
validation_set_cleanKNN$Embarked <- as.factor(validation_set_cleanKNN$Embarked)

Finding the best number of clusters with the elbow method

The elbow method compares the amount of within-cluster variance against the number of clusters. Too few clusters leave too much variance, while too many clusters lead to overfitting. The approach is to choose the number of clusters that yields the biggest drop in variance, which is the elbow of the chart below. Strictly speaking, the elbow method selects the number of clusters for k-means; here we use it as a heuristic for the number of neighbours k in KNN.

set.seed(6)
# kmeans needs numeric input, so convert the factor columns back to numbers first
training_set_numericKNN = sapply(training_set_cleanKNN, function(col) as.numeric(as.character(col)))
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(training_set_numericKNN, i)$withinss)
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS') 

The elbow method is not straightforward in this situation, as we do not have a clear elbow. However, the slope tends to flatten after 5 clusters, so we use k = 5.

The K Nearest Neighbours (KNN) Model

# Fitting K-NN to the Training set and Predicting the Test set results
library(class)
start.time <- Sys.time()
KNN_pred = knn(train = training_set_cleanKNN[, -1],
             test = validation_set_cleanKNN[, -1],
             cl = training_set_cleanKNN[, 1],
             k = 5,
             prob = TRUE)
end.time <- Sys.time()
time.takenKNN <- end.time - start.time

The K Nearest Neighbours (KNN) Confusion Matrix

# create confusion matrix
KNN_cm = table(t(validation_set_clean[, 1]), KNN_pred)
# compute the overall success rate
KNN_success = sum(KNN_cm[1,1]+KNN_cm[2,2])/sum(KNN_cm)*100
# compute the success rate to predict survival
KNN_success_survived = KNN_cm[2,2]/(KNN_cm[2,1]+KNN_cm[2,2])*100
# compute the success rate to predict casualty
KNN_success_notsurvived = KNN_cm[1,1]/(KNN_cm[1,2]+KNN_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
KNN_cm
##    KNN_pred
##      0  1
##   0 95 15
##   1 21 47

The prediction success rate is 79.78%, the prediction success for survival is 69.12%, while the prediction success for non-survival is 86.36%. The model took 0.01 seconds to perform the prediction.
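
Since the elbow was ambiguous, a more direct check is to compare the validation accuracy for several values of k. The sketch below is illustrative and was not part of the original run.

# Illustrative sweep over k, scored on the validation set
for (k in c(3, 5, 7, 9, 11)) {
  pred = knn(train = training_set_cleanKNN[, -1],
             test = validation_set_cleanKNN[, -1],
             cl = training_set_cleanKNN[, 1],
             k = k)
  cat("k =", k, "- success rate:",
      round(mean(pred == validation_set_cleanKNN[, 1]) * 100, 2), "%\n")
}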

8. The Kernel PCA

The Kernel PCA reduces the number of features by projecting the data onto a small number of principal components computed in a higher-dimensional kernel space. Here we keep two components and fit a logistic regression on them.

The Kernel PCA Model Creation

library(kernlab)
start.time <- Sys.time()
kpca = kpca(~., data = training_set_clean[-1], kernel = 'rbfdot', features = 2) # kernel trick with two features
training_set_pca = as.data.frame(predict(kpca, training_set_clean)) # transform the 2 features into a dataframe
training_set_pca$Survived = training_set_clean$Survived # add the Survived column back
validation_set_clean_pca = as.data.frame(predict(kpca, validation_set_clean)) # transform the 2 features into a dataframe
validation_set_clean_pca$Survived = validation_set_clean$Survived # add the Survived column back


# Fitting Logistic Regression to the Training set
# (a new name is used so the kpca object above is not overwritten)
kpca_lr = glm(formula = Survived ~ .,
                 family = binomial,
                 data = training_set_pca)

# Predicting the Test set results
prob_pred = predict(kpca_lr, type = 'response', newdata = validation_set_clean_pca)
kpca_pred = ifelse(prob_pred > 0.5, 1, 0)
end.time <- Sys.time()
time.takenkpca <- end.time - start.time

The Kernel PCA Confusion Matrix

# create confusion matrix
kpca_cm = table(t(validation_set_clean[, 1]), kpca_pred)
# compute the overall success rate
kpca_success = sum(kpca_cm[1,1]+kpca_cm[2,2])/sum(kpca_cm)*100
# compute the success rate to predict survival
kpca_success_survived = kpca_cm[2,2]/(kpca_cm[2,1]+kpca_cm[2,2])*100
# compute the success rate to predict casualty
kpca_success_notsurvived = kpca_cm[1,1]/(kpca_cm[1,2]+kpca_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
kpca_cm
##    kpca_pred
##      0  1
##   0 97 13
##   1 47 21

The prediction success rate is 66.29%, the prediction success for survival is 30.88%, while the prediction success for non-survival is 88.18%. The model took 1.35 seconds to perform the prediction.
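
Keeping only two kernel components is an aggressive reduction and probably explains the poor survival predictions. A hedged variant (not run here) keeps more components before fitting the same logistic regression:

# Hypothetical variant: keep 4 kernel components instead of 2
kpca4 = kpca(~., data = training_set_clean[-1], kernel = 'rbfdot', features = 4)
training_set_pca4 = as.data.frame(predict(kpca4, training_set_clean))
training_set_pca4$Survived = training_set_clean$Survived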

9. The Artificial Neural Network(ANN)

Artificial Neural Network Model Creation

In order to enable reproducibility, we have to work on a single thread, which makes the model much slower than with h2o.init(nthreads = -1), which would use all the cores available on your computer.

library(h2o)
start.time <- Sys.time()
h2o.init(nthreads = 1) # one thread for reproducibility; -1 would use all CPUs on the host (the default)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\Charly\AppData\Local\Temp\Rtmpy4feEq/h2o_Charly_started_from_r.out
##     C:\Users\Charly\AppData\Local\Temp\Rtmpy4feEq/h2o_Charly_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 314 milliseconds 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    1 month and 16 days  
##     H2O cluster name:           H2O_started_from_R_Charly_tky049 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.75 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  1 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.2 (2017-09-28)
ann = h2o.deeplearning(y = 'Survived',
                         training_frame = as.h2o(training_set_clean),
                         activation = 'Rectifier',
                         hidden = c(5,5), # two hidden layers with 5 neurons each
                         epochs = 500, # how many times to iterate over the dataset
                         train_samples_per_iteration = -2,
                         reproducible = TRUE,
                         seed = 123) 
# Predicting the Test set results
ann_pred = h2o.predict(ann, newdata = as.h2o(validation_set_clean[-1]))
ann_pred = as.vector(ann_pred$predict)
end.time <- Sys.time()
time.takenann <- end.time - start.time
#h2o.shutdown()

Inside the Model

In the model's confusion matrix below, we see a prediction error rate of about 0.16, and thus a success rate of about 84%, on the training set.

ann
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1516092219966_1 
## Status of Neuron Layers: predicting Survived, 2-class classification, bernoulli distribution, CrossEntropy loss, 122 weights/biases, 5.6 KB, 74,865 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1    15     Input  0.00 %                                     
## 2     2     5 Rectifier  0.00 % 0.000000 0.000000  0.201471 0.403857
## 3     3     5 Rectifier  0.00 % 0.000000 0.000000  0.001013 0.001453
## 4     4     2   Softmax         0.000000 0.000000  0.002048 0.001156
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000    0.078170   0.309146  0.555095 0.109815
## 3 0.000000   -0.151936   0.524780  0.936496 0.092960
## 4 0.000000    0.140691   2.892392 -0.000000 0.377377
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## MSE:  0.1240915
## RMSE:  0.3522663
## LogLoss:  0.4031934
## Mean Per-Class Error:  0.1744218
## AUC:  0.8719884
## Gini:  0.7439769
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error      Rate
## 0      390  49 0.111617   =49/439
## 1       65 209 0.237226   =65/274
## Totals 455 258 0.159888  =114/713
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.392579 0.785714 196
## 2                       max f2  0.143356 0.809979 285
## 3                 max f0point5  0.565507 0.836347 149
## 4                 max accuracy  0.565507 0.842917 149
## 5                max precision  0.996546 1.000000   0
## 6                   max recall  0.066585 1.000000 395
## 7              max specificity  0.996546 1.000000   0
## 8             max absolute_mcc  0.565507 0.666457 149
## 9   max min_per_class_accuracy  0.284162 0.810219 233
## 10 max mean_per_class_accuracy  0.392579 0.825578 196
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

The Artificial Neural Network Confusion Matrix

# creation of confusion matrix
ann_cm = table(t(validation_set_clean[, 1]), ann_pred)
# compute the overall success rate
ann_success = sum(ann_cm[1,1]+ann_cm[2,2])/sum(ann_cm)*100
# compute the success rate to predict survival
ann_success_survived = ann_cm[2,2]/(ann_cm[2,1]+ann_cm[2,2])*100
# compute the success rate to predict casualty
ann_success_notsurvived = ann_cm[1,1]/(ann_cm[1,2]+ann_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
ann_cm
##    ann_pred
##      0  1
##   0 92 18
##   1 17 51

The prediction success rate is 80.34%, the prediction success for survival is 75%, while the prediction success for non-survival is 83.64%. The model took 8.9 seconds to perform the prediction.

10. The Extreme Gradient (XG) Boost

The Data Preparation

The factors need to be converted to numeric, as xgboost only accepts numeric matrices. Note that as.numeric applied to a factor returns the underlying level codes (here 1 and 2) rather than the labels, which is why 1 is subtracted from the label vector when the model is fitted below.

#Check which variables are factors
str(training_set_cleanKNN,give.attr = FALSE)
## 'data.frame':    713 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 3 3 1 3 ...
##  $ Sex     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Age     : num  -0.597 0.671 -0.28 0.433 0.433 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch   : int  0 0 0 0 0 0 2 1 0 0 ...
##  $ Fare    : num  -0.537 0.852 -0.522 0.457 -0.52 ...
##  $ Embarked: Factor w/ 3 levels "0","1","2": 3 1 3 3 3 2 3 3 3 3 ...
training_set_cleanXG <- training_set_cleanKNN
validation_set_cleanXG <- validation_set_cleanKNN
training_set_cleanXG$Survived <- as.numeric(training_set_cleanXG$Survived)
validation_set_cleanXG$Survived <- as.numeric(validation_set_cleanXG$Survived)
training_set_cleanXG$Pclass <- as.numeric(training_set_cleanXG$Pclass )
validation_set_cleanXG$Pclass  <- as.numeric(validation_set_cleanXG$Pclass)
training_set_cleanXG$Sex <- as.numeric(training_set_cleanXG$Sex)
validation_set_cleanXG$Sex <- as.numeric(validation_set_cleanXG$Sex)
training_set_cleanXG$Embarked <- as.numeric(training_set_cleanXG$Embarked)
validation_set_cleanXG$Embarked <- as.numeric(validation_set_cleanXG$Embarked)

The XGBoost Model Creation

#Fitting XGBoost to the Training set
#install.packages('xgboost')
library(xgboost)
start.time <- Sys.time()
classifierxg = xgboost(data = as.matrix(training_set_cleanXG[-1]),
                       label = training_set_cleanXG$Survived - 1, # levels were coded 1/2, so subtract 1 to get 0/1
                       nrounds = 10)
## [1]  train-rmse:0.417741 
## [2]  train-rmse:0.365128 
## [3]  train-rmse:0.331008 
## [4]  train-rmse:0.310080 
## [5]  train-rmse:0.293606 
## [6]  train-rmse:0.282539 
## [7]  train-rmse:0.271436 
## [8]  train-rmse:0.262833 
## [9]  train-rmse:0.256366 
## [10] train-rmse:0.250135
# Predicting the Test set results

xg_pred = predict(classifierxg, newdata = as.matrix(validation_set_cleanXG[-1]))
xg_pred = (xg_pred >= 0.5)
end.time <- Sys.time()
time.takenxg <- end.time - start.time

The XGBoost Confusion Matrix

# creation of confusion matrix
xg_cm = table(t(validation_set_cleanXG[, 1]), xg_pred)
# compute the overall success rate
xg_success = sum(xg_cm[1,1]+xg_cm[2,2])/sum(xg_cm)*100
# compute the success rate to predict survival
xg_success_survived = xg_cm[2,2]/(xg_cm[2,1]+xg_cm[2,2])*100
# compute the success rate to predict casualty
xg_success_notsurvived = xg_cm[1,1]/(xg_cm[1,2]+xg_cm[1,1])*100
# actual classes in rows, predicted classes in columns (68 survivals in the validation set)
xg_cm
##    xg_pred
##     FALSE TRUE
##   1    96   14
##   2    25   43

The prediction success rate is 78.09%, the prediction success for survival is 63.24%, while the prediction success for non-survival is 87.27%. The model took 0.24 seconds to perform the prediction.
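
xgboost was fitted above with its default regression objective (hence the train-rmse log). A hedged variant, not run here, treats the task explicitly as binary classification, in which case the predictions are survival probabilities:

# Hypothetical variant with an explicit binary classification objective
classifierxg2 = xgboost(data = as.matrix(training_set_cleanXG[-1]),
                        label = training_set_cleanXG$Survived - 1,
                        objective = "binary:logistic",
                        nrounds = 10)
xg_prob = predict(classifierxg2, newdata = as.matrix(validation_set_cleanXG[-1]))
xg_pred2 = ifelse(xg_prob >= 0.5, 1, 0)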

Result Summary

The table below shows that the best performers are the Kernel Support Vector Machine and the Random Forest. Those two models are going to be used in the next part of the paper to test predictions on the Kaggle website.

The worst performer was the Kernel PCA with 66.29%, which is much lower than the other models. It is also the second slowest model in terms of computation time.

If we exclude the worst performer (Kernel PCA) and the best performer (Kernel Support Vector Machine), there is only a 4.49 percentage point difference between the worst and the best of the remaining models, which suggests that the choice of model does not play a major part in performance.

Results also show that all models were better at predicting non-survival than survival. This is a common flaw: models try to maximise the overall success rate, and since there were more deaths than survivals in the data, they tend to favour the majority class (non-survival). One way to cope with this is to use a ROC curve, which compares the true positive rate with the false positive rate. This will however not be covered in this paper, as Kaggle scores the prediction on the two outcomes combined.
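
For readers who do want to explore this, below is a minimal sketch using the pROC package (assuming pROC is installed; this analysis is outside the scope of the paper):

# ROC curve and AUC for the logistic regression on the validation set
library(pROC)
LR_prob = predict(LRegression, newdata = validation_set_clean[-1], type = 'response')
roc_LR = roc(response = validation_set_clean$Survived, predictor = LR_prob)
plot(roc_LR)
auc(roc_LR)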

The Artificial Neural Network was also by far the slowest model to compute.

Model                           Overall success   Survival success   Non-survival success   Time (sec)
Decision Tree                   77.53%            63.24%             86.36%                 0.027
Naive Bayes                     79.78%            66.18%             88.18%                 0.042
Logistic Regression             77.53%            69.12%             82.73%                 0.017
Support Vector Machine (SVM)    76.40%            66.18%             82.73%                 0.100
Kernel Support Vector Machine   82.58%            75.00%             87.27%                 0.221
Random Forest                   80.90%            67.65%             89.09%                 0.385
K Nearest Neighbours (KNN)      79.78%            69.12%             86.36%                 0.014
Kernel PCA                      66.29%            30.88%             88.18%                 1.354
Artificial Neural Network       80.34%            75.00%             83.64%                 8.900
XGBoost                         78.09%            63.24%             87.27%                 0.241

Submit prediction to Kaggle

The Random Forest

library(randomForest)
set.seed(123)
RForestfinal = randomForest(x = final_training_set_clean[-1],
                          y = final_training_set_clean$Survived,
                          ntree = 500) # number of trees you want in the forest
# Predicting the Test set results
RForest_predfinal = predict(RForestfinal, newdata = test_clean)
submit <- data.frame(PassengerId = test_clean$PassengerId, Survived = RForest_predfinal)
write.csv(submit, file = "RFpredict.csv", row.names = FALSE)

The submission did not do better than an earlier submission I had performed with the Kernel Support Vector Machine.

The Kernel Support Vector Machine

library(e1071)
start.time <- Sys.time()
KSVM = svm(formula = Survived ~ .,
                 data = final_training_set_clean,
                 type = 'C-classification',
                 kernel = 'radial')# linear for SVM , radial for kernel
KSVM_predfinal = predict(KSVM, newdata = test_clean)
submit <- data.frame(PassengerId = test_clean$PassengerId, Survived = KSVM_predfinal)
write.csv(submit, file = "KSVM.csv", row.names = FALSE)

The Combo Prediction

Finally, we use all the models together to predict which passengers were more likely to survive.
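
Below is a minimal sketch of one way to do this, a majority vote over the validation-set predictions computed earlier; it is one plausible implementation, not necessarily the exact combination that was submitted.

# Majority vote across 7 of the classifiers fitted above (a sketch)
votes = as.numeric(as.character(Dtree_pred)) +
        as.numeric(as.character(NBayes_pred)) +
        LRegression_pred + # already numeric 0/1
        as.numeric(as.character(SVM_pred)) +
        as.numeric(as.character(KSVM_pred)) +
        as.numeric(as.character(RForest_pred)) +
        as.numeric(as.character(KNN_pred))
combo_pred = ifelse(votes >= 4, 1, 0) # predict survival if at least 4 of the 7 models agree
mean(combo_pred == as.numeric(as.character(validation_set_clean$Survived))) * 100 # success rate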

The results did not improve, which suggests that the models have reached their limits and that more work should be performed on the data preparation to improve predictions.