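In this paper, 10 different models are used to predict which Titanic passengers survived, based on passenger information. The Titanic dataset is provided by the Kaggle competition website. The models covered are: Decision Tree, Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Kernel Support Vector Machine, Random Forest, K nearest neighbours (KNN), Kernel PCA, Artificial Neural Network, and XGBoost.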
The purpose of the paper is to discuss the performance of the different models while providing the code needed to reproduce the research. This paper does not cover the details of the models themselves, which can easily be found on the web.
The best performers are the Kernel Support Vector Machine and the Random Forest. However, all models except the Kernel PCA had a success rate between 76.40% and 82.58%, which suggests that the quality of the data plays a much bigger part in prediction success than the choice of model. It is also interesting to note that all models were better at predicting non-survival (up to 89% success) than survival (up to 75%).
The models differ more in computation time. The Artificial Neural Network was the slowest at 8.9 seconds and the KNN the fastest at 0.01 seconds, yet the KNN's success rate (79.78%) was nearly as high as the Artificial Neural Network's (80.34%).
The creation of the 4 datasets used in this paper is explained in the paper “Cleaning the Titanic Data” (http://rpubs.com/charlydethibault/348566).
The original dataset is available on Kaggle (https://www.kaggle.com/c/titanic).
library(readr) # to read csv files
training_set_clean <- read_csv("training_set_clean.csv")
validation_set_clean <- read_csv("validation_set_clean.csv")
final_training_set_clean <- read_csv("final_training_set_clean.csv")
test_clean <- read_csv("test_clean.csv")
head(training_set_clean,1)
## # A tibble: 1 x 8
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## <int> <int> <chr> <dbl> <int> <int> <dbl> <chr>
## 1 0 3 male -0.5972816 1 0 -0.5368636 S
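As shown above, 8 variables are kept for the analysis: Survived (0 = no, 1 = yes), Pclass (ticket class), Sex, Age, SibSp (number of siblings or spouses aboard), Parch (number of parents or children aboard), Fare, and Embarked (port of embarkation). Age and Fare appear to have been standardized.
The “test_clean” dataset contains the same information, except that the “Survived” column is replaced by “PassengerId”.
Some of the variables need to be encoded as factors: “Survived”, “Pclass”, “Sex”, and “Embarked”.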
str(training_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 713 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 1 1 1 0 ...
## $ Pclass : int 3 1 3 1 3 3 3 3 1 3 ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num -0.597 0.671 -0.28 0.433 0.433 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 1 0 0 ...
## $ Fare : num -0.537 0.852 -0.522 0.457 -0.52 ...
## $ Embarked: chr "S" "C" "S" "S" ...
str(validation_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 178 obs. of 8 variables:
## $ Survived: int 0 0 1 0 0 1 0 0 0 1 ...
## $ Pclass : int 1 3 2 3 2 1 1 3 3 3 ...
## $ Sex : chr "male" "male" "female" "female" ...
## $ Age : num 1.731 -1.849 -1.023 -1.023 0.423 ...
## $ SibSp : int 0 3 1 0 0 0 0 0 2 1 ...
## $ Parch : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Fare : num 0.3034 -0.1919 -0.0472 -0.4046 -0.1126 ...
## $ Embarked: chr "S" "S" "C" "S" ...
str(final_training_set_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num -0.565 0.663 -0.258 0.433 0.433 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num -0.502 0.786 -0.489 0.42 -0.486 ...
## $ Embarked: chr "S" "C" "S" "S" ...
str(test_clean,give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 418 obs. of 8 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 0.386 1.37 2.55 -0.205 -0.598 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Fare : num -0.497 -0.512 -0.464 -0.482 -0.417 ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
#Encode as factor
training_set_clean$Survived <- as.factor(training_set_clean$Survived)
training_set_clean$Pclass <- as.factor(training_set_clean$Pclass)
training_set_clean$Sex <- as.factor(training_set_clean$Sex)
training_set_clean$Embarked <- as.factor(training_set_clean$Embarked)
validation_set_clean$Survived <- as.factor(validation_set_clean$Survived)
validation_set_clean$Pclass <- as.factor(validation_set_clean$Pclass)
validation_set_clean$Sex <- as.factor(validation_set_clean$Sex)
validation_set_clean$Embarked <- as.factor(validation_set_clean$Embarked)
final_training_set_clean$Survived <- as.factor(final_training_set_clean$Survived)
final_training_set_clean$Pclass <- as.factor(final_training_set_clean$Pclass)
final_training_set_clean$Sex <- as.factor(final_training_set_clean$Sex)
final_training_set_clean$Embarked <- as.factor(final_training_set_clean$Embarked)
test_clean$Pclass <- as.factor(test_clean$Pclass)
test_clean$Sex <- as.factor(test_clean$Sex)
test_clean$Embarked <- as.factor(test_clean$Embarked)
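The repeated conversions above can also be written once and applied to every dataset. The following is a minimal sketch rather than the author's code; the helper name `encode_factors` and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: convert the listed columns to factors when present
encode_factors <- function(df, cols = c("Survived", "Pclass", "Sex", "Embarked")) {
  present <- intersect(cols, names(df))  # test_clean has no "Survived" column
  df[present] <- lapply(df[present], as.factor)
  df
}

# Toy stand-in for one of the cleaned datasets (assumption)
toy <- data.frame(Survived = c(0L, 1L), Pclass = c(3L, 1L),
                  Sex = c("male", "female"), Age = c(-0.6, 0.7),
                  Embarked = c("S", "C"), stringsAsFactors = FALSE)
toy <- encode_factors(toy)
str(toy)
```

With the real data, this would be called as `training_set_clean <- encode_factors(training_set_clean)`, and likewise for the other three datasets.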
Below, we build the decision tree and plot the result in two different ways. The tree on the left is not as good-looking as the one on the right, but it provides very useful information: the branches of the “Sex” split are much longer than the others, which means that this variable is by far the most important one for predicting survival.
library(rpart)
start.time <- Sys.time()
Dtree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data=training_set_clean,
method="class")
Dtree_pred = predict(Dtree, newdata = validation_set_clean[-1], type = 'class')# -1 remove what we attempt to predict
end.time <- Sys.time()
time.takenDT <- end.time - start.time
par(mfrow=c(1,2)) # print 2 charts horizontally
plot(Dtree)
text(Dtree)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(Dtree)
Now we test the model on the validation set and create the confusion matrix to assess its reliability. The confusion matrix lets us analyse the success rate for each outcome (survival or non-survival). The rows hold the actual classes and the columns hold the predicted classes.
# creation of confusion matrix
Dtree_cm = table(t(validation_set_clean[, 1]), Dtree_pred)
# compute the overall success rate
Dtree_success = sum(diag(Dtree_cm))/sum(Dtree_cm)*100
# compute the success rate to predict survival
Dtree_success_survived = Dtree_cm[2,2]/(Dtree_cm[2,1]+Dtree_cm[2,2])*100
# compute the success rate to predict casualty
Dtree_success_notsurvived = Dtree_cm[1,1]/(Dtree_cm[1,2]+Dtree_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
Dtree_cm
## Dtree_pred
## 0 1
## 0 95 15
## 1 25 43
The prediction success rate is 77.53%, the prediction success for survival is 63.24%, and the prediction success for non-survival is 86.36%. The model took 0.03 seconds to perform the prediction.
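The three success rates are computed the same way for every model, so they can be wrapped in one function. The sketch below is an illustration under the assumption that rows are the actual classes and columns the predicted classes; the name `cm_rates` is hypothetical, and the matrix is copied from the decision tree output above.

```r
# Hypothetical helper: success rates (in %) from a 2x2 confusion matrix
# whose rows are the actual classes and columns the predicted classes
cm_rates <- function(cm) {
  c(overall      = sum(diag(cm)) / sum(cm) * 100,
    survived     = cm[2, 2] / sum(cm[2, ]) * 100,
    not_survived = cm[1, 1] / sum(cm[1, ]) * 100)
}

# Decision tree confusion matrix, copied from the output above
Dtree_cm_example <- matrix(c(95, 25, 15, 43), nrow = 2,
                           dimnames = list(actual = c("0", "1"),
                                           predicted = c("0", "1")))
round(cm_rates(Dtree_cm_example), 2)
```

Applied to the decision tree matrix, this reproduces the 77.53%, 63.24%, and 86.36% reported above.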
library(e1071)
start.time <- Sys.time()
NBayes = naiveBayes(x = training_set_clean[-1],
y = training_set_clean$Survived)
# -1 removes the "Survived" column as it is what we plan to predict
NBayes_pred = predict(NBayes, newdata = validation_set_clean[-1], type = 'class')
end.time <- Sys.time()
time.takenNB <- end.time - start.time
# creation of confusion matrix
NBayes_cm = table(t(validation_set_clean[, 1]), NBayes_pred)
# compute the overall success rate
NBayes_success = sum(diag(NBayes_cm))/sum(NBayes_cm)*100
# compute the success rate to predict survival
NBayes_success_survived = NBayes_cm[2,2]/(NBayes_cm[2,1]+NBayes_cm[2,2])*100
# compute the success rate to predict casualty
NBayes_success_notsurvived = NBayes_cm[1,1]/(NBayes_cm[1,2]+NBayes_cm[1,1])*100
NBayes_cm
## NBayes_pred
## 0 1
## 0 97 13
## 1 23 45
The prediction success rate is 79.78%, the prediction success for survival is 66.18%, and the prediction success for non-survival is 88.18%. The model took 0.042 seconds to perform the prediction.
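The timings in this paper come from subtracting two `Sys.time()` calls. That difference is a `difftime` object whose unit is chosen automatically, so a long run could be reported in minutes. A small sketch of a wrapper that always returns seconds follows; the name `time_secs` is an assumption, and `Sys.sleep()` merely stands in for a model fit.

```r
# Hypothetical helper: run an expression and return the elapsed time in seconds
time_secs <- function(expr) {
  start <- Sys.time()
  force(expr)  # the expression is evaluated here, after the clock has started
  as.numeric(difftime(Sys.time(), start, units = "secs"))
}

elapsed <- time_secs(Sys.sleep(0.05))  # stand-in for training and prediction
elapsed
```

Forcing the unit keeps the timings comparable when they are later collected into one table.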
start.time <- Sys.time()
LRegression = glm(formula = Survived ~ .,
family = binomial,
data = training_set_clean)
LRegression_pred = predict(LRegression, type = 'response', validation_set_clean[-1])
LRegression_pred = ifelse(LRegression_pred > 0.5, 1, 0)
end.time <- Sys.time()
time.takenLR <- end.time - start.time
LRegression
##
## Call: glm(formula = Survived ~ ., family = binomial, data = training_set_clean)
##
## Coefficients:
## (Intercept) Pclass2 Pclass3 Sexmale Age
## 3.08676 -0.90927 -2.11209 -2.74126 -0.49880
## SibSp Parch Fare EmbarkedQ EmbarkedS
## -0.27739 -0.05257 -0.01061 -0.01883 -0.59336
##
## Degrees of Freedom: 712 Total (i.e. Null); 703 Residual
## Null Deviance: 949.9
## Residual Deviance: 629.3 AIC: 649.3
The most negative coefficient is “Sexmale”, which means, as the decision tree showed earlier, that being a man on the Titanic was the factor that most reduced the chances of survival. The next two most negative coefficients are “Pclass3” and “Pclass2”, which means that passengers had a much higher chance of survival in first class.
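Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the coefficients from the output above into a vector so that it runs on its own; with the fitted model, `exp(coef(LRegression))` should give the same values.

```r
# Coefficients copied from the glm output above (intercept left out)
coefs <- c(Pclass2 = -0.90927, Pclass3 = -2.11209, Sexmale = -2.74126,
           Age = -0.49880, SibSp = -0.27739, Parch = -0.05257,
           Fare = -0.01061, EmbarkedQ = -0.01883, EmbarkedS = -0.59336)

# Odds ratio: multiplicative change in the odds of survival
# for a one-unit increase (or versus the reference level for factors)
odds_ratios <- exp(coefs)
round(sort(odds_ratios), 3)
```

An odds ratio of roughly 0.06 for “Sexmale” means that, with the other variables held fixed, the odds of survival for a man were about 6% of those for a woman.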
# creation of confusion matrix
LRegression_cm = table(t(validation_set_clean[, 1]), LRegression_pred)
# compute the overall success rate
LRegression_success = sum(diag(LRegression_cm))/sum(LRegression_cm)*100
# compute the success rate to predict survival
LRegression_success_survived = LRegression_cm[2,2]/(LRegression_cm[2,1]+LRegression_cm[2,2])*100
# compute the success rate to predict casualty
LRegression_success_notsurvived = LRegression_cm[1,1]/(LRegression_cm[1,2]+LRegression_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
LRegression_cm
## LRegression_pred
## 0 1
## 0 91 19
## 1 21 47
The prediction success rate is 77.53%, the prediction success for survival is 69.12%, and the prediction success for non-survival is 82.73%. The model took 0.02 seconds to perform the prediction.
library(e1071)
start.time <- Sys.time()
SVM = svm(formula = Survived ~ .,
data = training_set_clean,
type = 'C-classification',
kernel = 'linear')
SVM_pred = predict(SVM, newdata = validation_set_clean[-1])
end.time <- Sys.time()
time.takenSVM <- end.time - start.time
# creation of confusion matrix
SVM_cm = table(t(validation_set_clean[, 1]), SVM_pred)
# compute the overall success rate
SVM_success = sum(diag(SVM_cm))/sum(SVM_cm)*100
# compute the success rate to predict survival
SVM_success_survived = SVM_cm[2,2]/(SVM_cm[2,1]+SVM_cm[2,2])*100
# compute the success rate to predict casualty
SVM_success_notsurvived = SVM_cm[1,1]/(SVM_cm[1,2]+SVM_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
SVM_cm
## SVM_pred
## 0 1
## 0 91 19
## 1 23 45
The prediction success rate is 76.4%, the prediction success for survival is 66.18%, and the prediction success for non-survival is 82.73%. The model took 0.1 seconds to perform the prediction.
The model is very similar to the SVM: only the “kernel” argument has to be changed from linear to radial.
library(e1071)
start.time <- Sys.time()
KSVM = svm(formula = Survived ~ .,
data = training_set_clean,
type = 'C-classification',
kernel = 'radial')# linear for SVM , radial for kernel
KSVM_pred = predict(KSVM, newdata = validation_set_clean[-1])
end.time <- Sys.time()
time.takenKSVM <- end.time - start.time
# creation of confusion matrix
KSVM_cm = table(t(validation_set_clean[, 1]), KSVM_pred)
# compute the overall success rate
KSVM_success = sum(diag(KSVM_cm))/sum(KSVM_cm)*100
# compute the success rate to predict survival
KSVM_success_survived = KSVM_cm[2,2]/(KSVM_cm[2,1]+KSVM_cm[2,2])*100
# compute the success rate to predict casualty
KSVM_success_notsurvived = KSVM_cm[1,1]/(KSVM_cm[1,2]+KSVM_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
KSVM_cm
## KSVM_pred
## 0 1
## 0 96 14
## 1 17 51
The prediction success rate is 82.58%, the prediction success for survival is 75%, and the prediction success for non-survival is 87.27%. The model took 0.22 seconds to perform the prediction.
library(randomForest)
start.time <- Sys.time()
set.seed(123)
RForest = randomForest(x = training_set_clean[-1],
y = training_set_clean$Survived,
ntree = 500) # number of trees you want in the forest
# Predicting the validation set results
RForest_pred = predict(RForest, newdata = validation_set_clean)
end.time <- Sys.time()
time.takenRF <- end.time - start.time
RForest
##
## Call:
## randomForest(x = training_set_clean[-1], y = training_set_clean$Survived, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.97%
## Confusion matrix:
## 0 1 class.error
## 0 401 38 0.08656036
## 1 83 191 0.30291971
The model details above show an out-of-bag (OOB) error estimate of 16.97% and that the model is much better at predicting casualties (8.7% class error) than survival (30.3% class error).
# creation of confusion matrix
RF_cm = table(t(validation_set_clean[, 1]), RForest_pred)
# compute the overall success rate
RF_success = sum(RF_cm[1,1]+RF_cm[2,2])/sum(RF_cm)*100
# compute the success rate to predict survival
RF_success_survived = RF_cm[2,2]/(RF_cm[2,1]+RF_cm[2,2])*100
# compute the success rate to predict casualty
RF_success_notsurvived = RF_cm[1,1]/(RF_cm[1,2]+RF_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
RF_cm
## RForest_pred
## 0 1
## 0 98 12
## 1 22 46
The prediction success rate is 80.9%, the prediction success for survival is 67.65%, and the prediction success for non-survival is 89.09%. The model took 0.38 seconds to perform the prediction.
To run the KNN model, the text categories are first recoded as integers and then set as factors.
#create new dataframe for knn
training_set_cleanKNN <- as.data.frame(training_set_clean)
validation_set_cleanKNN <- as.data.frame(validation_set_clean)
# change characters into integers
training_set_cleanKNN$Sex <- sapply(as.character(training_set_cleanKNN$Sex), switch, 'male' = 0, 'female' = 1)
validation_set_cleanKNN$Sex <- sapply(as.character(validation_set_cleanKNN$Sex), switch, 'male' = 0, 'female' = 1)
training_set_cleanKNN$Embarked <- sapply(as.character(training_set_cleanKNN$Embarked), switch, 'C' = 0, 'Q' = 1, 'S' = 2)
validation_set_cleanKNN$Embarked <- sapply(as.character(validation_set_cleanKNN$Embarked), switch, 'C' = 0, 'Q' = 1, 'S' = 2)
# set integers as factors
training_set_cleanKNN$Sex <- as.factor(training_set_cleanKNN$Sex)
validation_set_cleanKNN$Sex <- as.factor(validation_set_cleanKNN$Sex)
training_set_cleanKNN$Embarked <- as.factor(training_set_cleanKNN$Embarked)
validation_set_cleanKNN$Embarked <- as.factor(validation_set_cleanKNN$Embarked)
The elbow method compares the within-cluster variance against the number of clusters. Too few clusters leave too much variance, while too many clusters lead to overfitting. The approach is to choose the number of clusters beyond which adding another cluster no longer reduces the variance by much. This is the elbow of the chart below.
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(training_set_cleanKNN, i)$withinss)
plot(1:10,
wcss,
type = 'b',
main = paste('The Elbow Method'),
xlab = 'Number of clusters',
ylab = 'WCSS')
The elbow method is not straightforward in this situation, as there is no clear elbow. However, the slope tends to flatten after 5 clusters, so 5 is used as the value of k below.
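The elbow chart describes k-means clusters, whereas the k in KNN is a number of neighbours. A different and common way to pick it, shown here only as a hedged alternative and not as the method used in this paper, is to compare validation accuracy across candidate values of k. The data below are synthetic stand-ins (an assumption), so the resulting k says nothing about the Titanic data.

```r
library(class)  # ships with R as a recommended package

set.seed(6)
# Synthetic two-class data standing in for the Titanic sets (assumption)
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + x$b + rnorm(n, sd = 0.5) > 0, 1, 0))
train_idx <- 1:150
valid_idx <- 151:200

# Validation accuracy (in %) for each candidate k
acc_by_k <- sapply(1:10, function(k) {
  pred <- knn(train = x[train_idx, ], test = x[valid_idx, ],
              cl = y[train_idx], k = k)
  mean(pred == y[valid_idx]) * 100
})
best_k <- which.max(acc_by_k)  # first k reaching the highest accuracy
```

With the real data, `training_set_cleanKNN[, -1]` and `validation_set_cleanKNN[, -1]` would take the place of the synthetic frames.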
# Fitting K-NN to the training set and predicting the validation set results
library(class)
start.time <- Sys.time()
KNN_pred = knn(train = training_set_cleanKNN[, -1],
test = validation_set_cleanKNN[, -1],
cl = training_set_cleanKNN[, 1],
k = 5,
prob = TRUE)
end.time <- Sys.time()
time.takenKNN <- end.time - start.time
# create confusion matrix
KNN_cm = table(t(validation_set_clean[, 1]), KNN_pred)
# compute the overall success rate
KNN_success = sum(KNN_cm[1,1]+KNN_cm[2,2])/sum(KNN_cm)*100
# compute the success rate to predict survival
KNN_success_survived = KNN_cm[2,2]/(KNN_cm[2,1]+KNN_cm[2,2])*100
# compute the success rate to predict casualty
KNN_success_notsurvived = KNN_cm[1,1]/(KNN_cm[1,2]+KNN_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
KNN_cm
## KNN_pred
## 0 1
## 0 95 15
## 1 21 47
The prediction success rate is 79.78%, the prediction success for survival is 69.12%, and the prediction success for non-survival is 86.36%. The model took 0.01 seconds to perform the prediction.
Kernel PCA reduces the number of features by combining them into a smaller set of components (two here), which are then fed to a logistic regression.
library(kernlab)
start.time <- Sys.time()
kpca = kpca(~., data = training_set_clean[-1], kernel = 'rbfdot', features = 2) # kernel trick (RBF kernel), keeping two features
training_set_pca = as.data.frame(predict(kpca, training_set_clean)) # project the training set onto the 2 features
training_set_pca$Survived = training_set_clean$Survived # add the Survived column
validation_set_clean_pca = as.data.frame(predict(kpca, validation_set_clean)) # project the validation set onto the 2 features
validation_set_clean_pca$Survived = validation_set_clean$Survived # add the Survived column
# Fitting Logistic Regression to the Training set
kpca = glm(formula = Survived ~ .,
family = binomial,
data = training_set_pca)
# Predicting the validation set results
prob_pred = predict(kpca, type = 'response', newdata = validation_set_clean_pca)
kpca_pred = ifelse(prob_pred > 0.5, 1, 0)
end.time <- Sys.time()
time.takenkpca <- end.time - start.time
# create confusion matrix
kpca_cm = table(t(validation_set_clean[, 1]), kpca_pred)
# compute the overall success rate
kpca_success = sum(kpca_cm[1,1]+kpca_cm[2,2])/sum(kpca_cm)*100
# compute the success rate to predict survival
kpca_success_survived = kpca_cm[2,2]/(kpca_cm[2,1]+kpca_cm[2,2])*100
# compute the success rate to predict casualty
kpca_success_notsurvived = kpca_cm[1,1]/(kpca_cm[1,2]+kpca_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
kpca_cm
## kpca_pred
## 0 1
## 0 97 13
## 1 47 21
The prediction success rate is 66.29%, the prediction success for survival is 30.88%, and the prediction success for non-survival is 88.18%. The model took 1.35 seconds to perform the prediction.
To make the results reproducible, we had to run on a single thread, which makes the model much slower than with h2o.init(nthreads = -1), which uses all the CPUs available on the host.
library(h2o)
start.time <- Sys.time()
h2o.init(nthreads = 1) # a single thread keeps results reproducible; nthreads = -1 would use all CPUs on the host
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\Charly\AppData\Local\Temp\Rtmpy4feEq/h2o_Charly_started_from_r.out
## C:\Users\Charly\AppData\Local\Temp\Rtmpy4feEq/h2o_Charly_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 314 milliseconds
## H2O cluster version: 3.16.0.2
## H2O cluster version age: 1 month and 16 days
## H2O cluster name: H2O_started_from_R_Charly_tky049
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.75 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 1
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.2 (2017-09-28)
ann = h2o.deeplearning(y = 'Survived',
training_frame = as.h2o(training_set_clean),
activation = 'Rectifier',
hidden = c(5,5), # number of hidden layers and hidden neurons
epochs = 500,# How many times the dataset should be iterated
train_samples_per_iteration = -2,
reproducible = TRUE,
seed = 123)
# Predicting the validation set results
ann_pred = h2o.predict(ann, newdata = as.h2o(validation_set_clean[-1]))
#ann_pred = (ann_pred > 0.5)
ann_pred = as.vector(ann_pred$predict)
end.time <- Sys.time()
time.takenann <- end.time - start.time
#h2o.shutdown()
The confusion matrix in the model details shows a prediction error rate of 0.16 on the training set, and thus a success rate of 84%.
ann
## Model Details:
## ==============
##
## H2OBinomialModel: deeplearning
## Model ID: DeepLearning_model_R_1516092219966_1
## Status of Neuron Layers: predicting Survived, 2-class classification, bernoulli distribution, CrossEntropy loss, 122 weights/biases, 5.6 KB, 74,865 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 15 Input 0.00 %
## 2 2 5 Rectifier 0.00 % 0.000000 0.000000 0.201471 0.403857
## 3 3 5 Rectifier 0.00 % 0.000000 0.000000 0.001013 0.001453
## 4 4 2 Softmax 0.000000 0.000000 0.002048 0.001156
## momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000 0.078170 0.309146 0.555095 0.109815
## 3 0.000000 -0.151936 0.524780 0.936496 0.092960
## 4 0.000000 0.140691 2.892392 -0.000000 0.377377
##
##
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 0.1240915
## RMSE: 0.3522663
## LogLoss: 0.4031934
## Mean Per-Class Error: 0.1744218
## AUC: 0.8719884
## Gini: 0.7439769
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 390 49 0.111617 =49/439
## 1 65 209 0.237226 =65/274
## Totals 455 258 0.159888 =114/713
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.392579 0.785714 196
## 2 max f2 0.143356 0.809979 285
## 3 max f0point5 0.565507 0.836347 149
## 4 max accuracy 0.565507 0.842917 149
## 5 max precision 0.996546 1.000000 0
## 6 max recall 0.066585 1.000000 395
## 7 max specificity 0.996546 1.000000 0
## 8 max absolute_mcc 0.565507 0.666457 149
## 9 max min_per_class_accuracy 0.284162 0.810219 233
## 10 max mean_per_class_accuracy 0.392579 0.825578 196
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
# creation of confusion matrix
ann_cm = table(t(validation_set_clean[, 1]), ann_pred)
# compute the overall success rate
ann_success = sum(ann_cm[1,1]+ann_cm[2,2])/sum(ann_cm)*100
# compute the success rate to predict survival
ann_success_survived = ann_cm[2,2]/(ann_cm[2,1]+ann_cm[2,2])*100
# compute the success rate to predict casualty
ann_success_notsurvived = ann_cm[1,1]/(ann_cm[1,2]+ann_cm[1,1])*100
# rows are the actual classes, columns the predicted classes (68 survivors in the validation set)
ann_cm
## ann_pred
## 0 1
## 0 92 18
## 1 17 51
The prediction success rate is 80.34%, the prediction success for survival is 75%, and the prediction success for non-survival is 83.64%. The model took 8.9 seconds to perform the prediction.
XGBoost expects numeric input, so the factors must be converted to numeric.
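One detail of this conversion is easy to miss: `as.numeric()` on a factor returns the internal level codes, which start at 1, not the original labels. The short illustration below uses a toy factor (an assumption) to show why a 0/1 outcome comes out as 1/2 and why 1 has to be subtracted to recover the 0/1 label.

```r
# Toy factor with the same levels as the "Survived" column
f <- factor(c("0", "1", "1", "0"))

codes  <- as.numeric(f)                # level codes: 1 2 2 1
values <- as.numeric(as.character(f))  # original labels: 0 1 1 0

codes - 1  # same as `values`, because the levels are exactly "0" and "1"
```

For factors whose levels are not consecutive integers starting at 0, going through `as.character()` is the safer route.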
#Check which variables are factors
str(training_set_cleanKNN,give.attr = FALSE)
## 'data.frame': 713 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 3 3 1 3 ...
## $ Sex : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Age : num -0.597 0.671 -0.28 0.433 0.433 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 1 0 0 ...
## $ Fare : num -0.537 0.852 -0.522 0.457 -0.52 ...
## $ Embarked: Factor w/ 3 levels "0","1","2": 3 1 3 3 3 2 3 3 3 3 ...
training_set_cleanXG <- training_set_cleanKNN
validation_set_cleanXG <- validation_set_cleanKNN
training_set_cleanXG$Survived <- as.numeric(training_set_cleanXG$Survived)
validation_set_cleanXG$Survived <- as.numeric(validation_set_cleanXG$Survived)
training_set_cleanXG$Pclass <- as.numeric(training_set_cleanXG$Pclass )
validation_set_cleanXG$Pclass <- as.numeric(validation_set_cleanXG$Pclass)
training_set_cleanXG$Sex <- as.numeric(training_set_cleanXG$Sex)
validation_set_cleanXG$Sex <- as.numeric(validation_set_cleanXG$Sex)
training_set_cleanXG$Embarked <- as.numeric(training_set_cleanXG$Embarked)
validation_set_cleanXG$Embarked <- as.numeric(validation_set_cleanXG$Embarked)
#Fitting XGBoost to the Training set
#install.packages('xgboost')
library(xgboost)
start.time <- Sys.time()
classifierxg = xgboost(data = as.matrix(training_set_cleanXG[-1]), label = training_set_cleanXG$Survived-1, nrounds = 10)
## [1] train-rmse:0.417741
## [2] train-rmse:0.365128
## [3] train-rmse:0.331008
## [4] train-rmse:0.310080
## [5] train-rmse:0.293606
## [6] train-rmse:0.282539
## [7] train-rmse:0.271436
## [8] train-rmse:0.262833
## [9] train-rmse:0.256366
## [10] train-rmse:0.250135
# Predicting the validation set results
xg_pred = predict(classifierxg, newdata = as.matrix(validation_set_cleanXG[-1]))
xg_pred = (xg_pred >= 0.5)
end.time <- Sys.time()
time.takenxg <- end.time - start.time
# creation of confusion matrix
xg_cm = table(t(validation_set_cleanXG[, 1]), xg_pred)
# compute the overall success rate
xg_success = sum(xg_cm[1,1]+xg_cm[2,2])/sum(xg_cm)*100
# compute the success rate to predict survival
xg_success_survived = xg_cm[2,2]/(xg_cm[2,1]+xg_cm[2,2])*100
# compute the success rate to predict casualty
xg_success_notsurvived = xg_cm[1,1]/(xg_cm[1,2]+xg_cm[1,1])*100
# rows are the actual classes (1 = died, 2 = survived), columns the predicted classes (68 survivors in the validation set)
xg_cm
## xg_pred
## FALSE TRUE
## 1 96 14
## 2 25 43
The prediction success rate is 78.09%, the prediction success for survival is 63.24%, and the prediction success for non-survival is 87.27%. The model took 0.24 seconds to perform the prediction.
The table below shows that the best performers are the Kernel Support Vector Machine and the Random Forest. These two models are used in the next part of the paper to submit predictions to the Kaggle website.
The worst performer was the Kernel PCA with 66.29%, which is much lower than the other models. It is also the second-slowest model.
If we exclude the worst performer (Kernel PCA) and the best performer (Kernel Support Vector Machine), there is only a 4.49 percentage point difference between the worst and best of the remaining models (76.40% for the SVM against 80.90% for the Random Forest), which suggests that the choice of model does not play a major part in performance.
Results also show that all models were better at predicting non-survival than survival. This is a common flaw: because models aim for the highest overall success rate and there were more deaths than survivals, they tend to favour the majority class at the expense of survival predictions. One way to examine this is a ROC curve, which compares the true positive rate with the false positive rate. This is not covered in this paper, however, as Kaggle scores the prediction on the two outcomes combined.
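Although the ROC analysis is left out of this paper, the idea can be sketched in a few lines of base R. The example below computes the true positive and false positive rates at several probability thresholds. The function name `roc_points` and the six toy probabilities are assumptions for illustration, not results from the models above.

```r
# Hypothetical helper: true/false positive rates at a grid of thresholds
roc_points <- function(prob, actual, thresholds = seq(0, 1, by = 0.1)) {
  t(sapply(thresholds, function(th) {
    pred <- as.integer(prob >= th)  # predict survival at or above the threshold
    c(threshold = th,
      tpr = sum(pred == 1 & actual == 1) / sum(actual == 1),
      fpr = sum(pred == 1 & actual == 0) / sum(actual == 0))
  }))
}

# Toy predicted probabilities and actual outcomes (assumption)
prob   <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
actual <- c(1, 1, 1, 0, 0, 0)
roc <- roc_points(prob, actual)
roc
```

Lowering the threshold below 0.5 catches more survivors (higher true positive rate) at the cost of more false alarms, which is the trade-off a ROC curve makes visible.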
The Artificial Neural Network was also by far the slowest model to compute.
Model | Overall prediction success (%) | Survival prediction success (%) | Death prediction success (%) | Time (in sec) |
---|---|---|---|---|
Decision Tree | 77.52809 | 63.23529 | 86.36364 | 0.0266991 |
Naive Bayes | 79.77528 | 66.17647 | 88.18182 | 0.0416100 |
Logistic Regression | 77.52809 | 69.11765 | 82.72727 | 0.0172069 |
Support Vector Machine(SVM) | 76.40449 | 66.17647 | 82.72727 | 0.0997958 |
Kernel Support Vector Machine | 82.58427 | 75.00000 | 87.27273 | 0.2212310 |
Random Forest | 80.89888 | 67.64706 | 89.09091 | 0.3845222 |
K nearest neighbours(KNN) | 79.77528 | 69.11765 | 86.36364 | 0.0140412 |
Kernel PCA | 66.29213 | 30.88235 | 88.18182 | 1.3543029 |
Artificial Neural Network | 80.33708 | 75.00000 | 83.63636 | 8.9003930 |
XGBoost | 78.08989 | 63.23529 | 87.27273 | 0.2407510 |
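The comparison made from the table above can be checked with a few lines of R. This sketch copies the overall success rates from the table into a named vector (the shortened labels are my own), ranks the models, and measures the gap that remains once the best and worst performers are set aside.

```r
# Overall prediction success (%), copied from the table above
overall <- c("Decision Tree" = 77.52809, "Naive Bayes" = 79.77528,
             "Logistic Regression" = 77.52809, "SVM" = 76.40449,
             "Kernel SVM" = 82.58427, "Random Forest" = 80.89888,
             "KNN" = 79.77528, "Kernel PCA" = 66.29213,
             "ANN" = 80.33708, "XGBoost" = 78.08989)

ranked <- sort(overall, decreasing = TRUE)   # best model first
middle <- ranked[-c(1, length(ranked))]      # drop the best and the worst
spread <- max(middle) - min(middle)          # gap in percentage points

names(ranked)[1:2]
round(spread, 2)
```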
library(randomForest)
set.seed(123)
RForestfinal = randomForest(x = final_training_set_clean[-1],
y = final_training_set_clean$Survived,
ntree = 500) # number of trees you want in the forest
# Predicting the Test set results
RForest_predfinal = predict(RForestfinal, newdata = test_clean)
#Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test_clean$PassengerId, Survived = RForest_predfinal)
write.csv(submit, file = "RFpredict.csv", row.names = FALSE)
This submission did not do better than an earlier submission I made with the Kernel Support Vector Machine.
library(e1071)
start.time <- Sys.time()
KSVM = svm(formula = Survived ~ .,
data = final_training_set_clean,
type = 'C-classification',
kernel = 'radial')# linear for SVM , radial for kernel
KSVM_predfinal = predict(KSVM, newdata = test_clean)
#Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test_clean$PassengerId, Survived = KSVM_predfinal)
write.csv(submit, file = "KSVM.csv", row.names = FALSE)
Finally, we use all models to predict which passengers were more likely to survive.
The results did not improve, which suggests that the models have reached their limits and that more work on data preparation is needed to improve predictions.
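The paper does not show how the models were combined. One simple possibility, given here purely as an assumption, is a majority vote over the 0/1 predictions of each model. The function name `majority_vote` and the three toy prediction vectors are hypothetical.

```r
# Hypothetical helper: majority vote across several 0/1 prediction vectors
majority_vote <- function(...) {
  preds <- cbind(...)                # one column per model
  as.integer(rowMeans(preds) > 0.5)  # ties (even number of models) go to 0
}

# Toy predictions from three models for four passengers (assumption)
p1 <- c(1, 0, 1, 0)
p2 <- c(1, 1, 0, 0)
p3 <- c(0, 1, 1, 0)
vote <- majority_vote(p1, p2, p3)
vote
```

Predictions stored as factors, as most models in this paper return them, would first need converting with `as.numeric(as.character(...))`.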