The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
This project investigates the likelihood of survival. This is done by employing Machine Learning tools to predict which passengers survived the tragedy. The dataset was supplied by “Kaggle”, an online platform for predictive modelling and analytics competitions.
Each record in the dataset describes a passenger. The attributes are defined as follows:
1. PassengerId: Unique passenger identification number
2. Survived: Did a passenger survive or not (0 = died, 1 = survived)
3. Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
4. Name: Name of Passenger
5. Sex: Sex of Passenger
6. Age: Passenger Age
7. SibSp: Number of Siblings/Spouses Aboard
8. Parch: Number of Parents/Children Aboard
9. Ticket: Ticket Number
10. Fare: Passenger Fare
11. Cabin: Cabin
12. Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Lets start off with loading the required packages.
# load packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
The training and test dataset are provided by Kaggle, so we’ll load them into R for predictive modeling.
# load the CSV file from the local directory
datasetTrain <- read.csv("data/train.csv", header=TRUE, sep = ",")
datasetTest <- read.csv("data/test.csv", header=TRUE, sep = ",")
Let’s better understand our data, by performing some descriprtive statistics and visualizations.
# dimensions of dataset
dim(datasetTrain)
## [1] 891 12
# atrtribute class type
sapply(datasetTrain, class)
## PassengerId Survived Pclass Name Sex Age
## "integer" "integer" "integer" "factor" "factor" "numeric"
## SibSp Parch Ticket Fare Cabin Embarked
## "integer" "integer" "factor" "numeric" "factor" "factor"
# peek
head(datasetTrain)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
From the sample values shown in the dataset, we can see that PassengerId, Name, Ticket, Cabin and Embarked will not be useful in our analysis, so we can probably remove it.
# remove redundant variables
datasetTrain <- datasetTrain[,c(-1, -4, -9, -11, -12)]
# summary
summary(datasetTrain)
## Survived Pclass Sex Age
## Min. :0.0000 Min. :1.000 female:314 Min. : 0.42
## 1st Qu.:0.0000 1st Qu.:2.000 male :577 1st Qu.:20.12
## Median :0.0000 Median :3.000 Median :28.00
## Mean :0.3838 Mean :2.309 Mean :29.70
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:38.00
## Max. :1.0000 Max. :3.000 Max. :80.00
## NA's :177
## SibSp Parch Fare
## Min. :0.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91
## Median :0.000 Median :0.0000 Median : 14.45
## Mean :0.523 Mean :0.3816 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 Max. :512.33
##
Also note that sex is a factor variable, and we may have some missing data (NA). We may need to remove the records (or impute values) with NA values for some analysis and modeling techiques.
Next, we need to convert sex to a numeric variable and ‘survived’ (our output variable) to a factor.
# convert survived to factor
datasetTrain[,1] <- as.factor((datasetTrain[,1]))
Let’s take a closer look at the class distributions
# class distribution
cbind(freq=table(datasetTrain$Survived), percentage=prop.table(table(datasetTrain$Survived))*100)
## freq percentage
## 0 549 61.61616
## 1 342 38.38384
There is some imbalance in the Class values. We observe an approximate 60% to 40% split for Died (= 0) and Survived (= 1) in the class values.
Finally, lets look at the correlation between the attributes. We have to exclude the rows with NA values when calculating the correlations.
data <- datasetTrain
data[,3] <- as.numeric((data[,3]))
complete_cases <- complete.cases(data)
cor(data[complete_cases,2:5])
## Pclass Sex Age SibSp
## Pclass 1.00000000 0.15546030 -0.36922602 0.06724737
## Sex 0.15546030 1.00000000 0.09325358 -0.10394968
## Age -0.36922602 0.09325358 1.00000000 -0.30824676
## SibSp 0.06724737 -0.10394968 -0.30824676 1.00000000
Values above 0.75 or below -0.75 are indicative of high positive or high negative correlation. From the above results, variables are not highly correlated in this dataset.
# barplot of males and females who survived
barplot(table(datasetTrain$Survived, datasetTrain[,3]))
legend("topleft", legend = c("Died", "Survived"), fill=c("black","grey"))
We’ll try some linear and non-linear algorithms:
* Linear Algorithms: Logistic Regression (LG), Linear Discriminate Analysis (LDA) and Regularized Logistic Regression (GLMNET).
* Non-Linear Algorithms: k-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB) and Support Vector Machines with Radial Basis Functions (SVM).
Lets define a test harness.
We will use 10-fold cross validation with 3 repeats. This is a good standard test harness configuration. It is a binary classification problem. For simplicity, we will use Accuracy and Kappa metrics.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(Survived~., data=datasetTrain, method="glm", metric=metric, trControl=trainControl)
# LDA
set.seed(7)
fit.lda <- train(Survived~., data=datasetTrain, method="lda", metric=metric, trControl=trainControl)
## Loading required package: MASS
# GLMNET
set.seed(7)
fit.glmnet <- train(Survived~., data=datasetTrain, method="glmnet", metric=metric,
trControl=trainControl)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-3
# KNN
set.seed(7)
fit.knn <- train(Survived~., data=datasetTrain, method="knn", metric=metric, trControl=trainControl)
# CART
set.seed(7)
fit.cart <- train(Survived~., data=datasetTrain, method="rpart", metric=metric,
trControl=trainControl)
## Loading required package: rpart
# Naive Bayes
set.seed(7)
fit.nb <- train(Survived~., data=datasetTrain, method="nb", metric=metric, trControl=trainControl)
## Loading required package: klaR
# SVM
set.seed(7)
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric,
trControl=trainControl)
## Loading required package: kernlab
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn,
CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.6901 0.7614 0.8028 0.7955 0.8421 0.8732 0
## LDA 0.6806 0.7500 0.7833 0.7853 0.8327 0.8611 0
## GLMNET 0.6901 0.7641 0.8028 0.7969 0.8451 0.8732 0
## KNN 0.6197 0.6620 0.7042 0.6956 0.7289 0.7917 0
## CART 0.6338 0.7474 0.7833 0.7703 0.8028 0.8472 0
## NB 0.6901 0.7359 0.7762 0.7763 0.8056 0.8611 0
## SVM 0.7222 0.7895 0.8252 0.8230 0.8606 0.8889 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.3350 0.4863 0.5761 0.5715 0.6665 0.7363 0
## LDA 0.3088 0.4750 0.5471 0.5492 0.6505 0.7080 0
## GLMNET 0.3350 0.4931 0.5761 0.5741 0.6703 0.7363 0
## KNN 0.1886 0.2852 0.3635 0.3555 0.4180 0.5694 0
## CART 0.2257 0.4311 0.5188 0.5026 0.5857 0.6615 0
## NB 0.3350 0.4378 0.5359 0.5328 0.6111 0.7113 0
## SVM 0.3955 0.5541 0.6278 0.6218 0.6986 0.7610 0
dotplot(results)
SVM had the highest accuracy with with 82%.
We would apply a Box-Cox transform to flatten out the distribution.
# Compare algorithms
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(Survived~., data=datasetTrain, method="glm", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
# LDA
set.seed(7)
fit.lda <- train(Survived~., data=datasetTrain, method="lda", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
# GLMNET
set.seed(7)
fit.glmnet <- train(Survived~., data=datasetTrain, method="glmnet", metric=metric,
preProc=c("BoxCox"), trControl=trainControl)
# KNN
set.seed(7)
fit.knn <- train(Survived~., data=datasetTrain, method="knn", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
# CART
set.seed(7)
fit.cart <- train(Survived~., data=datasetTrain, method="rpart", metric=metric,
preProc=c("BoxCox"), trControl=trainControl)
# Naive Bayes
set.seed(7)
fit.nb <- train(Survived~., data=datasetTrain, method="nb", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
# SVM
set.seed(7)
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric,
preProc=c("BoxCox"), trControl=trainControl)
# Compare algorithms
transformResults <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn,
CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(transformResults)
##
## Call:
## summary.resamples(object = transformResults)
##
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.6901 0.7641 0.8169 0.7997 0.8451 0.8732 0
## LDA 0.6806 0.7606 0.7902 0.7876 0.8281 0.8611 0
## GLMNET 0.6901 0.7641 0.8169 0.7997 0.8451 0.8732 0
## KNN 0.6111 0.6620 0.6901 0.6941 0.7324 0.7917 0
## CART 0.6338 0.7474 0.7833 0.7703 0.8028 0.8472 0
## NB 0.6901 0.7465 0.7778 0.7768 0.8056 0.8750 0
## SVM 0.7465 0.7945 0.8322 0.8300 0.8702 0.8889 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.3350 0.5016 0.6105 0.5794 0.6703 0.7363 0
## LDA 0.3007 0.4923 0.5625 0.5526 0.6369 0.7080 0
## GLMNET 0.3350 0.4974 0.6105 0.5793 0.6703 0.7363 0
## KNN 0.1537 0.2793 0.3509 0.3544 0.4345 0.5645 0
## CART 0.2257 0.4311 0.5188 0.5026 0.5857 0.6615 0
## NB 0.3350 0.4535 0.5431 0.5333 0.6014 0.7387 0
## SVM 0.4458 0.5642 0.6387 0.6360 0.7220 0.7637 0
dotplot(transformResults)
The accuracy of SVM slightly increased to 83%. Still not good enough.
Let’s try some tuning of the top algorithms, specifically SVM and see if we can lift the accuracy.
The SVM implementation has two parameters that we can tune with caret package. The sigma which is a smoothing term, and C which is a cost constraint.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1))
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric, tuneGrid=grid,
preProc=c("BoxCox"), trControl=trainControl)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 891 samples
## 6 predictor
## 2 classes: '0', '1'
##
## Pre-processing: Box-Cox transformation (2)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 642, 643, 643, 642, 643, 643, ...
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa Accuracy SD Kappa SD
## 0.025 1 0.7940402 0.5625769 0.04995250 0.10890772
## 0.025 2 0.8010629 0.5774007 0.04541514 0.10043513
## 0.025 3 0.8066706 0.5893258 0.04670475 0.10293467
## 0.025 4 0.8127413 0.6027869 0.04657574 0.10211953
## 0.025 5 0.8145996 0.6070040 0.04575838 0.09991166
## 0.025 6 0.8169536 0.6126585 0.04151477 0.09042901
## 0.025 7 0.8178860 0.6143981 0.04233197 0.09193154
## 0.025 8 0.8169666 0.6123880 0.04065050 0.08813937
## 0.025 9 0.8183685 0.6153956 0.03941643 0.08490302
## 0.025 10 0.8169666 0.6120991 0.03932924 0.08467801
## 0.050 1 0.8085355 0.5932273 0.04583082 0.10146726
## 0.050 2 0.8160211 0.6104235 0.04227677 0.09137473
## 0.050 3 0.8183685 0.6149608 0.04074875 0.08713463
## 0.050 4 0.8216419 0.6207747 0.04224858 0.09079218
## 0.050 5 0.8249087 0.6267092 0.04416730 0.09536913
## 0.050 6 0.8267932 0.6300088 0.04280513 0.09275737
## 0.050 7 0.8272561 0.6309847 0.04358552 0.09404835
## 0.050 8 0.8235068 0.6226572 0.04145917 0.08919894
## 0.050 9 0.8230373 0.6214722 0.04226837 0.09057956
## 0.050 10 0.8225678 0.6203849 0.04160268 0.08930255
## 0.100 1 0.8197835 0.6181426 0.04056333 0.08664961
## 0.100 2 0.8249152 0.6258870 0.04143611 0.08991476
## 0.100 3 0.8249087 0.6243778 0.04181122 0.09049287
## 0.100 4 0.8258412 0.6261855 0.04033479 0.08753920
## 0.100 5 0.8258477 0.6264062 0.04063415 0.08783926
## 0.100 6 0.8230373 0.6204421 0.04178908 0.08990879
## 0.100 7 0.8220983 0.6185721 0.04042532 0.08705429
## 0.100 8 0.8239502 0.6231632 0.04280879 0.09191613
## 0.100 9 0.8235003 0.6223639 0.04379623 0.09437874
## 0.100 10 0.8220918 0.6192665 0.04574102 0.09853571
## 0.150 1 0.8249152 0.6257358 0.04113215 0.08920021
## 0.150 2 0.8267801 0.6283858 0.04207758 0.09117273
## 0.150 3 0.8267540 0.6293762 0.04480330 0.09661076
## 0.150 4 0.8248891 0.6257498 0.04699275 0.10157985
## 0.150 5 0.8207225 0.6171830 0.04811438 0.10401942
## 0.150 6 0.8193271 0.6143938 0.04680067 0.10127745
## 0.150 7 0.8169992 0.6098799 0.04870386 0.10474778
## 0.150 8 0.8151213 0.6061833 0.05017747 0.10784505
## 0.150 9 0.8137194 0.6034476 0.05147910 0.11035277
## 0.150 10 0.8127869 0.6019100 0.05227623 0.11196409
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05 and C = 7.
plot(fit.svm)
Let’s look at 4 ensemble methods:
We will use the same test harness as before including the Box-Cox transform that flattens out the distributions.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Bagged CART
set.seed(7)
fit.treebag <- train(Survived~., data=datasetTrain, method="treebag", metric=metric,
trControl=trainControl)
## Loading required package: ipred
## Loading required package: plyr
## Loading required package: e1071
# Random Forest
set.seed(7)
fit.rf <- train(Survived~., data=datasetTrain, method="rf", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
# Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(Survived~., data=datasetTrain, method="gbm", metric=metric, preProc=c("BoxCox"),
trControl=trainControl, verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1.1
# C5.0
set.seed(7)
fit.c50 <- train(Survived~., data=datasetTrain, method="C5.0", metric=metric, preProc=c("BoxCox"),
trControl=trainControl)
## Loading required package: C50
# Compare results
ensembleResults <- resamples(list(BAG=fit.treebag, RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensembleResults)
##
## Call:
## summary.resamples(object = ensembleResults)
##
## Models: BAG, RF, GBM, C50
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BAG 0.6806 0.7746 0.8099 0.7998 0.8327 0.9296 0
## RF 0.7361 0.7887 0.8310 0.8188 0.8451 0.9014 0
## GBM 0.7361 0.7945 0.8252 0.8253 0.8732 0.9167 0
## C50 0.7361 0.7782 0.8252 0.8198 0.8592 0.8873 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BAG 0.3323 0.5286 0.6007 0.5802 0.6463 0.8550 0
## RF 0.4458 0.5420 0.6401 0.6152 0.6706 0.7971 0
## GBM 0.4223 0.5703 0.6299 0.6315 0.7287 0.8248 0
## C50 0.4223 0.5265 0.6286 0.6171 0.7084 0.7693 0
dotplot(ensembleResults)
Lets finalize the SVM model to use on our entire training set.
We will have to remove the rows with missing values from the training dataset as well as the validation dataset. The code below shows the preparation of the pre-processing parameters using the training dataset.
# prepare parameters for data transform
# set.seed(7)
model <- svm(Survived ~ ., data = datasetTrain)
testData <- datasetTest[,c(-1, -8, -10, -11)]
preprocessParams <- preProcess(testData, method=c("BoxCox"))
testData$Age[is.na(testData$Age)] <- 0
testData$Fare[is.na(testData$Fare)] <- 0
testData <- predict(preprocessParams, testData)
## Age
predictions <- predict(model, testData, type="class")
submit <- data.frame(PassengerId = datasetTest$PassengerId, Survived = predictions)
write.csv(submit, file = "firstSVM.csv", row.names = FALSE)