1. Introduction
2. Problem Definiton
3. Prepare Problem
4. Analyze Data
- Descriptive statistics
- Data visualizations
5. Evaluate Algorithms
6. Improve Accuracy
- Algorithm Tuning
- Ensembles
7. Finalize Model

1. Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

2. Problem Definiton

This project investigates the likelihood of survival. This is done by employing Machine Learning tools to predict which passengers survived the tragedy. The dataset was supplied by “Kaggle”, an online platform for predictive modelling and analytics competitions.

Each record in the dataset describes a passenger. The attributes are defined as follows:
1. PassengerId: Unique passenger identification number
2. Survived: Did a passenger survive or not (0 = died, 1 = survived)
3. Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
4. Name: Name of Passenger
5. Sex: Sex of Passenger
6. Age: Passenger Age
7. SibSp: Number of Siblings/Spouses Aboard
8. Parch: Number of Parents/Children Aboard
9. Ticket: Ticket Number
10. Fare: Passenger Fare
11. Cabin: Cabin
12. Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

3. Prepare Problem

Lets start off with loading the required packages.

# load packages
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

library(corrplot)

The training and test dataset are provided by Kaggle, so we’ll load them into R for predictive modeling.

# load the CSV file from the local directory
datasetTrain <- read.csv("data/train.csv", header=TRUE, sep = ",")
datasetTest <- read.csv("data/test.csv", header=TRUE, sep = ",")

4. Analyze Data

Let’s better understand our data, by performing some descriprtive statistics and visualizations.

Descriptive statistics

# dimensions of dataset
dim(datasetTrain)

## [1] 891  12

# atrtribute class type
sapply(datasetTrain, class)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "integer"   "integer"   "integer"    "factor"    "factor"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "integer"   "integer"    "factor"   "numeric"    "factor"    "factor"

# peek
head(datasetTrain)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

From the sample values shown in the dataset, we can see that PassengerId, Name, Ticket, Cabin and Embarked will not be useful in our analysis, so we can probably remove it.

# remove redundant variables
datasetTrain <- datasetTrain[,c(-1, -4, -9, -11, -12)]
# summary
summary(datasetTrain)

##     Survived          Pclass          Sex           Age       
##  Min.   :0.0000   Min.   :1.000   female:314   Min.   : 0.42  
##  1st Qu.:0.0000   1st Qu.:2.000   male  :577   1st Qu.:20.12  
##  Median :0.0000   Median :3.000                Median :28.00  
##  Mean   :0.3838   Mean   :2.309                Mean   :29.70  
##  3rd Qu.:1.0000   3rd Qu.:3.000                3rd Qu.:38.00  
##  Max.   :1.0000   Max.   :3.000                Max.   :80.00  
##                                                NA's   :177    
##      SibSp           Parch             Fare       
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   Max.   :512.33  
##

Also note that sex is a factor variable, and we may have some missing data (NA). We may need to remove the records (or impute values) with NA values for some analysis and modeling techiques.
Next, we need to convert sex to a numeric variable and ‘survived’ (our output variable) to a factor.

# convert survived to factor
datasetTrain[,1] <- as.factor((datasetTrain[,1]))

Let’s take a closer look at the class distributions

# class distribution
cbind(freq=table(datasetTrain$Survived), percentage=prop.table(table(datasetTrain$Survived))*100)

##   freq percentage
## 0  549   61.61616
## 1  342   38.38384

There is some imbalance in the Class values. We observe an approximate 60% to 40% split for Died (= 0) and Survived (= 1) in the class values.

Finally, lets look at the correlation between the attributes. We have to exclude the rows with NA values when calculating the correlations.

data <- datasetTrain
data[,3] <- as.numeric((data[,3]))
complete_cases <- complete.cases(data)
cor(data[complete_cases,2:5])

##             Pclass         Sex         Age       SibSp
## Pclass  1.00000000  0.15546030 -0.36922602  0.06724737
## Sex     0.15546030  1.00000000  0.09325358 -0.10394968
## Age    -0.36922602  0.09325358  1.00000000 -0.30824676
## SibSp   0.06724737 -0.10394968 -0.30824676  1.00000000

Values above 0.75 or below -0.75 are indicative of high positive or high negative correlation. From the above results, variables are not highly correlated in this dataset.

Data visualizations

# barplot of males and females who survived
barplot(table(datasetTrain$Survived, datasetTrain[,3]))
legend("topleft", legend = c("Died", "Survived"), fill=c("black","grey"))

5. Evaluate Algorithms

We’ll try some linear and non-linear algorithms:
* Linear Algorithms: Logistic Regression (LG), Linear Discriminate Analysis (LDA) and Regularized Logistic Regression (GLMNET).
* Non-Linear Algorithms: k-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB) and Support Vector Machines with Radial Basis Functions (SVM).

Test options and evaluation metric.

Lets define a test harness.
We will use 10-fold cross validation with 3 repeats. This is a good standard test harness configuration. It is a binary classification problem. For simplicity, we will use Accuracy and Kappa metrics.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"

Spot-Check Algorithms

# LG
set.seed(7)
fit.glm <- train(Survived~., data=datasetTrain, method="glm", metric=metric, trControl=trainControl)
# LDA
set.seed(7)
fit.lda <- train(Survived~., data=datasetTrain, method="lda", metric=metric, trControl=trainControl)

## Loading required package: MASS

# GLMNET
set.seed(7)
fit.glmnet <- train(Survived~., data=datasetTrain, method="glmnet", metric=metric,
    trControl=trainControl)

## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-3

# KNN
set.seed(7)
fit.knn <- train(Survived~., data=datasetTrain, method="knn", metric=metric, trControl=trainControl)
# CART
set.seed(7)
fit.cart <- train(Survived~., data=datasetTrain, method="rpart", metric=metric,
    trControl=trainControl)

## Loading required package: rpart

# Naive Bayes
set.seed(7)
fit.nb <- train(Survived~., data=datasetTrain, method="nb", metric=metric, trControl=trainControl)

## Loading required package: klaR

# SVM
set.seed(7)
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric,
    trControl=trainControl)

## Loading required package: kernlab

# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn,
    CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM 
## Number of resamples: 30 
## 
## Accuracy 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LG     0.6901  0.7614 0.8028 0.7955  0.8421 0.8732    0
## LDA    0.6806  0.7500 0.7833 0.7853  0.8327 0.8611    0
## GLMNET 0.6901  0.7641 0.8028 0.7969  0.8451 0.8732    0
## KNN    0.6197  0.6620 0.7042 0.6956  0.7289 0.7917    0
## CART   0.6338  0.7474 0.7833 0.7703  0.8028 0.8472    0
## NB     0.6901  0.7359 0.7762 0.7763  0.8056 0.8611    0
## SVM    0.7222  0.7895 0.8252 0.8230  0.8606 0.8889    0
## 
## Kappa 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LG     0.3350  0.4863 0.5761 0.5715  0.6665 0.7363    0
## LDA    0.3088  0.4750 0.5471 0.5492  0.6505 0.7080    0
## GLMNET 0.3350  0.4931 0.5761 0.5741  0.6703 0.7363    0
## KNN    0.1886  0.2852 0.3635 0.3555  0.4180 0.5694    0
## CART   0.2257  0.4311 0.5188 0.5026  0.5857 0.6615    0
## NB     0.3350  0.4378 0.5359 0.5328  0.6111 0.7113    0
## SVM    0.3955  0.5541 0.6278 0.6218  0.6986 0.7610    0

dotplot(results)

SVM had the highest accuracy with with 82%.

Evaluate Algorithms

We would apply a Box-Cox transform to flatten out the distribution.

# Compare algorithms
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(Survived~., data=datasetTrain, method="glm", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)
# LDA
set.seed(7)
fit.lda <- train(Survived~., data=datasetTrain, method="lda", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)
# GLMNET
set.seed(7)
fit.glmnet <- train(Survived~., data=datasetTrain, method="glmnet", metric=metric,
    preProc=c("BoxCox"), trControl=trainControl)
# KNN
set.seed(7)
fit.knn <- train(Survived~., data=datasetTrain, method="knn", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)
# CART
set.seed(7)
fit.cart <- train(Survived~., data=datasetTrain, method="rpart", metric=metric,
    preProc=c("BoxCox"), trControl=trainControl)
# Naive Bayes
set.seed(7)
fit.nb <- train(Survived~., data=datasetTrain, method="nb", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)
# SVM
set.seed(7)
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric,
    preProc=c("BoxCox"), trControl=trainControl)
# Compare algorithms
transformResults <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn,
    CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(transformResults)

## 
## Call:
## summary.resamples(object = transformResults)
## 
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM 
## Number of resamples: 30 
## 
## Accuracy 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LG     0.6901  0.7641 0.8169 0.7997  0.8451 0.8732    0
## LDA    0.6806  0.7606 0.7902 0.7876  0.8281 0.8611    0
## GLMNET 0.6901  0.7641 0.8169 0.7997  0.8451 0.8732    0
## KNN    0.6111  0.6620 0.6901 0.6941  0.7324 0.7917    0
## CART   0.6338  0.7474 0.7833 0.7703  0.8028 0.8472    0
## NB     0.6901  0.7465 0.7778 0.7768  0.8056 0.8750    0
## SVM    0.7465  0.7945 0.8322 0.8300  0.8702 0.8889    0
## 
## Kappa 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LG     0.3350  0.5016 0.6105 0.5794  0.6703 0.7363    0
## LDA    0.3007  0.4923 0.5625 0.5526  0.6369 0.7080    0
## GLMNET 0.3350  0.4974 0.6105 0.5793  0.6703 0.7363    0
## KNN    0.1537  0.2793 0.3509 0.3544  0.4345 0.5645    0
## CART   0.2257  0.4311 0.5188 0.5026  0.5857 0.6615    0
## NB     0.3350  0.4535 0.5431 0.5333  0.6014 0.7387    0
## SVM    0.4458  0.5642 0.6387 0.6360  0.7220 0.7637    0

dotplot(transformResults)

The accuracy of SVM slightly increased to 83%. Still not good enough.

6. Improve Accuracy

Let’s try some tuning of the top algorithms, specifically SVM and see if we can lift the accuracy.

Algorithm Tuning

The SVM implementation has two parameters that we can tune with caret package. The sigma which is a smoothing term, and C which is a cost constraint.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1))
fit.svm <- train(Survived~., data=datasetTrain, method="svmRadial", metric=metric, tuneGrid=grid,
    preProc=c("BoxCox"), trControl=trainControl)
print(fit.svm)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 891 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: Box-Cox transformation (2) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 642, 643, 643, 642, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C   Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.025   1  0.7940402  0.5625769  0.04995250   0.10890772
##   0.025   2  0.8010629  0.5774007  0.04541514   0.10043513
##   0.025   3  0.8066706  0.5893258  0.04670475   0.10293467
##   0.025   4  0.8127413  0.6027869  0.04657574   0.10211953
##   0.025   5  0.8145996  0.6070040  0.04575838   0.09991166
##   0.025   6  0.8169536  0.6126585  0.04151477   0.09042901
##   0.025   7  0.8178860  0.6143981  0.04233197   0.09193154
##   0.025   8  0.8169666  0.6123880  0.04065050   0.08813937
##   0.025   9  0.8183685  0.6153956  0.03941643   0.08490302
##   0.025  10  0.8169666  0.6120991  0.03932924   0.08467801
##   0.050   1  0.8085355  0.5932273  0.04583082   0.10146726
##   0.050   2  0.8160211  0.6104235  0.04227677   0.09137473
##   0.050   3  0.8183685  0.6149608  0.04074875   0.08713463
##   0.050   4  0.8216419  0.6207747  0.04224858   0.09079218
##   0.050   5  0.8249087  0.6267092  0.04416730   0.09536913
##   0.050   6  0.8267932  0.6300088  0.04280513   0.09275737
##   0.050   7  0.8272561  0.6309847  0.04358552   0.09404835
##   0.050   8  0.8235068  0.6226572  0.04145917   0.08919894
##   0.050   9  0.8230373  0.6214722  0.04226837   0.09057956
##   0.050  10  0.8225678  0.6203849  0.04160268   0.08930255
##   0.100   1  0.8197835  0.6181426  0.04056333   0.08664961
##   0.100   2  0.8249152  0.6258870  0.04143611   0.08991476
##   0.100   3  0.8249087  0.6243778  0.04181122   0.09049287
##   0.100   4  0.8258412  0.6261855  0.04033479   0.08753920
##   0.100   5  0.8258477  0.6264062  0.04063415   0.08783926
##   0.100   6  0.8230373  0.6204421  0.04178908   0.08990879
##   0.100   7  0.8220983  0.6185721  0.04042532   0.08705429
##   0.100   8  0.8239502  0.6231632  0.04280879   0.09191613
##   0.100   9  0.8235003  0.6223639  0.04379623   0.09437874
##   0.100  10  0.8220918  0.6192665  0.04574102   0.09853571
##   0.150   1  0.8249152  0.6257358  0.04113215   0.08920021
##   0.150   2  0.8267801  0.6283858  0.04207758   0.09117273
##   0.150   3  0.8267540  0.6293762  0.04480330   0.09661076
##   0.150   4  0.8248891  0.6257498  0.04699275   0.10157985
##   0.150   5  0.8207225  0.6171830  0.04811438   0.10401942
##   0.150   6  0.8193271  0.6143938  0.04680067   0.10127745
##   0.150   7  0.8169992  0.6098799  0.04870386   0.10474778
##   0.150   8  0.8151213  0.6061833  0.05017747   0.10784505
##   0.150   9  0.8137194  0.6034476  0.05147910   0.11035277
##   0.150  10  0.8127869  0.6019100  0.05227623   0.11196409
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.05 and C = 7.

plot(fit.svm)

Ensembles

Let’s look at 4 ensemble methods:

Bagging: Bagged CART (BAG) and Random Forest (RF).
Boosting: Stochastic Gradient Boosting (GBM) and C5.0 (C50).

We will use the same test harness as before including the Box-Cox transform that flattens out the distributions.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Bagged CART
set.seed(7)
fit.treebag <- train(Survived~., data=datasetTrain, method="treebag", metric=metric,
    trControl=trainControl)

## Loading required package: ipred
## Loading required package: plyr
## Loading required package: e1071

# Random Forest
set.seed(7)
fit.rf <- train(Survived~., data=datasetTrain, method="rf", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)

## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

# Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(Survived~., data=datasetTrain, method="gbm", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl, verbose=FALSE)

## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: parallel
## Loaded gbm 2.1.1

# C5.0
set.seed(7)
fit.c50 <- train(Survived~., data=datasetTrain, method="C5.0", metric=metric, preProc=c("BoxCox"),
    trControl=trainControl)

## Loading required package: C50

# Compare results
ensembleResults <- resamples(list(BAG=fit.treebag, RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensembleResults)

## 
## Call:
## summary.resamples(object = ensembleResults)
## 
## Models: BAG, RF, GBM, C50 
## Number of resamples: 30 
## 
## Accuracy 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## BAG 0.6806  0.7746 0.8099 0.7998  0.8327 0.9296    0
## RF  0.7361  0.7887 0.8310 0.8188  0.8451 0.9014    0
## GBM 0.7361  0.7945 0.8252 0.8253  0.8732 0.9167    0
## C50 0.7361  0.7782 0.8252 0.8198  0.8592 0.8873    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## BAG 0.3323  0.5286 0.6007 0.5802  0.6463 0.8550    0
## RF  0.4458  0.5420 0.6401 0.6152  0.6706 0.7971    0
## GBM 0.4223  0.5703 0.6299 0.6315  0.7287 0.8248    0
## C50 0.4223  0.5265 0.6286 0.6171  0.7084 0.7693    0

dotplot(ensembleResults)

7. Finalize Model

Lets finalize the SVM model to use on our entire training set.

We will have to remove the rows with missing values from the training dataset as well as the validation dataset. The code below shows the preparation of the pre-processing parameters using the training dataset.

# prepare parameters for data transform
# set.seed(7)
model <- svm(Survived ~ ., data = datasetTrain)
testData <- datasetTest[,c(-1, -8, -10, -11)]
preprocessParams <- preProcess(testData, method=c("BoxCox"))
testData$Age[is.na(testData$Age)] <- 0
testData$Fare[is.na(testData$Fare)] <- 0
testData <- predict(preprocessParams, testData)

## Age

predictions <- predict(model, testData, type="class")
submit <- data.frame(PassengerId = datasetTest$PassengerId, Survived = predictions)
write.csv(submit, file = "firstSVM.csv", row.names = FALSE)

Titanic: Machine Learning from Disaster

Ekaba Ononse Bisong

April 20, 2016