This is a classification challenge on Kaggle. I will use several classification methods to predict survival on the Titanic:
- Logistic Regression
- Classification tree
- Random forest
- Gradient Boosting
- Support Vector Machine
Details and the datasets can be found here: https://www.kaggle.com/c/titanic
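All the code below assumes the following packages are loaded (a minimal sketch; the list is inferred from the function calls in this post, not stated in the original):
library(readr)        # read_csv
library(dplyr)        # bind_rows
library(ggplot2)      # plots
library(ggthemes)     # theme_few
library(VIM)          # aggr
library(mice)         # imputation
library(rpart)        # classification tree
library(randomForest) # random forest, tuneRF
library(caret)        # train, trainControl
library(ROCR)         # prediction, performance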
train <- read_csv("C:/Users/todu6001/Downloads/train.csv")
## Parsed with column specification:
## cols(
## PassengerId = col_integer(),
## Survived = col_integer(),
## Pclass = col_integer(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_integer(),
## Parch = col_integer(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
test <- read_csv("C:/Users/todu6001/Downloads/test.csv")
## Parsed with column specification:
## cols(
## PassengerId = col_integer(),
## Pclass = col_integer(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_integer(),
## Parch = col_integer(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
#Combining both train and test to create a full data
full <- bind_rows(train, test)
#Data summary
str(full)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(full)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
aggr_plot <- aggr(full, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full),
                  cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 0.7746371276
## Survived 0.3193277311
## Age 0.2009167303
## Embarked 0.0015278839
## Fare 0.0007639419
## PassengerId 0.0000000000
## Pclass 0.0000000000
## Name 0.0000000000
## Sex 0.0000000000
## SibSp 0.0000000000
## Parch 0.0000000000
## Ticket 0.0000000000
We can see that the Cabin, Age, Embarked, and Fare variables have missing data (the missing Survived values simply correspond to the 418 test rows). The combined data has 1309 observations and 12 variables.
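A quick base-R cross-check of the same missing counts:
# NA count per column; matches the proportions reported by aggr
colSums(is.na(full))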
A passenger's full name might not explain much on its own; the title embedded in it, however, is different. A title reflects not only gender but also rank in society, and the higher a passenger's rank, the better their chance of being given a spot in a lifeboat.
# Check missing
sum(is.na(full$Name))
## [1] 0
# Take title from name
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
# Show title counts by sex
table(full$Sex, full$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
We can group the titles into "Mr", "Miss", "Mrs", "Master", and "Others".
# Titles with very low cell counts to be combined to "others"
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
# Also reassign mlle, ms, and mme accordingly
full$Title[full$Title == 'Mlle'] <- 'Miss'
full$Title[full$Title == 'Ms'] <- 'Miss'
full$Title[full$Title == 'Mme'] <- 'Mrs'
full$Title[full$Title %in% rare_title] <- 'Others'
# Show title counts by sex again
table(full$Sex, full$Title)
##
## Master Miss Mr Mrs Others
## female 0 264 0 198 4
## male 61 0 757 0 25
The size of the family might also be an interesting factor to consider.
#Check missing
sum(is.na(full$SibSp))
## [1] 0
sum(is.na(full$Parch))
## [1] 0
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1
#Use ggplot2 to visualize the relationship between family size & survival
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:11)) +
labs(x = 'Family Size') +
theme_few()
We can see that passengers travelling alone have the highest ratio of deaths to survivals of any family size.
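To put a number on that observation, a row-wise proportion table (my addition, not in the original run) gives the death/survival split for each family size in the training rows:
# Survival proportions by family size; each row sums to 1
prop.table(table(full$Fsize[1:891], full$Survived[1:891]), margin = 1)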
#Check missing
sum(is.na(full$Cabin))
## [1] 1014
#Have a look at the data
full$Cabin [1:50]
## [1] NA "C85" NA "C123" NA
## [6] NA "E46" NA NA NA
## [11] "G6" "C103" NA NA NA
## [16] NA NA NA NA NA
## [21] NA "D56" NA "A6" NA
## [26] NA NA "C23 C25 C27" NA NA
## [31] NA "B78" NA NA NA
## [36] NA NA NA NA NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
This variable has far too many missing values (1014 of 1309) to analyze directly, so I decided to drop it.
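As an aside, if one wanted to salvage some signal from Cabin before dropping it, the deck letter is the usual candidate; a sketch (the Deck column is hypothetical and is not used in the models below):
# First character of Cabin is the deck letter; stays NA where Cabin is NA
full$Deck <- substr(full$Cabin, 1, 1)
table(full$Deck, useNA = "ifany")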
# Process the Age and Fare columns
sum(is.na(full$Age))
## [1] 263
sum(is.na(full$Fare))
## [1] 1
full$PassengerId[is.na(full$Fare)]
## [1] 1044
# Convert categorical variables to factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
'Fsize','Survived','Title')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
Age is probably one of the key factors in the model, but it has many missing values. Since Age is continuous, we can try to fill in the missing values with the mice package.
# Use mice package to predict missing age and missing fare
set.seed(123)
mice_predict <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin','Family','Survived','Fsize','Title')], method='rf')
##
## iter imp variable
## 1 1 Age Fare Embarked
## 1 2 Age Fare Embarked
## 1 3 Age Fare Embarked
## 1 4 Age Fare Embarked
## 1 5 Age Fare Embarked
## 2 1 Age Fare Embarked
## 2 2 Age Fare Embarked
## 2 3 Age Fare Embarked
## 2 4 Age Fare Embarked
## 2 5 Age Fare Embarked
## 3 1 Age Fare Embarked
## 3 2 Age Fare Embarked
## 3 3 Age Fare Embarked
## 3 4 Age Fare Embarked
## 3 5 Age Fare Embarked
## 4 1 Age Fare Embarked
## 4 2 Age Fare Embarked
## 4 3 Age Fare Embarked
## 4 4 Age Fare Embarked
## 4 5 Age Fare Embarked
## 5 1 Age Fare Embarked
## 5 2 Age Fare Embarked
## 5 3 Age Fare Embarked
## 5 4 Age Fare Embarked
## 5 5 Age Fare Embarked
full_mice <- complete(mice_predict)
aggr_plot <- aggr(full_mice, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full_mice), cex.axis=.7,
                  gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Pclass 0
## Sex 0
## Age 0
## SibSp 0
## Parch 0
## Fare 0
## Embarked 0
No missing values remain, which means all the data points have been filled in. Next we check that imputation preserved the means and distributions, using histograms.
# Check distribution before and after for AGE
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Before Replacement',
col='red', ylim=c(0,0.04))
hist(full_mice$Age, freq=F, main='After Replacement',
col='blue', ylim=c(0,0.04))
# Check distribution before and after for FARE
par(mfrow=c(1,2))
hist(full$Fare, freq=F, main='Before Replacement',
col='red', ylim=c(0,0.04))
hist(full_mice$Fare, freq=F, main='After Replacement',
col='blue', ylim=c(0,0.04))
The Before and After distributions are essentially identical, so we copy the imputed values back into the full data set.
#Replace value into the data set
full$Age <- full_mice$Age
full$Fare <- full_mice$Fare
full$Embarked <- full_mice$Embarked
Now that we have finished cleaning the data, we can generate the classification models.
To compare the different models consistently, I follow these steps for each one:
1. Fit the classification model
2. Tune the attributes if necessary
3. Generate classification rate
4. Generate ROC curve and AUC score
The AUC score summarizes how well a model separates the two classes: an area of 1 represents a perfect classifier, while an area of 0.5 is no better than random guessing. A rough guide for grading a classifier is the traditional academic point system:
- .90-1 = excellent (A)
- .80-.90 = good (B)
- .70-.80 = fair (C)
- .60-.70 = poor (D)
- .50-.60 = fail (F)
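Since steps 3 and 4 repeat for every model, a small helper function (my addition, wrapping the same ROCR calls used for each model below) keeps the evaluation consistent for the 0/1-coded outcomes:
# Accuracy at a 0.5 cutoff plus AUC, from predicted class-1 probabilities
evaluate_model <- function(prob, truth, cutoff = 0.5) {
  acc <- mean(ifelse(prob > cutoff, "1", "0") == as.character(truth))
  roc_pred <- prediction(prob, truth)
  auc <- performance(roc_pred, measure = "auc")@y.values[[1]]
  list(accuracy = acc, auc = auc)
}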
# Check for missing again
aggr_plot <- aggr(full, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full),
                  cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 0.7746371
## Survived 0.3193277
## PassengerId 0.0000000
## Pclass 0.0000000
## Name 0.0000000
## Sex 0.0000000
## Age 0.0000000
## SibSp 0.0000000
## Parch 0.0000000
## Ticket 0.0000000
## Fare 0.0000000
## Embarked 0.0000000
## Title 0.0000000
## Fsize 0.0000000
# Split the data back into a train set and a test set
trainx <- full[1:891,]
testx <- full[892:1309,]
test_sample <- sample(1:nrow(trainx), 350) # indices of a 350-row validation set
head(trainx[test_sample,]) # inspect the validation rows (View() only works interactively)
I split the data back into train and test sets after adding the new variables and cleaning all the missing data. I also carved a 350-row validation sample out of the training data to measure classification accuracy.
Every model below is built on Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Fsize, and Title. Let's start with a basic logistic regression model.
# Logistic regression model, fit on the rows outside the validation sample
logit_death <- glm(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data= trainx[-test_sample,],
family=binomial)
summary(logit_death)
##
## Call:
## glm(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp +
## Parch + Fare + Embarked + Fsize + Title, family = binomial,
## data = trainx[-test_sample, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4791 -0.4782 -0.3159 0.4545 2.6051
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.277e+01 1.695e+03 0.013 0.989285
## Pclass2 -1.520e+00 4.342e-01 -3.500 0.000465 ***
## Pclass3 -2.567e+00 4.365e-01 -5.881 4.08e-09 ***
## Sexmale -1.754e+01 1.695e+03 -0.010 0.991745
## Age -2.273e-02 1.199e-02 -1.896 0.057935 .
## SibSp -1.568e+00 9.909e+01 -0.016 0.987376
## Parch -1.781e+00 9.909e+01 -0.018 0.985657
## Fare 2.110e-03 3.174e-03 0.665 0.506205
## Embarked2 1.655e-01 5.426e-01 0.305 0.760391
## Embarked3 -2.883e-01 3.523e-01 -0.819 0.413021
## Fsize2 1.406e+00 9.909e+01 0.014 0.988677
## Fsize3 3.557e+00 1.982e+02 0.018 0.985680
## Fsize4 4.462e+00 2.973e+02 0.015 0.988025
## Fsize5 4.157e+00 3.964e+02 0.010 0.991632
## Fsize6 4.920e+00 4.955e+02 0.010 0.992078
## Fsize7 6.911e+00 5.945e+02 0.012 0.990726
## Fsize8 -5.959e+00 1.179e+03 -0.005 0.995968
## Fsize11 NA NA NA NA
## TitleMiss -1.912e+01 1.695e+03 -0.011 0.991002
## TitleMr -4.724e+00 1.010e+00 -4.678 2.89e-06 ***
## TitleMrs -1.853e+01 1.695e+03 -0.011 0.991277
## TitleOthers -4.938e+00 1.245e+00 -3.965 7.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 716.95 on 540 degrees of freedom
## Residual deviance: 399.27 on 520 degrees of freedom
## AIC: 441.27
##
## Number of Fisher Scoring iterations: 15
#Predict on validation data
predict_logit <- predict(logit_death,
trainx[test_sample,],
type="response")
## Warning: contrasts dropped from factor Embarked
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
fitted.results <- ifelse(predict_logit > 0.5,1,0)
mean(fitted.results == trainx[test_sample,]$Survived)
## [1] 0.8171429
It's striking that only Pclass2, Pclass3, TitleMr, and TitleOthers come out significant while the other attributes don't. The enormous standard errors on Sexmale and several Title and Fsize dummies (together with the rank-deficiency warning above) suggest these predictors are nearly collinear — Title already encodes Sex — so the individual p-values should not be over-interpreted.
AUC SCORE AND ROC CURVE
roc_pr <- prediction(predict_logit,
trainx[test_sample,]$Survived)
perform <- performance(roc_pr,measure = "tpr",
x.measure = "fpr")
plot(perform)
auc <- performance(roc_pr,measure="auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8432458
The classification accuracy is 81.7% and the AUC score is 0.84.
TREE MODEL. There are multiple packages that can fit a classification tree. However, I find it rather difficult to generate a ROC curve from a tree() model, so I decided to use the rpart package instead.
### DECISION TREE USING RPART
rpart_deathrate <- rpart(formula = Survived ~ Pclass + Sex + Age +
SibSp + Parch + Fare +
Embarked + Fsize + Title,
trainx[test_sample,])
printcp(rpart_deathrate)
##
## Classification tree:
## rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked + Fsize + Title, data = trainx[test_sample,
## ])
##
## Variables actually used in tree construction:
## [1] Age Embarked Fsize Pclass Sex Title
##
## Root node error: 138/350 = 0.39429
##
## n= 350
##
## CP nsplit rel error xerror xstd
## 1 0.442029 0 1.00000 1.00000 0.066251
## 2 0.028986 1 0.55797 0.57971 0.056926
## 3 0.021739 2 0.52899 0.60145 0.057661
## 4 0.014493 5 0.45652 0.60145 0.057661
## 5 0.010000 7 0.42754 0.61594 0.058133
plotcp(rpart_deathrate)
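The cp table shows the cross-validated error (xerror) bottoms out after the first split (0.58), suggesting the deeper tree overfits slightly. The standard rpart idiom for pruning back to that point is shown below (my addition; the results that follow use the unpruned tree):
# Prune at the CP value that minimizes cross-validated error
best_cp <- rpart_deathrate$cptable[which.min(rpart_deathrate$cptable[, "xerror"]), "CP"]
rpart_pruned <- prune(rpart_deathrate, cp = best_cp)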
predict_rpart <- predict(rpart_deathrate,trainx[-test_sample,], type="prob")[,2]
## Warning: contrasts dropped from factor Embarked
predict_rpart_class <- predict(rpart_deathrate,trainx[-test_sample,], type="class")
## Warning: contrasts dropped from factor Embarked
t <- table(predict_rpart_class,trainx[-test_sample,]$Survived)
sum(diag(t))/sum(t)
## [1] 0.8151571
The classification accuracy is 81.5%.
Next I use the same method to generate the ROC curve, for comparison across the different methods.
roc_rpart <- prediction(predict_rpart,trainx[-test_sample,]$Survived)
rpart_perform <- performance(roc_rpart,measure = "tpr",
x.measure = "fpr")
plot(rpart_perform)
auc_rpart <- performance(roc_rpart,measure="auc")
auc_rpart <- auc_rpart@y.values[[1]]
auc_rpart
## [1] 0.8013033
The AUC score is 0.801, which is lower than that of plain logistic regression.
One of the most famous classification methods is the random forest. In its simplest form:
simplerf.deathrate <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data = trainx[test_sample,],
ntree=500,
mtry=3)
simplerf.deathrate
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Fsize + Title, data = trainx[test_sample, ], ntree = 500, mtry = 3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.86%
## Confusion matrix:
## 0 1 class.error
## 0 192 20 0.09433962
## 1 46 92 0.33333333
The randomForest package exposes many tunable parameters, but I will focus on two:
- ntree: the number of trees to grow.
- mtry: the number of variables randomly sampled as candidates at each split.
TUNE MTRY USING THE TUNERF FUNCTION
tune <- tuneRF (trainx[test_sample,c("Pclass","Sex","Age","SibSp","Parch",
"Fare","Embarked","Fsize","Title")],
trainx[test_sample,]$Survived,
ntreeTry = 500)
## mtry = 3 OOB error = 19.43%
## Searching left ...
## mtry = 2 OOB error = 20.57%
## -0.05882353 0.05
## Searching right ...
## mtry = 6 OOB error = 22.57%
## -0.1617647 0.05
x <- trainx[test_sample,c("Pclass","Sex","Age","SibSp","Parch",
"Fare","Embarked","Fsize","Title")]
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
control <- trainControl(method="repeatedcv",
number=10,
repeats=3,
search="random")
metric <- "Accuracy"
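The caret objects above (tunegrid, control, metric) set up an alternative cross-validated search; to run it, one could call train() like this (a sketch only, not evaluated here):
# Sketch: caret-based random forest tune with the objects defined above
rf_caret <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                    Fare + Embarked + Fsize + Title,
                  data = trainx[test_sample,],
                  method = "rf", metric = metric,
                  tuneGrid = tunegrid, trControl = control)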
Both tuning approaches point to mtry = 3 (which is also the default here, sqrt(9) = 3).
Let's run the model with mtry = 3 and ntree = 1000, tracing the OOB error every 50 trees from 50 up to 1000 to see which value gives the lowest error rate.
rf_deathrate <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data = trainx[test_sample,],
ntree=1000,
mtry=3, do.trace=50)
## ntree OOB 1 2
## 50: 20.00% 9.91% 35.51%
## 100: 19.71% 9.91% 34.78%
## 150: 19.71% 9.91% 34.78%
## 200: 20.57% 10.38% 36.23%
## 250: 19.43% 8.49% 36.23%
## 300: 20.29% 8.96% 37.68%
## 350: 20.29% 9.43% 36.96%
## 400: 20.57% 9.43% 37.68%
## 450: 20.29% 9.43% 36.96%
## 500: 20.86% 10.38% 36.96%
## 550: 20.29% 10.38% 35.51%
## 600: 21.14% 10.85% 36.96%
## 650: 21.14% 10.85% 36.96%
## 700: 20.29% 9.91% 36.23%
## 750: 20.57% 10.38% 36.23%
## 800: 20.57% 10.38% 36.23%
## 850: 20.57% 10.38% 36.23%
## 900: 20.86% 10.85% 36.23%
## 950: 20.00% 9.43% 36.23%
## 1000: 20.00% 9.43% 36.23%
rf_deathrate
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Fsize + Title, data = trainx[test_sample, ], ntree = 1000, mtry = 3, do.trace = 50)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## 0 1 class.error
## 0 192 20 0.09433962
## 1 50 88 0.36231884
The trace shows the OOB error is lowest around ntree = 250 (19.43%); however, the differences across ntree values are small and not meaningfully better than the default ntree = 500.
pred <- predict(rf_deathrate,trainx[-test_sample,])
mean(pred == trainx[-test_sample,]$Survived)
## [1] 0.8373383
predsimp <- predict(simplerf.deathrate,trainx[-test_sample,])
mean(predsimp == trainx[-test_sample,]$Survived)
## [1] 0.8391867
The prediction accuracy of the default model is actually slightly better than the tuned model in this case (83.9% vs 83.7%).
Let's generate the ROC curve.
pred_rf <- predict(rf_deathrate,
trainx[-test_sample,],
type="prob")[,2]
roc_rf <- prediction(pred_rf,trainx[-test_sample,]$Survived)
perform_rf <- performance(roc_rf,measure = "tpr",
x.measure = "fpr")
plot(perform_rf)
auc_rf <- performance(roc_rf,measure="auc")
auc_rf <- auc_rf@y.values[[1]]
auc_rf
## [1] 0.8826511
The AUC score is 0.88, which is clearly better than both logistic regression and the decision tree.
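To see which features drive the forest, we can also inspect variable importance (a quick addition using randomForest's built-in plot; not part of the original run):
# Mean decrease in Gini impurity for each predictor
varImpPlot(rf_deathrate)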
Gradient boosting is a powerful method for improving the accuracy of a predictive model. Details can be found in this source: https://www.analyticsvidhya.com/blog/2015/09/complete-guide-boosting-methods/
#TUNING PARAMETERS FOR GBM
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (1:30)*50,
shrinkage = 0.01,
n.minobsinnode = 20)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## class probabilities not needed for accuracy-based tuning
                           classProbs = FALSE,
                           ## keep the resampled predictions for inspection
                           savePredictions = TRUE)
gbmFit2 <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data= trainx[test_sample,],
method = "gbm",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaluate:
tuneGrid = gbmGrid)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.3.3
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.3.3
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.3.3
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
gbmFit2
## Stochastic Gradient Boosting
##
## 350 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 315, 316, 315, 315, 314, 315, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7792232 0.5236165
## 1 100 0.7797946 0.5252812
## 1 150 0.7777778 0.5214060
## 1 200 0.7752386 0.5177372
## 1 250 0.7749528 0.5182954
## 1 300 0.7772474 0.5242965
## 1 350 0.7769048 0.5240855
## 1 400 0.7769127 0.5244509
## 1 450 0.7743483 0.5192265
## 1 500 0.7743567 0.5194277
## 1 550 0.7723562 0.5152104
## 1 600 0.7715317 0.5133764
## 1 650 0.7712288 0.5128436
## 1 700 0.7744057 0.5191436
## 1 750 0.7766424 0.5238586
## 1 800 0.7777932 0.5258077
## 1 850 0.7775401 0.5249187
## 1 900 0.7775560 0.5246460
## 1 950 0.7781036 0.5260849
## 1 1000 0.7784057 0.5264777
## 1 1050 0.7789613 0.5274942
## 1 1100 0.7786835 0.5266365
## 1 1150 0.7809869 0.5318383
## 1 1200 0.7803992 0.5299829
## 1 1250 0.7803828 0.5295040
## 1 1300 0.7818114 0.5326164
## 1 1350 0.7803903 0.5295035
## 1 1400 0.7815668 0.5326768
## 1 1450 0.7821218 0.5335887
## 1 1500 0.7807250 0.5310262
## 5 50 0.7743249 0.4740193
## 5 100 0.7641078 0.4627010
## 5 150 0.7735387 0.5015669
## 5 200 0.7830187 0.5262554
## 5 250 0.7843903 0.5298762
## 5 300 0.7857866 0.5328189
## 5 350 0.7877951 0.5375277
## 5 400 0.7877951 0.5380345
## 5 450 0.7880490 0.5393176
## 5 500 0.7891998 0.5422519
## 5 550 0.7914944 0.5479439
## 5 600 0.7906858 0.5454474
## 5 650 0.7923922 0.5495676
## 5 700 0.7955121 0.5566724
## 5 750 0.7949734 0.5556910
## 5 800 0.7972526 0.5603525
## 5 850 0.7972446 0.5611585
## 5 900 0.8000616 0.5676671
## 5 950 0.7981022 0.5633781
## 5 1000 0.7977754 0.5628775
## 5 1050 0.7972199 0.5616146
## 5 1100 0.7980859 0.5637605
## 5 1150 0.7986494 0.5649982
## 5 1200 0.7966732 0.5607047
## 5 1250 0.7980943 0.5642379
## 5 1300 0.7958161 0.5595881
## 5 1350 0.7961270 0.5605021
## 5 1400 0.7958492 0.5598379
## 5 1450 0.7964132 0.5614092
## 5 1500 0.7938487 0.5561551
## 9 50 0.7743249 0.4740193
## 9 100 0.7640831 0.4608702
## 9 150 0.7712530 0.4955789
## 9 200 0.7783641 0.5153469
## 9 250 0.7814669 0.5224519
## 9 300 0.7843576 0.5293234
## 9 350 0.7849692 0.5315500
## 9 400 0.7875173 0.5370467
## 9 450 0.7883828 0.5392792
## 9 500 0.7895504 0.5426462
## 9 550 0.7898361 0.5433487
## 9 600 0.7904160 0.5447184
## 9 650 0.7915023 0.5471743
## 9 700 0.7940999 0.5530091
## 9 750 0.7947208 0.5553146
## 9 800 0.7952596 0.5562399
## 9 850 0.7944024 0.5548046
## 9 900 0.7961172 0.5587507
## 9 950 0.7946881 0.5555475
## 9 1000 0.7969585 0.5607300
## 9 1050 0.7969748 0.5615377
## 9 1100 0.7949664 0.5569394
## 9 1150 0.7943866 0.5564008
## 9 1200 0.7943623 0.5564182
## 9 1250 0.7941013 0.5558120
## 9 1300 0.7938394 0.5556119
## 9 1350 0.7938399 0.5557241
## 9 1400 0.7941013 0.5559923
## 9 1450 0.7935299 0.5550190
## 9 1500 0.7932446 0.5547693
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 900,
## interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 20.
plot(gbmFit2)
gbm_pred1 <- predict(gbmFit2,trainx[-test_sample,])
## Warning: contrasts dropped from factor Embarked
mean(gbm_pred1 == trainx[-test_sample,]$Survived)
## [1] 0.818854
gbm_pred <- predict(gbmFit2,trainx[-test_sample,],type="prob")
## Warning: contrasts dropped from factor Embarked
The classification accuracy is a little lower than I expected: only 81.9%.
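For reference, caret stores the winning hyper-parameter combination in the fitted object:
# Selected tuning values (n.trees = 900, interaction.depth = 5, ...)
gbmFit2$bestTune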
roc_gbm <- prediction(gbm_pred[,2],trainx[-test_sample,]$Survived)
perform_gbm <- performance(roc_gbm,measure = "tpr",
x.measure = "fpr")
plot(perform_gbm)
auc_gbm <- performance(roc_gbm,measure="auc")
auc_gbm<- auc_gbm@y.values[[1]]
auc_gbm
## [1] 0.8646797
The AUC score is decent compared to the other methods, but it still can't outperform the random forest's 0.88.
Overview:
SVMs are a group of classifiers considered among the best "out of the box" methods. There are several types of SVMs. The simplest is the _maximal margin classifier_ (MMC). Though the MMC is elegant and simple, it cannot be applied to most data sets, since it requires that the classes be separable by a linear boundary. The support vector classifier extends the maximal margin classifier and can be applied in a broader range of cases. The support vector machine is a further extension that accommodates non-linear class boundaries.
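The svmRadial model tuned below has two knobs: the cost C, which controls how heavily margin violations are penalized, and the width sigma of the radial basis function kernel used by kernlab:
K(x, x') = exp(-sigma * ||x - x'||^2)
Larger sigma gives a more flexible decision boundary, and larger C fits the training data more tightly; the grid below searches a small neighborhood of candidate values for both.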
# caret requires factor labels (e.g. "died"/"survived") rather than 0/1 for the dependent variable when class probabilities are computed
trainx$Survived <- factor(trainx$Survived, levels = c(0,1), labels = c("died","survived"))
# Setup for cross validation
set.seed(123)
ctrl <- trainControl(method="cv",
number = 2,
summaryFunction=twoClassSummary,
classProbs=TRUE)
# Grid search to fine tune SVM
grid <- expand.grid(sigma = c(.01, .015, 0.2),
C = c(0.75, 0.9, 1, 1.1, 1.25))
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
##
## alpha
## The following object is masked from 'package:ggplot2':
##
## alpha
# Note: despite the object name, this fits a radial-kernel SVM (method = "svmRadial")
svm_Linear <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
trainx[test_sample,],
method = "svmRadial",
metric="ROC",
trControl=ctrl,
tuneGrid=grid
)
svm_Linear
## Support Vector Machines with Radial Basis Function Kernel
##
## 350 samples
## 9 predictor
## 2 classes: 'died', 'survived'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 175, 175
## Resampling results across tuning parameters:
##
## sigma C ROC Sens Spec
## 0.010 0.75 0.8292316 0.9009434 0.6376812
## 0.010 0.90 0.8308039 0.9009434 0.6376812
## 0.010 1.00 0.8311457 0.9009434 0.6304348
## 0.010 1.10 0.8305305 0.8962264 0.6594203
## 0.010 1.25 0.8283429 0.9009434 0.6376812
## 0.015 0.75 0.8258819 0.9009434 0.6376812
## 0.015 0.90 0.8251982 0.9009434 0.6449275
## 0.015 1.00 0.8244463 0.8915094 0.6449275
## 0.015 1.10 0.8228056 0.8915094 0.6521739
## 0.015 1.25 0.8226689 0.8915094 0.6521739
## 0.200 0.75 0.8205496 0.8396226 0.6811594
## 0.200 0.90 0.8202078 0.8537736 0.6521739
## 0.200 1.00 0.8229423 0.8396226 0.6666667
## 0.200 1.10 0.8240361 0.8726415 0.6376812
## 0.200 1.25 0.8219169 0.8726415 0.6231884
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 1.
# Generate prediction and calculate accuracy
pred_svm <- predict( svm_Linear,trainx[-test_sample,])
## Warning: contrasts dropped from factor Embarked
mean(pred_svm == trainx[-test_sample,]$Survived)
## [1] 0.8280961
# ROC curve and AUC
pred_svm2 <- predict(svm_Linear,
trainx[-test_sample,],
type="prob")[,2]
## Warning: contrasts dropped from factor Embarked
roc_svm <- prediction(pred_svm2,trainx[-test_sample,]$Survived)
perform_svm <- performance(roc_svm,measure = "tpr",
x.measure = "fpr")
plot(perform_svm)
auc_svm <- performance(roc_svm,measure="auc")
auc_svm <- auc_svm@y.values[[1]]
auc_svm
## [1] 0.8676689
Summary of results:
- Logistic regression: accuracy 81.71%, AUC 0.8432
- Classification tree: accuracy 81.51%, AUC 0.8013
- Random forest: accuracy 83.91%, AUC 0.8826
- Gradient boosting: accuracy 81.89%, AUC 0.8647
- Support vector machine: accuracy 82.81%, AUC 0.8677
The random forest has the highest AUC score and the highest prediction accuracy. Gradient boosting also has a decent AUC, but a comparatively low accuracy. In this situation, the random forest is therefore the best performer.
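Finally, the winning random forest can be turned into a Kaggle submission. A sketch (my addition; it retrains on all 891 training rows and writes the PassengerId/Survived format the competition expects):
# Retrain on the full training data and predict the 418 test rows
final_rf <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch +
                           Fare + Embarked + Fsize + Title,
                         data = trainx, ntree = 500, mtry = 3)
# trainx$Survived was relabeled "died"/"survived" above; map back to 0/1
submission <- data.frame(PassengerId = testx$PassengerId,
                         Survived = ifelse(predict(final_rf, testx) == "survived", 1, 0))
write_csv(submission, "titanic_submission.csv")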