1 Introduction

This is my first kernel on Kaggle. After reading Megan Risdal's Kaggle kernel, I was really inspired by her work. So here are my two cents on exploratory data analysis and on testing machine learning algorithms on the Titanic data using the amazing caret package.

Plus, the numerous amazing packages by Hadley Wickham have made my life so much easier. Thanks, Hadley.

2 Load data into the R console

2.1 read.csv vs fread function

There are many ways to do this: read.csv in the base package and fread in the data.table package, among others. The difference is minimal on the Titanic dataset, but for much larger files of several GB, fread is extremely fast and still simple to use.
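A quick way to see the difference yourself (a sketch; it assumes data.table is installed and train.csv is in the working directory):

library(data.table)
system.time(df_base <- read.csv("train.csv"))  # base R reader
system.time(df_dt   <- fread("train.csv"))     # data.table reader; the gap widens with file size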

2.2 load the packages

I think it is wise not to suppress the package loading information. Though it is not pretty in the output HTML notebook, there might be conflicts between packages or new information after updating some packages. I read these messages occasionally.
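(For completeness: the messages could be silenced with base R's suppressPackageStartupMessages, e.g. suppressPackageStartupMessages(library(tidyverse)), but as said, I prefer to see them.)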

library(tidyverse) # a collection of packages by Hadley 
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(caret)     # a package for machine learning algorithms
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(data.table)# a package for fast data loading into r
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(tidyr)     # a package for manipulating dataframe in r
library(Amelia)    # a package to visualize missing data 
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

2.3 load the data into R and examine the data structures.

# load all characters as factors; change this by adding stringsAsFactors=FALSE to the read.csv call, i.e. titanic=read.csv("train.csv", stringsAsFactors=FALSE)
titanic=read.csv("train.csv",fill=TRUE,na.strings=c("NA",""))

str(titanic) # structure of the data
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
summary(titanic) # summary of the data 
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked  
##  B96 B98    :  4   C   :168  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3   NA's:  2  
##  D          :  3             
##  (Other)    :186             
##  NA's       :687

Next, let us visualize the missing data.

missmap(titanic,main="Titanic Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)

It shows that there is missing data in Age and Cabin (and, as the summary above shows, two missing values in Embarked).

# since there are only 2 missing values in Embarked, we can just replace them with the majority level

titanic$Embarked[is.na(titanic$Embarked)]="S"
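# a more general alternative (sketch): look up the majority level instead of hardcoding it
# titanic$Embarked[is.na(titanic$Embarked)]=names(which.max(table(titanic$Embarked)))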

titanic=titanic%>%as_data_frame()

The last line transforms the data frame into a tibble, another form of data frame that plays well with dplyr for data manipulation.
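As a quick illustration (a sketch; the exact output depends on console width), a tibble prints compactly, with its dimensions and column types in the header:

print(titanic, n = 3)  # shows '# A tibble: 891 × 12' plus the first three rows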

3 Exploratory Analysis

Exploring the data is a less well-defined process: it is about asking intuitive questions of the data.

I personally follow the kind of "routine" suggested by Hadley Wickham in his book "R for Data Science": plot the variation of single meaningful variables and the covariation between pairs of variables.

3.1 What to do on the different titanic variables?

First, the PassengerId data is not so interesting since it is only used to identify every passenger.

Second, Name is imported as a factor, but I definitely want to split it into first and last names as character strings.

Third, ticket number is not interesting to me, either. So I am going to remove it from the data.

Fourth, for Cabin it is not clear to me what it means to have three cabins assigned to the same person. Are they all from the same family? We will need to look into this further after we split the name into first and last name.

titanic$Cabin%>%summary%>%head(10)
##     B96 B98 C23 C25 C27          G6     C22 C26           D        E101 
##           4           4           4           3           3           3 
##          F2         F33         B18         B20 
##           3           3           2           2

The %>% symbol I used simply passes data from left to right, following the direction of the arrow. It comes from the magrittr package. It makes code more readable and easier to write, with fewer intermediate variables.
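For example, the piped chain above is just a more readable version of a nested call; these two lines are equivalent:

head(summary(titanic$Cabin), 10)
titanic$Cabin %>% summary %>% head(10)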

3.2 separate the Name variable into last name and title, and delete the first name, Ticket and PassengerId variables.

titanic$Name=titanic$Name%>%as.character

titanic$Name%>%head(20)
##  [1] "Braund, Mr. Owen Harris"                                
##  [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"    
##  [3] "Heikkinen, Miss. Laina"                                 
##  [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"           
##  [5] "Allen, Mr. William Henry"                               
##  [6] "Moran, Mr. James"                                       
##  [7] "McCarthy, Mr. Timothy J"                                
##  [8] "Palsson, Master. Gosta Leonard"                         
##  [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"      
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"                    
## [11] "Sandstrom, Miss. Marguerite Rut"                        
## [12] "Bonnell, Miss. Elizabeth"                               
## [13] "Saundercock, Mr. William Henry"                         
## [14] "Andersson, Mr. Anders Johan"                            
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"                   
## [16] "Hewlett, Mrs. (Mary D Kingcome) "                       
## [17] "Rice, Master. Eugene"                                   
## [18] "Williams, Mr. Charles Eugene"                           
## [19] "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
## [20] "Masselmani, Mrs. Fatima"

It seems that the first part of the Name variable is the last name. The second part, separated from the last name by ",", is the title, and the name after the title is the first name. I do not fully understand the meaning of the name in round brackets, but some first names are effectively missing, like "Hewlett, Mrs. (Mary D Kingcome)", whose name is Mrs Mary Dunbar Hewlett (née Kingcome), so an easier way is to deal only with the last name.

I got all this from some Google searches on the names directly.

In conclusion, the structure of the name is "Lastname, Title Firstname". I am going to keep only the title. Plus, I will delete the Ticket and PassengerId variables.

library(tidyr)
library(dplyr)

# split Name at the first ". " into (Last, First), then split Last at ", " into (Last2, Title)
titanic_nl=titanic%>%separate(col=Name, into=c("Last","First"),sep="\\. ")%>%separate(col=Last, into=c("Last2","Title"),sep=", ")
## Warning: Too many values at 1 locations: 514
# one name did not split cleanly; fix its Title by hand
titanic_nl[506,"Title"]="Mrs"

titanic_nl%>%summary
##   PassengerId       Survived          Pclass         Last2          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##     Title              First               Sex           Age       
##  Length:891         Length:891         female:314   Min.   : 0.42  
##  Class :character   Class :character   male  :577   1st Qu.:20.12  
##  Mode  :character   Mode  :character                Median :28.00  
##                                                     Mean   :29.70  
##                                                     3rd Qu.:38.00  
##                                                     Max.   :80.00  
##                                                     NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:646   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186           
##  NA's       :687

3.3 After cleaning the data a little, let us ask some questions and do some visualizations.

This is an iterative step: you make some visualizations and ask some questions, then based on what you see you refine the questions or ask new ones.

  • Did Jack Dawson and Rose DeWitt survive the disaster or not? (Based on the movie, Jack should not have survived and Rose should have.)
library(stringr)
# Let us try the first name Jack
titanic%>%filter( str_detect(Name,"Jack") )
## # A tibble: 1 × 12
##   PassengerId Survived Pclass                      Name    Sex   Age SibSp
##         <int>    <int>  <int>                     <chr> <fctr> <dbl> <int>
## 1         767        0      1 Brewe, Dr. Arthur Jackson   male    NA     0
## # ... with 5 more variables: Parch <int>, Ticket <fctr>, Fare <dbl>,
## #   Cabin <fctr>, Embarked <fctr>
# Let us try the last name Dawson
titanic%>%filter( str_detect(Name,"Dawson") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## #   Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## #   Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>

It seems that there is only one man with "Jackson" in his name on board the Titanic, and no one called Jack Dawson.

# Let us try the first name Rose
titanic%>%filter( str_detect(Name,"Rose") )
## # A tibble: 1 × 12
##   PassengerId Survived Pclass                       Name    Sex   Age
##         <int>    <int>  <int>                      <chr> <fctr> <dbl>
## 1         856        1      3 Aks, Mrs. Sam (Leah Rosen) female    18
## # ... with 6 more variables: SibSp <int>, Parch <int>, Ticket <fctr>,
## #   Fare <dbl>, Cabin <fctr>, Embarked <fctr>
# Let us try the last name Bukater
titanic%>%filter( str_detect(Name,"Bukater") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## #   Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## #   Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>
# Let us try the middle name DeWitt 
titanic%>%filter( str_detect(Name,"DeWitt") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## #   Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## #   Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>

It seems that Rose DeWitt Bukater is also a fictional character. A little Google search turned up this website, which confirms my finding that the two main characters are fictional.

http://www.chasingthefrog.com/reelfaces/titanic.php

  • What is the survival probability in different classes? (This is the most asked question about the Titanic dataset.)
library(ggplot2)


titanic_nl%>%ggplot(aes(x=Pclass,fill=factor(Survived)))+geom_bar(stat="count",position="fill")

  • How about survival by sex?
titanic_nl%>%ggplot(aes(x=Sex,fill=factor(Survived) ))+geom_bar(stat = "count",position="fill")

  • How about survival for different family sizes in different cabin class?

Let us create a family size variable (Fsize) and adjust the Fare based on family size (some visualization shows that the Fare recorded is per family rather than per person).
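One way to check that claim (a sketch, to be run before Fare is dropped in the next chunk): raw fares scale with family size within each class, consistent with per-family pricing.

titanic_nl%>%ggplot(aes(x=factor(SibSp+Parch+1), y=Fare))+geom_boxplot()+facet_wrap(~Pclass)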

# drop the name parts and Ticket, add family size and per-person price,
# and keep only the deck letter (the first character) of Cabin
titanic_nl=titanic_nl%>%select(-Last2,-First,-Ticket)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Fare)%>%
  separate(col=Cabin,into=c("Cabin","Others"),sep=1)%>%select(-Others)
# treat class and survival as categorical from here on
titanic_nl$Pclass=titanic_nl$Pclass%>%factor()
titanic_nl$Survived=titanic_nl$Survived%>%factor()

# Let us first get family size distribution in different class
titanic_nl%>%ggplot()+geom_bar(aes(x=Pclass,fill=factor(Fsize)))

titanic_nl%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Fsize,fill=Survived))

Let us get the survival rate in table form.

titanic_nl%>%summary
##   PassengerId    Survived Pclass     Title               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Cabin          
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##  Embarked     Fsize            Price       
##  C:168    Min.   : 1.000   Min.   :  0.00  
##  Q: 77    1st Qu.: 1.000   1st Qu.:  7.25  
##  S:646    Median : 1.000   Median :  8.30  
##           Mean   : 1.905   Mean   : 19.92  
##           3rd Qu.: 2.000   3rd Qu.: 23.67  
##           Max.   :11.000   Max.   :512.33  
## 
titanic_nl%>%group_by(Pclass,Fsize)%>%summarise(totnum=n(), rate=sum(Survived==1)/totnum)
## Source: local data frame [21 x 4]
## Groups: Pclass [?]
## 
##    Pclass Fsize totnum      rate
##    <fctr> <dbl>  <int>     <dbl>
## 1       1     1    109 0.5321101
## 2       1     2     70 0.7285714
## 3       1     3     24 0.7500000
## 4       1     4      7 0.7142857
## 5       1     5      2 1.0000000
## 6       1     6      4 0.5000000
## 7       2     1    104 0.3461538
## 8       2     2     34 0.5294118
## 9       2     3     31 0.6774194
## 10      2     4     13 0.7692308
## # ... with 11 more rows

From the data, I can say that whether you are traveling alone or with a family, Pclass 1 gives a higher survival probability in general.

Interestingly, in Pclass 1 a family of size 3 actually has a slightly higher survival probability (0.75) than a family of 2 (0.73) or 4 (0.71), while in Pclass 2 the survival probability keeps rising with family size up to 4 (0.77). The printout above truncates the remaining rows, including all of Pclass 3.
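To inspect all 21 rows rather than the truncated printout, the grouped summary can be piped into print with n = Inf (a tibble printing option):

titanic_nl%>%group_by(Pclass,Fsize)%>%summarise(totnum=n(), rate=sum(Survived==1)/totnum)%>%print(n=Inf)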

  • Does the Cabin letter have anything to do with survival?

It is kind of hard to tell since there is a lot of missing data.

titanic_nl$Cabin%>%factor%>%summary
##    A    B    C    D    E    F    G    T NA's 
##   15   47   59   33   32   13    4    1  687

3.4 Let us look at the Fares paid by people in different Pclass.

titanic_nl%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Price,fill=Sex))

It is surprising that females paid more than males in Pclass==1. Let us take a closer look at Pclass==2 and 3.

titanic_nl%>%filter(Pclass==2 | Pclass==3)%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Price,fill=Sex))

In Pclass==2 and 3, females and males paid about the same.

We can also see from the graph that some people did not pay anything for their tickets.

Let us see how many people paid nothing in each Pclass.

titanic_nl%>%filter(Price==0)%>%ggplot()+geom_bar(aes(x=Pclass))

Let us look at the other information on these people.

titanic_nl%>%filter(Price==0)
## # A tibble: 15 × 12
##    PassengerId Survived Pclass    Title    Sex   Age SibSp Parch Cabin
##          <int>   <fctr> <fctr>    <chr> <fctr> <dbl> <int> <int> <chr>
## 1          180        0      3       Mr   male    36     0     0  <NA>
## 2          264        0      1       Mr   male    40     0     0     B
## 3          272        1      3       Mr   male    25     0     0  <NA>
## 4          278        0      2       Mr   male    NA     0     0  <NA>
## 5          303        0      3       Mr   male    19     0     0  <NA>
## 6          414        0      2       Mr   male    NA     0     0  <NA>
## 7          467        0      2       Mr   male    NA     0     0  <NA>
## 8          482        0      2       Mr   male    NA     0     0  <NA>
## 9          598        0      3       Mr   male    49     0     0  <NA>
## 10         634        0      1       Mr   male    NA     0     0  <NA>
## 11         675        0      2       Mr   male    NA     0     0  <NA>
## 12         733        0      2       Mr   male    NA     0     0  <NA>
## 13         807        0      1       Mr   male    39     0     0     A
## 14         816        0      1       Mr   male    NA     0     0     B
## 15         823        0      1 Jonkheer   male    38     0     0  <NA>
## # ... with 3 more variables: Embarked <fctr>, Fsize <dbl>, Price <dbl>

They are all single adult males with no family who embarked from S (Southampton), spread across all three classes. Most of them did not survive the tragedy of the Titanic.

It is not so useful to use these data for model building.

3.5 Get the clean form of the data and the correlation matrix

Let us build the final dataset from scratch, ignoring the Cabin variable for now.

# rebuild from the raw data: keep the last name, drop the rest of the name, Ticket and PassengerId,
# and add family size (Fsize) and per-person price (Price)
t_final=titanic%>%separate(col=Name, into=c("Lastname","name"), sep="," )%>%select(-name,-Ticket,-PassengerId)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Cabin,-Fare)

t_final$Lastname=t_final$Lastname%>%factor
t_final$Survived=t_final$Survived%>%factor

Finally, the correlation matrix plot (a pairwise scatterplot matrix via pairs).

pairs(t_final)

4 Prediction and performance evaluation of various machine learning models.

In a blog post, Jason Brownlee summarised four steps to improve model performance:

  1. Improve Performance With Data.
  2. Improve Performance With Algorithms.
  3. Improve Performance With Algorithm Tuning.
  4. Improve Performance With Ensembles.

There is not much we can do about the first step for the Titanic dataset. Let us start with the second and third steps using the caret package. There is a complete list of the machine learning algorithms that caret supports.
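That list can also be browsed from within R (a sketch using caret's own lookup helpers):

library(caret)
names(getModelInfo())%>%head(10)  # model codes caret knows about
modelLookup("rf")                 # tuning parameters for a given method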

4.1 Split it into training and testing datasets.

Further clean the data

# remove the zero-fare passengers identified above
# (their PassengerId values double as row indices in the original data)
Missdata=titanic%>%filter(Fare==0)%>%select(PassengerId)%>%as.data.frame()
Missdata=Missdata%>%unlist
titanic1=titanic[-Missdata,]

# Next we distill the title from the Name column
titanic2=titanic1%>%separate(col=Name, into=c("Last","First"),sep="\\. ")%>%separate(col=Last, into=c("Last2","Title"),sep=", ")
## Warning: Too many values at 1 locations: 506
titanic2[506,"Title"]="Mrs"

titanic2=titanic2%>%select(-Last2,-First,-Ticket)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Fare)%>%
  separate(col=Cabin,into=c("Cabin","Others"),sep=1)%>%select(-Others)


titanic2$Cabin=titanic2$Cabin%>%factor
titanic2$Title=titanic2$Title%>%factor
titanic2$Pclass=titanic2$Pclass%>%factor
titanic2$Survived=titanic2$Survived%>%factor

# hand-crafted indicator features motivated by the EDA above
t_final2=titanic2%>%mutate(
  femaleclass1=ifelse( Pclass==1 & Sex=="female","Live","Dead")%>%as.factor(),   # first-class women mostly survived
  maleclass3=ifelse( (Cabin=="F" | is.na(Cabin) )& Pclass==3 & Sex=="male","Dead","Live" )%>%as.factor(), # third-class men with F/unknown cabins mostly died
  maleclass2=ifelse(  is.na(Cabin) & Pclass==2 &Sex=="male","Dead","Live" ) %>%as.factor() # likewise for second-class men with unknown cabins
)%>%select(-Cabin)%>%select(-PassengerId)


# impute missing numeric values (Age) with kNN, then center and scale; preProcess leaves factor columns alone
t_preprocess=t_final2%>%preProcess(method=c("knnImpute","center","scale"))
t_final2=predict(t_preprocess,t_final2)

I had some difficulty fixing the two missing values, and some difficulty imputing the Embarked variable (preProcess's knnImpute only operates on numeric columns, so a factor cannot be imputed this way).

Split the data

library(caret)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
# stratified 90/10 split on the outcome
index=createDataPartition(t_final2$Survived,p=0.9,list=FALSE)
train_tf=t_final2[index,]
test_tf=t_final2[-index,]

# t_final2%>%summary
# t_final%>%summary

I chose 90% of the data for the training set and 10% for the testing set. The final validation is carried out by submitting to Kaggle.
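Note that createDataPartition samples within each outcome class, so the class balance is preserved on both sides of the split; a quick check (sketch):

prop.table(table(train_tf$Survived))  # survival rate in the training split
prop.table(table(test_tf$Survived))   # should be nearly identical in the test split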

4.2 Start with a simple random forest model as a baseline performance model.

There are 16 variants and implementations of random forest in caret. Let us start with basic random forest models from the two most commonly used implementations, rf and ranger.

cl <- makeCluster(5)   # spin up a cluster of 5 workers
registerDoParallel(cl) # so that train() can resample in parallel via foreach
# 10-fold CV repeated 3 times; oneSE picks the simplest model within one SE of the best
ctrl=trainControl(method="repeatedcv", number = 10, repeats=3 ,selectionFunction="oneSE")

rf_grid=expand.grid(mtry=2:8) # tune mtry, the number of predictors tried at each split

rf_rf=train(Survived~., data=train_tf, method="rf", trControl=ctrl,metric="Kappa", tuneGrid=rf_grid)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_ranger=train(Survived~., data=train_tf, method="ranger", trControl=ctrl,metric="Kappa",tuneGrid=rf_grid)
## Loading required package: e1071
## Loading required package: ranger
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
rf_rf
## Random Forest 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 710, 709, 710, 710, 710, 710, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8183677  0.6091853
##   3     0.8187952  0.6099733
##   4     0.8208891  0.6127323
##   5     0.8221601  0.6138682
##   6     0.8234315  0.6160270
##   7     0.8255522  0.6217755
##   8     0.8221550  0.6155569
## 
## Kappa was used to select the optimal model using  the one SE rule.
## The final value used for the model was mtry = 2.
rf_ranger
## Random Forest 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 710, 710, 711, 711, 709, 710, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8171205  0.6064947
##   3     0.8200799  0.6134967
##   4     0.8154276  0.6005553
##   5     0.8247159  0.6190102
##   6     0.8234284  0.6158647
##   7     0.8284863  0.6287345
##   8     0.8268038  0.6263752
## 
## Kappa was used to select the optimal model using  the one SE rule.
## The final value used for the model was mtry = 5.
rf_rf_pred=predict(rf_rf,test_tf)
rf_ranger_pred=predict(rf_ranger,test_tf)



confusionMatrix(rf_rf_pred,test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 11
##          1  5 23
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.6011          
##  Mcnemar's Test P-Value : 0.2113          
##                                           
##             Sensitivity : 0.9057          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.8136          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5517          
##    Detection Prevalence : 0.6782          
##       Balanced Accuracy : 0.7911          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(rf_ranger_pred,test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 12
##          1  4 22
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.5968          
##  Mcnemar's Test P-Value : 0.08012         
##                                           
##             Sensitivity : 0.9245          
##             Specificity : 0.6471          
##          Pos Pred Value : 0.8033          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5632          
##    Detection Prevalence : 0.7011          
##       Balanced Accuracy : 0.7858          
##                                           
##        'Positive' Class : 0               
## 
plot(rf_rf)

plot(rf_ranger)

stopCluster(cl)

It seems that increasing the number of repeats of the cross-validation could improve the accuracy estimate.

Let us see if preprocessing inside train could improve performance.

cl <- makeCluster(5)  
registerDoParallel(cl)
rf_rf2=train(Survived~., data=train_tf, method="rf", trControl=ctrl,metric="Kappa", tuneGrid=rf_grid,preProcess=c("center","scale"))
rf_ranger2=train(Survived~., data=train_tf, method="ranger", trControl=ctrl,metric="Kappa",tuneGrid=rf_grid,preProcess=c("center","scale"))



rf_rf2
## Random Forest 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 710, 710, 709, 710, 710, 710, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8166425  0.6055581
##   3     0.8187520  0.6099703
##   4     0.8259576  0.6237208
##   5     0.8175292  0.6022585
##   6     0.8196282  0.6058927
##   7     0.8276511  0.6252989
##   8     0.8251300  0.6208158
## 
## Kappa was used to select the optimal model using  the one SE rule.
## The final value used for the model was mtry = 3.
rf_ranger2
## Random Forest 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 710, 710, 709, 711, 711, 710, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8192174  0.6108921
##   3     0.8200613  0.6128125
##   4     0.8234372  0.6183661
##   5     0.8251147  0.6196011
##   6     0.8255741  0.6203797
##   7     0.8276730  0.6261899
##   8     0.8327418  0.6386241
## 
## Kappa was used to select the optimal model using  the one SE rule.
## The final value used for the model was mtry = 7.
rf_rf_pred2=predict(rf_rf2,test_tf)
rf_ranger_pred2=predict(rf_ranger2,test_tf)



confusionMatrix(rf_rf_pred2,test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 11
##          1  5 23
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.6011          
##  Mcnemar's Test P-Value : 0.2113          
##                                           
##             Sensitivity : 0.9057          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.8136          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5517          
##    Detection Prevalence : 0.6782          
##       Balanced Accuracy : 0.7911          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(rf_ranger_pred2,test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50 13
##          1  3 21
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.5923          
##  Mcnemar's Test P-Value : 0.02445         
##                                           
##             Sensitivity : 0.9434          
##             Specificity : 0.6176          
##          Pos Pred Value : 0.7937          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5747          
##    Detection Prevalence : 0.7241          
##       Balanced Accuracy : 0.7805          
##                                           
##        'Positive' Class : 0               
## 
plot(rf_rf2)

plot(rf_ranger2)

stopCluster(cl)

It seems that the random forest would have an accuracy of around 0.81.

4.3 Next up, the famous Support Vector Machine.

Let us try some other classification algorithms such as svmLinear, svmLinear2, svmLinear3, svmPoly and svmRadial.

Normally SVM benefits from normalized data; here the predictors were already centered and scaled during preprocessing, and caret can also do this per resample via train's preProcess argument.

First let us try the svm linear models.

cl <- makeCluster(5)  
registerDoParallel(cl)
ctrl_t=trainControl(method="repeatedcv", number = 10, repeats=10 ,selectionFunction="oneSE")


svml1_grid=expand.grid(C=seq(0.001,1,length.out = 10))
svml2_grid=expand.grid(cost=seq(0.001,1,length.out = 10))
svml3_grid=expand.grid(cost=seq(0.001,1,length.out = 10), Loss=c("L1","L2"))


rf_svml1=train(Survived~.,data=train_tf,method="svmLinear",trControl=ctrl_t,tuneGrid=svml1_grid)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
rf_svml2=train(Survived~.,data=train_tf,method="svmLinear2",trControl=ctrl_t,tuneGrid=svml2_grid)
rf_svml3=train(Survived~.,data=train_tf,method="svmLinear3",trControl=ctrl_t,tuneGrid=svml3_grid)
## Loading required package: LiblineaR
rf_svml1
## Support Vector Machines with Linear Kernel 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 711, 710, 710, 709, 709, 710, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##   0.001  0.6971411  0.2679210
##   0.112  0.8218113  0.6192980
##   0.223  0.8228193  0.6207195
##   0.334  0.8209157  0.6166047
##   0.445  0.8196465  0.6137468
##   0.556  0.8197683  0.6136692
##   0.667  0.8187508  0.6113681
##   0.778  0.8178647  0.6093211
##   0.889  0.8177413  0.6089071
##   1.000  0.8177429  0.6086697
## 
## Accuracy was used to select the optimal model using  the one SE rule.
## The final value used for the model was C = 0.112.
rf_svml2
## Support Vector Machines with Linear Kernel 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 710, 710, 711, 711, 711, 710, ... 
## Resampling results across tuning parameters:
## 
##   cost   Accuracy   Kappa    
##   0.001  0.7027205  0.2844091
##   0.112  0.8205138  0.6169497
##   0.223  0.8216419  0.6189227
##   0.334  0.8204978  0.6159045
##   0.445  0.8202478  0.6151151
##   0.556  0.8193616  0.6129245
##   0.667  0.8189819  0.6120691
##   0.778  0.8184772  0.6108475
##   0.889  0.8183473  0.6104651
##   1.000  0.8180925  0.6099795
## 
## Accuracy was used to select the optimal model using  the one SE rule.
## The final value used for the model was cost = 0.112.
rf_svml3
## L2 Regularized Support Vector Machine (dual) with Linear Kernel 
## 
## 789 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 711, 710, 710, 709, 711, 710, ... 
## Resampling results across tuning parameters:
## 
##   cost   Loss  Accuracy   Kappa    
##   0.001  L1    0.8035727  0.5747374
##   0.001  L2    0.6780379  0.2067771
##   0.112  L1    0.8268875  0.6285051
##   0.112  L2    0.8234585  0.6243872
##   0.223  L1    0.8258829  0.6258107
##   0.223  L2    0.8252307  0.6274808
##   0.334  L1    0.8260079  0.6257938
##   0.334  L2    0.8248526  0.6263670
##   0.445  L1    0.8255062  0.6246297
##   0.445  L2    0.8232085  0.6225874
##   0.556  L1    0.8247468  0.6228052
##   0.556  L2    0.8224441  0.6208138
##   0.667  L1    0.8244919  0.6223242
##   0.667  L2    0.8215548  0.6187237
##   0.778  L1    0.8244935  0.6221062
##   0.778  L2    0.8215564  0.6185621
##   0.889  L1    0.8242404  0.6215058
##   0.889  L2    0.8218064  0.6188342
##   1.000  L1    0.8241154  0.6212112
##   1.000  L2    0.8214298  0.6178842
## 
## Accuracy was used to select the optimal model using  the one SE rule.
## The final values used for the model were cost = 0.112 and Loss = L1.
rf_svml1%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47 10
##          1  6 24
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.6054          
##  Mcnemar's Test P-Value : 0.4533          
##                                           
##             Sensitivity : 0.8868          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8246          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5402          
##    Detection Prevalence : 0.6552          
##       Balanced Accuracy : 0.7963          
##                                           
##        'Positive' Class : 0               
## 
rf_svml2%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47 10
##          1  6 24
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.6054          
##  Mcnemar's Test P-Value : 0.4533          
##                                           
##             Sensitivity : 0.8868          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8246          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5402          
##    Detection Prevalence : 0.6552          
##       Balanced Accuracy : 0.7963          
##                                           
##        'Positive' Class : 0               
## 
rf_svml3%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 10
##          1  4 24
##                                           
##                Accuracy : 0.8391          
##                  95% CI : (0.7448, 0.9091)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.843e-06       
##                                           
##                   Kappa : 0.651           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.9245          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8305          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5632          
##    Detection Prevalence : 0.6782          
##       Balanced Accuracy : 0.8152          
##                                           
##        'Positive' Class : 0               
## 
stopCluster(cl)

It seems that the performance is not as good as the random forest models, since the relationship is somewhat non-linear.

Let us try a non-linear SVM model with adaptive resampling.

# adaptive resampling: after a minimum of 10 resamples, tuning candidates that a
# generalized least squares test (alpha = 0.05) deems unlikely to win are dropped;
# complete = TRUE fully resamples the chosen candidate at the end
adaptControl <- trainControl(method = "adaptive_cv",
                             number = 10, repeats = 10,
                             adaptive = list(min = 10, alpha = 0.05, 
                                             method = "gls", complete = TRUE),
                             search = "random")

cl <- makeCluster(6)  
registerDoParallel(cl)


t1=proc.time()
set.seed(2)
svm_lssvmRadial_Adapt = train(Survived ~ ., data = train_tf,
                                 method = "lssvmRadial", 
                                 trControl = adaptControl, 
                                 preProc = c("center", "scale"),
                                 metric = "Kappa",
                                 tuneLength = 90)
t2=proc.time()

svm_lssvmRadial_Adapt%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 11
##          1  5 23
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7186, 0.8911)
##     No Information Rate : 0.6092          
##     P-Value [Acc > NIR] : 2.749e-05       
##                                           
##                   Kappa : 0.6011          
##  Mcnemar's Test P-Value : 0.2113          
##                                           
##             Sensitivity : 0.9057          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.8136          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.6092          
##          Detection Rate : 0.5517          
##    Detection Prevalence : 0.6782          
##       Balanced Accuracy : 0.7911          
##                                           
##        'Positive' Class : 0               
## 
(t2-t1)/60
##        user      system     elapsed 
##  0.62883333  0.05366667 13.33333333
stopCluster(cl)

On a side note, the SVM algorithm does not utilize all the CPU power of a quad-core i7-6700 machine even though I registered six workers; CPU utilization peaks at about 50%.

5 Final model

After a long computation spot-checking and fine-tuning many models, I found an algorithm with high accuracy.

#LogitBoost_Adapt
library(caret)
library(tidyverse)
library(doParallel)
cl <- makeCluster(5)  
registerDoParallel(cl)

t1=proc.time()
set.seed(2)
LogitBoost_Adapt = train(Survived ~ ., data = train_tf,method = "LogitBoost",trControl = adaptControl,
                         preProc = c("center", "scale"), metric = "Kappa",tuneLength = 540)
## Loading required package: caTools
t2=proc.time()
(t2-t1)/60 
##       user     system    elapsed 
## 0.18283333 0.03133333 0.52850000
LogitBoost_Adapt%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)                                   
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 44  3
##          1  1 14
##                                          
##                Accuracy : 0.9355         
##                  95% CI : (0.843, 0.9821)
##     No Information Rate : 0.7258         
##     P-Value [Acc > NIR] : 3.217e-05      
##                                          
##                   Kappa : 0.8318         
##  Mcnemar's Test P-Value : 0.6171         
##                                          
##             Sensitivity : 0.9778         
##             Specificity : 0.8235         
##          Pos Pred Value : 0.9362         
##          Neg Pred Value : 0.9333         
##              Prevalence : 0.7258         
##          Detection Rate : 0.7097         
##    Detection Prevalence : 0.7581         
##       Balanced Accuracy : 0.9007         
##                                          
##        'Positive' Class : 0              
## 
stopCluster(cl)
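A caveat on that headline number: the confusion matrix above covers only 62 of the 87 test cases. The caTools implementation behind method = "LogitBoost" returns NA when the boosting votes for the two classes are tied, and confusionMatrix drops those observations, so the 0.94 accuracy is computed only on the cases the model classified with confidence.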