This is my first kernel on Kaggle. After reading Megan Risdal's Kaggle kernel, I was really inspired by her work. So here are my two cents on exploratory data analysis and testing machine learning algorithms on the Titanic data using the amazing caret package.
Plus, numerous amazing packages by Hadley Wickham made my life so much easier. Thanks Hadley.
There are many ways to load the data: read.csv() in base R and fread() in the data.table package, among others. For the small Titanic dataset the difference is minimal, but for much larger files of several GB, fread() is extremely fast and simple to use.
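As a minimal sketch (assuming train.csv sits in the working directory; titanic_dt is just an illustrative name), the fread() version would look like this, with arguments mirroring the read.csv() call used below:
library(data.table)
# fread() returns a data.table; na.strings and stringsAsFactors mirror the read.csv() call below
titanic_dt=fread("train.csv", na.strings=c("NA",""), stringsAsFactors=TRUE)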
I think it is wise not to suppress the package loading messages. They are not pretty in the output HTML notebook, but they report conflicts between packages and other useful information after updating them. I read them occasionally.
library(tidyverse) # a collection of packages by Hadley
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(caret) # a package for machine learning algorithms
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(data.table)# a package for fast data loading into r
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(tidyr) # a package for manipulating dataframe in r
library(Amelia) # a package to visualize missing data
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
# read.csv loads all character columns as factors by default; change this by adding stringsAsFactors=FALSE, e.g. titanic=read.csv("train.csv", stringsAsFactors=FALSE)
titanic=read.csv("train.csv",fill=TRUE,na.strings=c("NA",""))
str(titanic) # structure of the data
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
summary(titanic) # summary of the data
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## B96 B98 : 4 C :168
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3 NA's: 2
## D : 3
## (Other) :186
## NA's :687
Next, let us visualize the missing data.
missmap(titanic,main="Titanic Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)
It shows that there is missing data in Age and Cabin.
# since there are only 2 missing values in Embarked, we can just replace them with the majority value, "S"
titanic$Embarked[is.na(titanic$Embarked)]="S"
titanic=titanic%>%as_data_frame()
The last line converts the data frame into a tibble, a variant of data frame that works well with dplyr for data manipulation.
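For example, printing a tibble shows only the first few rows together with the column types, which makes a quick look easy (just an illustration):
# tibbles print compactly: a handful of rows plus the column types
titanic %>% print(n=5)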
Exploring the data is a less well-defined process: it amounts to asking intuitive questions of the data.
I personally follow the kind of "routine" suggested by Hadley Wickham in his book "R for Data Science": plot the variation of single meaningful variables and the covariation between pairs of variables.
First, the PassengerId variable is not so interesting, since it only identifies each passenger.
Second, Name is imported as a factor, but I definitely want to split it into first and last names as character.
Third, the ticket number is not interesting to me either, so I am going to remove it from the data.
Fourth, Cabin is not so clear to me: what does it mean for three cabins to be assigned to the same person? Are they all from the same family? We will need to look at this again after we split the name into first and last names.
titanic$Cabin%>%summary%>%head(10)
## B96 B98 C23 C25 C27 G6 C22 C26 D E101
## 4 4 4 3 3 3
## F2 F33 B18 B20
## 3 3 2 2
The %>% symbol simply passes data from left to right, following the direction of the arrow. It comes from the magrittr package. It makes code more readable and easier to write, with fewer intermediate variables.
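For example, the following two lines do the same thing; the piped version reads in the order the operations actually happen:
# nested calls read inside-out
head(summary(titanic$Cabin), 10)
# piped calls read left to right
titanic$Cabin %>% summary %>% head(10)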
titanic$Name=titanic$Name%>%as.character
titanic$Name%>%head(20)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
## [7] "McCarthy, Mr. Timothy J"
## [8] "Palsson, Master. Gosta Leonard"
## [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"
## [11] "Sandstrom, Miss. Marguerite Rut"
## [12] "Bonnell, Miss. Elizabeth"
## [13] "Saundercock, Mr. William Henry"
## [14] "Andersson, Mr. Anders Johan"
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"
## [16] "Hewlett, Mrs. (Mary D Kingcome) "
## [17] "Rice, Master. Eugene"
## [18] "Williams, Mr. Charles Eugene"
## [19] "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
## [20] "Masselmani, Mrs. Fatima"
It seems that the first part of the Name variable is the last name. The second part, separated from the last name by ",", is the title, and the name after the title is the first name. I do not understand the meaning of the name in round brackets. There are also some missing first names, like "Hewlett, Mrs. (Mary D Kingcome)", whose full name is Mrs Mary Dunbar Hewlett (née Kingcome), so an easier option is to deal only with the last name.
I got all of this from a Google search on the names.
In conclusion, the structure of the name is Lastname, Title. Firstname. I am going to keep only the title, and I will also drop Ticket and PassengerId.
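As an aside, the title can also be pulled out with a single regular expression instead of two separate() calls. A small sketch (titles is just a throwaway name for illustration):
# strip everything up to ", " and everything from the first "." onwards, leaving just the title
titles=gsub("(.*, )|(\\..*)", "", titanic$Name)
titles %>% table %>% sort(decreasing=TRUE) %>% head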
library(tidyr)
library(dplyr)
titanic_nl=titanic%>%separate(col=Name, into=c("Last","First"),sep="\\. ")%>%separate(col=Last, into=c("Last2","Title"),sep=", ")
## Warning: Too many values at 1 locations: 514
titanic_nl[506,"Title"]="Mrs"
titanic_nl%>%summary
## PassengerId Survived Pclass Last2
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Title First Sex Age
## Length:891 Length:891 female:314 Min. : 0.42
## Class :character Class :character male :577 1st Qu.:20.12
## Mode :character Mode :character Median :28.00
## Mean :29.70
## 3rd Qu.:38.00
## Max. :80.00
## NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:646
## C22 C26 : 3
## D : 3
## (Other) :186
## NA's :687
This is an iterative step: you make some visualizations and ask some questions, and then, based on what you see, you refine the questions or ask new ones.
library(stringr)
# Let us try the first name Jack
titanic%>%filter( str_detect(Name,"Jack") )
## # A tibble: 1 × 12
## PassengerId Survived Pclass Name Sex Age SibSp
## <int> <int> <int> <chr> <fctr> <dbl> <int>
## 1 767 0 1 Brewe, Dr. Arthur Jackson male NA 0
## # ... with 5 more variables: Parch <int>, Ticket <fctr>, Fare <dbl>,
## # Cabin <fctr>, Embarked <fctr>
# Let us try the last name Dawson
titanic%>%filter( str_detect(Name,"Dawson") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## # Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## # Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>
It seems that the only passenger with "Jack" anywhere in his name is a Dr. Arthur Jackson; there is no man called Jack Dawson on board the ship.
# Let us try the first name Rose
titanic%>%filter( str_detect(Name,"Rose") )
## # A tibble: 1 × 12
## PassengerId Survived Pclass Name Sex Age
## <int> <int> <int> <chr> <fctr> <dbl>
## 1 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18
## # ... with 6 more variables: SibSp <int>, Parch <int>, Ticket <fctr>,
## # Fare <dbl>, Cabin <fctr>, Embarked <fctr>
# Let us try the last name Bukater
titanic%>%filter( str_detect(Name,"Bukater") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## # Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## # Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>
# Let us try the middle name DeWitt
titanic%>%filter( str_detect(Name,"DeWitt") )
## # A tibble: 0 × 12
## # ... with 12 variables: PassengerId <int>, Survived <int>, Pclass <int>,
## # Name <chr>, Sex <fctr>, Age <dbl>, SibSp <int>, Parch <int>,
## # Ticket <fctr>, Fare <dbl>, Cabin <fctr>, Embarked <fctr>
It seems that Rose DeWitt Bukater is also a fictional character. A little Google search turned up this website, which confirms that the two main characters are fictional.
http://www.chasingthefrog.com/reelfaces/titanic.php
library(ggplot2)
titanic_nl%>%ggplot(aes(x=Pclass,fill=factor(Survived)))+geom_bar(stat="count",position="fill")
titanic_nl%>%ggplot(aes(x=Sex,fill=factor(Survived) ))+geom_bar(stat = "count",position="fill")
Let us create a family size variable (Fsize) and adjust the Fare based on family size (with some visualization we can see that the Fare paid is per family rather than per person).
titanic_nl=titanic_nl%>%select(-Last2,-First,-Ticket)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Fare)%>%
separate(col=Cabin,into=c("Cabin","Others"),sep=1)%>%select(-Others)
titanic_nl$Pclass=titanic_nl$Pclass%>%factor()
titanic_nl$Survived=titanic_nl$Survived%>%factor()
# Let us first get family size distribution in different class
titanic_nl%>%ggplot()+geom_bar(aes(x=Pclass,fill=factor(Fsize)))
titanic_nl%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Fsize,fill=Survived))
Let us get the survival rate in table form.
titanic_nl%>%summary
## PassengerId Survived Pclass Title Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Cabin
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Embarked Fsize Price
## C:168 Min. : 1.000 Min. : 0.00
## Q: 77 1st Qu.: 1.000 1st Qu.: 7.25
## S:646 Median : 1.000 Median : 8.30
## Mean : 1.905 Mean : 19.92
## 3rd Qu.: 2.000 3rd Qu.: 23.67
## Max. :11.000 Max. :512.33
##
titanic_nl%>%group_by(Pclass,Fsize)%>%summarise(totnum=n(), rate=sum(Survived==1)/totnum)
## Source: local data frame [21 x 4]
## Groups: Pclass [?]
##
## Pclass Fsize totnum rate
## <fctr> <dbl> <int> <dbl>
## 1 1 1 109 0.5321101
## 2 1 2 70 0.7285714
## 3 1 3 24 0.7500000
## 4 1 4 7 0.7142857
## 5 1 5 2 1.0000000
## 6 1 6 4 0.5000000
## 7 2 1 104 0.3461538
## 8 2 2 34 0.5294118
## 9 2 3 31 0.6774194
## 10 2 4 13 0.7692308
## # ... with 11 more rows
From the table, Pclass 1 gives a higher survival probability in general, whether you travel alone or with a family.
Family size also matters: in the rows shown, families of two to four survive noticeably more often than solo travellers in both Pclass 1 and Pclass 2, and in Pclass 2 the rate rises steadily with family size up to four, while the largest Pclass 1 families (size 6) drop back down.
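As a quick cross-check on the class effect alone, we can collapse the table over family size (a small sketch):
# overall survival rate by passenger class
titanic_nl %>% group_by(Pclass) %>% summarise(rate = mean(Survived == 1))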
As for the Cabin variable, it is hard to tell much, since most of it is missing.
titanic_nl$Cabin%>%factor%>%summary
## A B C D E F G T NA's
## 15 47 59 33 32 13 4 1 687
titanic_nl%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Price,fill=Sex))
It is surprising that females paid more than males in Pclass 1. Let us take a closer look at Pclass 2 and 3.
titanic_nl%>%filter(Pclass==2 | Pclass==3)%>%ggplot()+geom_boxplot(aes(x=Pclass,y=Price,fill=Sex))
In Pclass 2 and 3, females and males paid about the same.
We can also see from the graph that some people did not pay anything for their ticket.
Let us see how many people in each Pclass paid nothing.
titanic_nl%>%filter(Price==0)%>%ggplot()+geom_bar(aes(x=Pclass))
Let us see the other information in these people.
titanic_nl%>%filter(Price==0)
## # A tibble: 15 × 12
## PassengerId Survived Pclass Title Sex Age SibSp Parch Cabin
## <int> <fctr> <fctr> <chr> <fctr> <dbl> <int> <int> <chr>
## 1 180 0 3 Mr male 36 0 0 <NA>
## 2 264 0 1 Mr male 40 0 0 B
## 3 272 1 3 Mr male 25 0 0 <NA>
## 4 278 0 2 Mr male NA 0 0 <NA>
## 5 303 0 3 Mr male 19 0 0 <NA>
## 6 414 0 2 Mr male NA 0 0 <NA>
## 7 467 0 2 Mr male NA 0 0 <NA>
## 8 482 0 2 Mr male NA 0 0 <NA>
## 9 598 0 3 Mr male 49 0 0 <NA>
## 10 634 0 1 Mr male NA 0 0 <NA>
## 11 675 0 2 Mr male NA 0 0 <NA>
## 12 733 0 2 Mr male NA 0 0 <NA>
## 13 807 0 1 Mr male 39 0 0 A
## 14 816 0 1 Mr male NA 0 0 B
## 15 823 0 1 Jonkheer male 38 0 0 <NA>
## # ... with 3 more variables: Embarked <fctr>, Fsize <dbl>, Price <dbl>
They are all single adult males travelling without family, embarked from S (Southampton), and they appear in all three passenger classes. Most of them did not survive the Titanic tragedy.
These rows are not so useful for model building.
Get the clean form of the data and the correlation matrix
Let us build the final dataset from scratch and ignore the Cabin variable for now.
t_final=titanic%>%separate(col=Name, into=c("Lastname","name"), sep="," )%>%select(-name,-Ticket,-PassengerId)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Cabin,-Fare)
t_final$Lastname=t_final$Lastname%>%factor
t_final$Survived=t_final$Survived%>%factor
Finally, a pairs plot (scatterplot matrix) of the variables.
pairs(t_final)
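pairs() only draws scatterplots; for actual correlation coefficients we can restrict the data to its numeric columns. A rough sketch, with Survived converted back to 0/1 for this purpose:
# numeric correlation matrix; pairwise.complete.obs handles the missing Age values
num_vars=t_final %>% mutate(Survived=as.numeric(as.character(Survived))) %>% select_if(is.numeric)
cor(num_vars, use="pairwise.complete.obs") %>% round(2)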
In a blog post, Jason Brownlee summarised four steps to improve model performance.
There is not much we can do about the first step for the Titanic dataset, so let us start with the second and third steps using the caret package. There is a complete list of the machine learning algorithms that caret supports.
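The same list can also be queried from inside R; a small sketch (the exact count depends on the caret version installed):
# all model codes caret knows about, and the tuning parameters of one of them
length(names(getModelInfo()))
modelLookup("rf")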
Further clean the data
# collect the PassengerIds of the zero-fare passengers found above
Missdata=titanic%>%filter(Fare==0)%>%select(PassengerId)%>%as.data.frame()
Missdata=Missdata%>%unlist
# drop those rows (PassengerId happens to equal the row number in train.csv)
titanic1=titanic[-Missdata,]
# Next we distill the title from the Name column
titanic2=titanic1%>%separate(col=Name, into=c("Last","First"),sep="\\. ")%>%separate(col=Last, into=c("Last2","Title"),sep=", ")
## Warning: Too many values at 1 locations: 506
titanic2[506,"Title"]="Mrs"
titanic2=titanic2%>%select(-Last2,-First,-Ticket)%>%mutate(Fsize=SibSp+Parch+1, Price=Fare/Fsize)%>%select(-Fare)%>%
separate(col=Cabin,into=c("Cabin","Others"),sep=1)%>%select(-Others)
titanic2$Cabin=titanic2$Cabin%>%factor
titanic2$Title=titanic2$Title%>%factor
titanic2$Pclass=titanic2$Pclass%>%factor
titanic2$Survived=titanic2$Survived%>%factor
# hand-crafted indicator features suggested by the earlier exploration
t_final2=titanic2%>%mutate(
femaleclass1=ifelse( Pclass==1 & Sex=="female","Live","Dead")%>%as.factor(),
maleclass3=ifelse( (Cabin=="F" | is.na(Cabin) )& Pclass==3 & Sex=="male","Dead","Live" )%>%as.factor(),
maleclass2=ifelse( is.na(Cabin) & Pclass==2 &Sex=="male","Dead","Live" ) %>%as.factor()
)%>%select(-Cabin)%>%select(-PassengerId)
t_preprocess=t_final2%>%preProcess(method=c("knnImpute","center","scale"))
t_final2=predict(t_preprocess,t_final2)
I had some difficulty fixing the two missing values, and also some difficulty imputing the Embarked variable.
Split the data
library(caret)
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
index=createDataPartition(t_final2$Survived,p=0.9,list=FALSE)
train_tf=t_final2[index,]
test_tf=t_final2[-index,]
# t_final2%>%summary
# t_final%>%summary
I put 90% of the data in the training set and 10% in the test set. The final validation will be carried out by submitting predictions to Kaggle.
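For the Kaggle submission itself, the idea is to run test.csv through exactly the same feature engineering and preProcess steps and then write a two-column file. A rough sketch only, assuming a fitted model final_model, the raw test file test_raw and a cleaned test set test_clean built with the same steps as above (all three names are placeholders):
# test_clean must contain the same columns and transformations as train_tf
test_pred=predict(final_model, newdata=test_clean)
submission=data.frame(PassengerId=test_raw$PassengerId, Survived=test_pred)
write.csv(submission, "submission.csv", row.names=FALSE)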
There are 16 random forest variants and implementations in caret. Let us start with the most basic random forest models, using the two most commonly used implementations, rf and ranger.
cl <- makeCluster(5)
registerDoParallel(cl)
ctrl=trainControl(method="repeatedcv", number = 10, repeats=3 ,selectionFunction="oneSE")
rf_grid=expand.grid(mtry=2:8)
rf_rf=train(Survived~., data=train_tf, method="rf", trControl=ctrl,metric="Kappa", tuneGrid=rf_grid)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_ranger=train(Survived~., data=train_tf, method="ranger", trControl=ctrl,metric="Kappa",tuneGrid=rf_grid)
## Loading required package: e1071
## Loading required package: ranger
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
rf_rf
## Random Forest
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 710, 709, 710, 710, 710, 710, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8183677 0.6091853
## 3 0.8187952 0.6099733
## 4 0.8208891 0.6127323
## 5 0.8221601 0.6138682
## 6 0.8234315 0.6160270
## 7 0.8255522 0.6217755
## 8 0.8221550 0.6155569
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was mtry = 2.
rf_ranger
## Random Forest
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 710, 710, 711, 711, 709, 710, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8171205 0.6064947
## 3 0.8200799 0.6134967
## 4 0.8154276 0.6005553
## 5 0.8247159 0.6190102
## 6 0.8234284 0.6158647
## 7 0.8284863 0.6287345
## 8 0.8268038 0.6263752
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was mtry = 5.
rf_rf_pred=predict(rf_rf,test_tf)
rf_ranger_pred=predict(rf_ranger,test_tf)
confusionMatrix(rf_rf_pred,test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 11
## 1 5 23
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.6011
## Mcnemar's Test P-Value : 0.2113
##
## Sensitivity : 0.9057
## Specificity : 0.6765
## Pos Pred Value : 0.8136
## Neg Pred Value : 0.8214
## Prevalence : 0.6092
## Detection Rate : 0.5517
## Detection Prevalence : 0.6782
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : 0
##
confusionMatrix(rf_ranger_pred,test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 49 12
## 1 4 22
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.5968
## Mcnemar's Test P-Value : 0.08012
##
## Sensitivity : 0.9245
## Specificity : 0.6471
## Pos Pred Value : 0.8033
## Neg Pred Value : 0.8462
## Prevalence : 0.6092
## Detection Rate : 0.5632
## Detection Prevalence : 0.7011
## Balanced Accuracy : 0.7858
##
## 'Positive' Class : 0
##
plot(rf_rf)
plot(rf_ranger)
stopCluster(cl)
It seems that increasing the number of repeats of cross-validation could improve the accuracy.
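For instance, the same control object with ten repeats instead of three would look like this (ctrl_more is just an illustrative name; training time grows roughly in proportion):
# 10-fold CV repeated 10 times; everything else as before
ctrl_more=trainControl(method="repeatedcv", number=10, repeats=10, selectionFunction="oneSE")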
Let us see if preProcess can improve the performance.
cl <- makeCluster(5)
registerDoParallel(cl)
rf_rf2=train(Survived~., data=train_tf, method="rf", trControl=ctrl,metric="Kappa", tuneGrid=rf_grid,preProcess=c("center","scale"))
rf_ranger2=train(Survived~., data=train_tf, method="ranger", trControl=ctrl,metric="Kappa",tuneGrid=rf_grid,preProcess=c("center","scale"))
rf_rf2
## Random Forest
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 710, 710, 709, 710, 710, 710, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8166425 0.6055581
## 3 0.8187520 0.6099703
## 4 0.8259576 0.6237208
## 5 0.8175292 0.6022585
## 6 0.8196282 0.6058927
## 7 0.8276511 0.6252989
## 8 0.8251300 0.6208158
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was mtry = 3.
rf_ranger2
## Random Forest
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 710, 710, 709, 711, 711, 710, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8192174 0.6108921
## 3 0.8200613 0.6128125
## 4 0.8234372 0.6183661
## 5 0.8251147 0.6196011
## 6 0.8255741 0.6203797
## 7 0.8276730 0.6261899
## 8 0.8327418 0.6386241
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was mtry = 7.
rf_rf_pred2=predict(rf_rf2,test_tf)
rf_ranger_pred2=predict(rf_ranger2,test_tf)
confusionMatrix(rf_rf_pred2,test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 11
## 1 5 23
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.6011
## Mcnemar's Test P-Value : 0.2113
##
## Sensitivity : 0.9057
## Specificity : 0.6765
## Pos Pred Value : 0.8136
## Neg Pred Value : 0.8214
## Prevalence : 0.6092
## Detection Rate : 0.5517
## Detection Prevalence : 0.6782
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : 0
##
confusionMatrix(rf_ranger_pred2,test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 13
## 1 3 21
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.5923
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.9434
## Specificity : 0.6176
## Pos Pred Value : 0.7937
## Neg Pred Value : 0.8750
## Prevalence : 0.6092
## Detection Rate : 0.5747
## Detection Prevalence : 0.7241
## Balanced Accuracy : 0.7805
##
## 'Positive' Class : 0
##
plot(rf_rf2)
plot(rf_ranger2)
stopCluster(cl)
It seems that the random forest would have an accuracy of around 0.81.
Let us try some other classification algorithms such as svmLinear, svmLinear2, svmLinear3, svmPoly and svmRadial.
Normally SVM benefits from normalized data.
First, let us try the linear SVM models.
cl <- makeCluster(5)
registerDoParallel(cl)
ctrl_t=trainControl(method="repeatedcv", number = 10, repeats=10 ,selectionFunction="oneSE")
svml1_grid=expand.grid(C=seq(0.001,1,length.out = 10))
svml2_grid=expand.grid(cost=seq(0.001,1,length.out = 10))
svml3_grid=expand.grid(cost=seq(0.001,1,length.out = 10), Loss=c("L1","L2"))
rf_svml1=train(Survived~.,data=train_tf,method="svmLinear",trControl=ctrl_t,tuneGrid=svml1_grid)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
rf_svml2=train(Survived~.,data=train_tf,method="svmLinear2",trControl=ctrl_t,tuneGrid=svml2_grid)
rf_svml3=train(Survived~.,data=train_tf,method="svmLinear3",trControl=ctrl_t,tuneGrid=svml3_grid)
## Loading required package: LiblineaR
rf_svml1
## Support Vector Machines with Linear Kernel
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 711, 710, 710, 709, 709, 710, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.001 0.6971411 0.2679210
## 0.112 0.8218113 0.6192980
## 0.223 0.8228193 0.6207195
## 0.334 0.8209157 0.6166047
## 0.445 0.8196465 0.6137468
## 0.556 0.8197683 0.6136692
## 0.667 0.8187508 0.6113681
## 0.778 0.8178647 0.6093211
## 0.889 0.8177413 0.6089071
## 1.000 0.8177429 0.6086697
##
## Accuracy was used to select the optimal model using the one SE rule.
## The final value used for the model was C = 0.112.
rf_svml2
## Support Vector Machines with Linear Kernel
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 710, 710, 711, 711, 711, 710, ...
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.001 0.7027205 0.2844091
## 0.112 0.8205138 0.6169497
## 0.223 0.8216419 0.6189227
## 0.334 0.8204978 0.6159045
## 0.445 0.8202478 0.6151151
## 0.556 0.8193616 0.6129245
## 0.667 0.8189819 0.6120691
## 0.778 0.8184772 0.6108475
## 0.889 0.8183473 0.6104651
## 1.000 0.8180925 0.6099795
##
## Accuracy was used to select the optimal model using the one SE rule.
## The final value used for the model was cost = 0.112.
rf_svml3
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 789 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 711, 710, 710, 709, 711, 710, ...
## Resampling results across tuning parameters:
##
## cost Loss Accuracy Kappa
## 0.001 L1 0.8035727 0.5747374
## 0.001 L2 0.6780379 0.2067771
## 0.112 L1 0.8268875 0.6285051
## 0.112 L2 0.8234585 0.6243872
## 0.223 L1 0.8258829 0.6258107
## 0.223 L2 0.8252307 0.6274808
## 0.334 L1 0.8260079 0.6257938
## 0.334 L2 0.8248526 0.6263670
## 0.445 L1 0.8255062 0.6246297
## 0.445 L2 0.8232085 0.6225874
## 0.556 L1 0.8247468 0.6228052
## 0.556 L2 0.8224441 0.6208138
## 0.667 L1 0.8244919 0.6223242
## 0.667 L2 0.8215548 0.6187237
## 0.778 L1 0.8244935 0.6221062
## 0.778 L2 0.8215564 0.6185621
## 0.889 L1 0.8242404 0.6215058
## 0.889 L2 0.8218064 0.6188342
## 1.000 L1 0.8241154 0.6212112
## 1.000 L2 0.8214298 0.6178842
##
## Accuracy was used to select the optimal model using the one SE rule.
## The final values used for the model were cost = 0.112 and Loss = L1.
rf_svml1%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 10
## 1 6 24
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.6054
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.8868
## Specificity : 0.7059
## Pos Pred Value : 0.8246
## Neg Pred Value : 0.8000
## Prevalence : 0.6092
## Detection Rate : 0.5402
## Detection Prevalence : 0.6552
## Balanced Accuracy : 0.7963
##
## 'Positive' Class : 0
##
rf_svml2%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 10
## 1 6 24
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.6054
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.8868
## Specificity : 0.7059
## Pos Pred Value : 0.8246
## Neg Pred Value : 0.8000
## Prevalence : 0.6092
## Detection Rate : 0.5402
## Detection Prevalence : 0.6552
## Balanced Accuracy : 0.7963
##
## 'Positive' Class : 0
##
rf_svml3%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 49 10
## 1 4 24
##
## Accuracy : 0.8391
## 95% CI : (0.7448, 0.9091)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.843e-06
##
## Kappa : 0.651
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.9245
## Specificity : 0.7059
## Pos Pred Value : 0.8305
## Neg Pred Value : 0.8571
## Prevalence : 0.6092
## Detection Rate : 0.5632
## Detection Prevalence : 0.6782
## Balanced Accuracy : 0.8152
##
## 'Positive' Class : 0
##
stopCluster(cl)
Judging by the resampling results, the linear SVMs perform roughly on par with the random forest models; any remaining gap may come from non-linear structure that a linear kernel cannot capture.
Let us try the non-linear SVM models with adaptive resampling.
adaptControl <- trainControl(method = "adaptive_cv",
number = 10, repeats = 10,
adaptive = list(min = 10, alpha = 0.05,
method = "gls", complete = TRUE),
search = "random")
cl <- makeCluster(6)
registerDoParallel(cl)
t1=proc.time()
set.seed(2)
svm_lssvmRadial_Adapt = train(Survived ~ ., data = train_tf,
method = "lssvmRadial",
trControl = adaptControl,
preProc = c("center", "scale"),
metric = "Kappa",
tuneLength = 90)
t2=proc.time()
svm_lssvmRadial_Adapt%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 11
## 1 5 23
##
## Accuracy : 0.8161
## 95% CI : (0.7186, 0.8911)
## No Information Rate : 0.6092
## P-Value [Acc > NIR] : 2.749e-05
##
## Kappa : 0.6011
## Mcnemar's Test P-Value : 0.2113
##
## Sensitivity : 0.9057
## Specificity : 0.6765
## Pos Pred Value : 0.8136
## Neg Pred Value : 0.8214
## Prevalence : 0.6092
## Detection Rate : 0.5517
## Detection Prevalence : 0.6782
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : 0
##
(t2-t1)/60
## user system elapsed
## 0.62883333 0.05366667 13.33333333
stopCluster(cl)
On a side note, the SVM algorithm does not use all the CPU power of a quad-core i7-6700 machine even though I registered several cores; CPU utilization peaks at about 50%.
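One thing worth checking is how many workers are actually registered relative to the cores available; a small sketch (note that not every caret model wrapper parallelises its inner fitting, so full CPU utilization is not guaranteed in any case):
library(parallel)
library(doParallel)
detectCores()                        # logical cores on this machine
cl=makeCluster(detectCores()-1)      # leave one core free
registerDoParallel(cl)
getDoParWorkers()                    # workers that foreach/caret will actually use
stopCluster(cl)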
After a long computation spot-checking and fine-tuning many models, I found an algorithm with high accuracy.
#LogitBoost_Adapt
library(caret)
library(tidyverse)
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
t1=proc.time()
set.seed(2)
LogitBoost_Adapt = train(Survived ~ ., data = train_tf,method = "LogitBoost",trControl = adaptControl,
preProc = c("center", "scale"), metric = "Kappa",tuneLength = 540)
## Loading required package: caTools
t2=proc.time()
(t2-t1)/60
## user system elapsed
## 0.18283333 0.03133333 0.52850000
LogitBoost_Adapt%>%predict(test_tf)%>%confusionMatrix(test_tf$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 44 3
## 1 1 14
##
## Accuracy : 0.9355
## 95% CI : (0.843, 0.9821)
## No Information Rate : 0.7258
## P-Value [Acc > NIR] : 3.217e-05
##
## Kappa : 0.8318
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9778
## Specificity : 0.8235
## Pos Pred Value : 0.9362
## Neg Pred Value : 0.9333
## Prevalence : 0.7258
## Detection Rate : 0.7097
## Detection Prevalence : 0.7581
## Balanced Accuracy : 0.9007
##
## 'Positive' Class : 0
##
stopCluster(cl)