This is a classification challenge on Kaggle. I will use several classification methods to predict survival on the Titanic:
- Logistic Regression
- Classification tree
- Random forest
- Gradient Boosting
- Support Vector Machine
Details and the datasets can be found here: https://www.kaggle.com/c/titanic
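All the code below assumes the following packages are loaded (a minimal sketch; the list is inferred from the function calls in this post, not stated in the original):
library(readr)        # read_csv
library(dplyr)        # bind_rows
library(ggplot2)      # plots
library(ggthemes)     # theme_few
library(VIM)          # aggr
library(mice)         # imputation
library(rpart)        # classification tree
library(randomForest) # random forest, tuneRF
library(caret)        # train, trainControl
library(ROCR)         # prediction, performance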
train <- read_csv("C:/Users/todu6001/Downloads/train.csv")
## Parsed with column specification:
## cols(
## PassengerId = col_integer(),
## Survived = col_integer(),
## Pclass = col_integer(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_integer(),
## Parch = col_integer(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
test <- read_csv("C:/Users/todu6001/Downloads/test.csv")
## Parsed with column specification:
## cols(
## PassengerId = col_integer(),
## Pclass = col_integer(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_integer(),
## Parch = col_integer(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
#Combining both train and test to create a full data
full <- bind_rows(train, test)
#Data summary
str(full)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(full)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
aggr_plot <- aggr(full, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full),
                  cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 0.7746371276
## Survived 0.3193277311
## Age 0.2009167303
## Embarked 0.0015278839
## Fare 0.0007639419
## PassengerId 0.0000000000
## Pclass 0.0000000000
## Name 0.0000000000
## Sex 0.0000000000
## SibSp 0.0000000000
## Parch 0.0000000000
## Ticket 0.0000000000
We can see that the Cabin, Age, Embarked, and Fare variables have missing data (the missing Survived values simply correspond to the 418 test rows). The combined data has 1309 observations and 12 variables.
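A quick base-R cross-check of the same missing counts:
# NA count per column; matches the proportions reported by aggr
colSums(is.na(full))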
A passenger's full name might not explain much on its own; the title embedded in it, however, is different. A title reflects not only gender but also rank in society, and the higher a passenger's rank, the better their chance of being given a spot in a lifeboat.
# Check missing
sum(is.na(full$Name))
## [1] 0
# Take title from name
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
# Show title counts by sex
table(full$Sex, full$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
We can group the titles into "Mr", "Miss", "Mrs", "Master", and "Others".
# Titles with very low cell counts to be combined to "others"
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
# Also reassign mlle, ms, and mme accordingly
full$Title[full$Title == 'Mlle'] <- 'Miss'
full$Title[full$Title == 'Ms'] <- 'Miss'
full$Title[full$Title == 'Mme'] <- 'Mrs'
full$Title[full$Title %in% rare_title] <- 'Others'
# Show title counts by sex again
table(full$Sex, full$Title)
##
## Master Miss Mr Mrs Others
## female 0 264 0 198 4
## male 61 0 757 0 25
The size of the family might also be an interesting factor to consider.
#Check missing
sum(is.na(full$SibSp))
## [1] 0
sum(is.na(full$Parch))
## [1] 0
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1
#Use ggplot2 to visualize the relationship between family size & survival
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:11)) +
labs(x = 'Family Size') +
theme_few()
We can see that passengers travelling alone have the highest ratio of deaths to survivals of any family size.
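To put a number on that observation, a row-wise proportion table (my addition, not in the original run) gives the death/survival split for each family size in the training rows:
# Survival proportions by family size; each row sums to 1
prop.table(table(full$Fsize[1:891], full$Survived[1:891]), margin = 1)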
#Check missing
sum(is.na(full$Cabin))
## [1] 1014
#Have a look at the data
full$Cabin [1:50]
## [1] NA "C85" NA "C123" NA
## [6] NA "E46" NA NA NA
## [11] "G6" "C103" NA NA NA
## [16] NA NA NA NA NA
## [21] NA "D56" NA "A6" NA
## [26] NA NA "C23 C25 C27" NA NA
## [31] NA "B78" NA NA NA
## [36] NA NA NA NA NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
This variable has far too many missing values (1014 of 1309) to analyze directly, so I decided to drop it.
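As an aside, if one wanted to salvage some signal from Cabin before dropping it, the deck letter is the usual candidate; a sketch (the Deck column is hypothetical and is not used in the models below):
# First character of Cabin is the deck letter; stays NA where Cabin is NA
full$Deck <- substr(full$Cabin, 1, 1)
table(full$Deck, useNA = "ifany")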
# Process the Age and Fare columns
sum(is.na(full$Age))
## [1] 263
sum(is.na(full$Fare))
## [1] 1
full$PassengerId[is.na(full$Fare)]
## [1] 1044
# Convert categorical variables to factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
'Fsize','Survived','Title')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
Age is probably one of the key factors in the model, but it has many missing values. Since Age is continuous, we can try to fill in the missing values with the mice package.
# Use mice package to predict missing age and missing fare
set.seed(123)
mice_predict <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin','Family','Survived','Fsize','Title')], method='rf')
##
## iter imp variable
## 1 1 Age Fare Embarked
## 1 2 Age Fare Embarked
## 1 3 Age Fare Embarked
## 1 4 Age Fare Embarked
## 1 5 Age Fare Embarked
## 2 1 Age Fare Embarked
## 2 2 Age Fare Embarked
## 2 3 Age Fare Embarked
## 2 4 Age Fare Embarked
## 2 5 Age Fare Embarked
## 3 1 Age Fare Embarked
## 3 2 Age Fare Embarked
## 3 3 Age Fare Embarked
## 3 4 Age Fare Embarked
## 3 5 Age Fare Embarked
## 4 1 Age Fare Embarked
## 4 2 Age Fare Embarked
## 4 3 Age Fare Embarked
## 4 4 Age Fare Embarked
## 4 5 Age Fare Embarked
## 5 1 Age Fare Embarked
## 5 2 Age Fare Embarked
## 5 3 Age Fare Embarked
## 5 4 Age Fare Embarked
## 5 5 Age Fare Embarked
full_mice <- complete(mice_predict)
aggr_plot <- aggr(full_mice, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full_mice), cex.axis=.7,
                  gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Pclass 0
## Sex 0
## Age 0
## SibSp 0
## Parch 0
## Fare 0
## Embarked 0
No missing values remain, which means all the data points have been filled in. Next we check that imputation preserved the means and distributions, using histograms.
# Check distribution before and after for AGE
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Before Replacement',
col='red', ylim=c(0,0.04))
hist(full_mice$Age, freq=F, main='After Replacement',
col='blue', ylim=c(0,0.04))
# Check distribution before and after for FARE
par(mfrow=c(1,2))
hist(full$Fare, freq=F, main='Before Replacement',
col='red', ylim=c(0,0.04))
hist(full_mice$Fare, freq=F, main='After Replacement',
col='blue', ylim=c(0,0.04))
The Before and After distributions are essentially identical, so we copy the imputed values back into the full data set.
#Replace value into the data set
full$Age <- full_mice$Age
full$Fare <- full_mice$Fare
full$Embarked <- full_mice$Embarked
Now that we have finished cleaning the data, we can generate the classification models.
To compare the different models consistently, I follow these steps for each one:
1. Fit the classification model
2. Tune the attributes if necessary
3. Generate classification rate
4. Generate ROC curve and AUC score
The AUC score summarizes how well a model separates the two classes: an area of 1 represents a perfect classifier, while an area of 0.5 is no better than random guessing. A rough guide for grading a classifier is the traditional academic point system:
- .90-1 = excellent (A)
- .80-.90 = good (B)
- .70-.80 = fair (C)
- .60-.70 = poor (D)
- .50-.60 = fail (F)
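Since steps 3 and 4 repeat for every model, a small helper function (my addition, wrapping the same ROCR calls used for each model below) keeps the evaluation consistent for the 0/1-coded outcomes:
# Accuracy at a 0.5 cutoff plus AUC, from predicted class-1 probabilities
evaluate_model <- function(prob, truth, cutoff = 0.5) {
  acc <- mean(ifelse(prob > cutoff, "1", "0") == as.character(truth))
  roc_pred <- prediction(prob, truth)
  auc <- performance(roc_pred, measure = "auc")@y.values[[1]]
  list(accuracy = acc, auc = auc)
}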
# Check for missing again
aggr_plot <- aggr(full, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(full),
                  cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 0.7746371
## Survived 0.3193277
## PassengerId 0.0000000
## Pclass 0.0000000
## Name 0.0000000
## Sex 0.0000000
## Age 0.0000000
## SibSp 0.0000000
## Parch 0.0000000
## Ticket 0.0000000
## Fare 0.0000000
## Embarked 0.0000000
## Title 0.0000000
## Fsize 0.0000000
# Split the data back into a train set and a test set
trainx <- full[1:891,]
testx <- full[892:1309,]
test_sample <- sample(1:nrow(trainx), 350) # indices of a 350-row validation set
head(trainx[test_sample,]) # inspect the validation rows (View() only works interactively)
I split the data back into train and test sets after adding the new variables and cleaning all the missing data. I also carved a 350-row validation sample out of the training data to measure classification accuracy.
Every model below is built on Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Fsize, and Title. Let's start with a basic logistic regression model.
# Logistic regression model, fit on the rows outside the validation sample
logit_death <- glm(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data= trainx[-test_sample,],
family=binomial)
summary(logit_death)
##
## Call:
## glm(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp +
## Parch + Fare + Embarked + Fsize + Title, family = binomial,
## data = trainx[-test_sample, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4791 -0.4782 -0.3159 0.4545 2.6051
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.277e+01 1.695e+03 0.013 0.989285
## Pclass2 -1.520e+00 4.342e-01 -3.500 0.000465 ***
## Pclass3 -2.567e+00 4.365e-01 -5.881 4.08e-09 ***
## Sexmale -1.754e+01 1.695e+03 -0.010 0.991745
## Age -2.273e-02 1.199e-02 -1.896 0.057935 .
## SibSp -1.568e+00 9.909e+01 -0.016 0.987376
## Parch -1.781e+00 9.909e+01 -0.018 0.985657
## Fare 2.110e-03 3.174e-03 0.665 0.506205
## Embarked2 1.655e-01 5.426e-01 0.305 0.760391
## Embarked3 -2.883e-01 3.523e-01 -0.819 0.413021
## Fsize2 1.406e+00 9.909e+01 0.014 0.988677
## Fsize3 3.557e+00 1.982e+02 0.018 0.985680
## Fsize4 4.462e+00 2.973e+02 0.015 0.988025
## Fsize5 4.157e+00 3.964e+02 0.010 0.991632
## Fsize6 4.920e+00 4.955e+02 0.010 0.992078
## Fsize7 6.911e+00 5.945e+02 0.012 0.990726
## Fsize8 -5.959e+00 1.179e+03 -0.005 0.995968
## Fsize11 NA NA NA NA
## TitleMiss -1.912e+01 1.695e+03 -0.011 0.991002
## TitleMr -4.724e+00 1.010e+00 -4.678 2.89e-06 ***
## TitleMrs -1.853e+01 1.695e+03 -0.011 0.991277
## TitleOthers -4.938e+00 1.245e+00 -3.965 7.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 716.95 on 540 degrees of freedom
## Residual deviance: 399.27 on 520 degrees of freedom
## AIC: 441.27
##
## Number of Fisher Scoring iterations: 15
#Predict on validation data
predict_logit <- predict(logit_death,
trainx[test_sample,],
type="response")
## Warning: contrasts dropped from factor Embarked
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
fitted.results <- ifelse(predict_logit > 0.5,1,0)
mean(fitted.results == trainx[test_sample,]$Survived)
## [1] 0.8171429
It's striking that only Pclass2, Pclass3, TitleMr, and TitleOthers come out significant while the other attributes don't. The enormous standard errors on Sexmale and several Title and Fsize dummies (together with the rank-deficiency warning above) suggest these predictors are nearly collinear — Title already encodes Sex — so the individual p-values should not be over-interpreted.
AUC SCORE AND ROC CURVE
roc_pr <- prediction(predict_logit,
trainx[test_sample,]$Survived)
perform <- performance(roc_pr,measure = "tpr",
x.measure = "fpr")
plot(perform)
auc <- performance(roc_pr,measure="auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8432458
The classification accuracy is 81.7% and the AUC score is 0.84.
TREE MODEL. There are multiple packages that can fit a classification tree. However, I find it rather difficult to generate a ROC curve from a tree() model, so I decided to use the rpart package instead.
### DECISION TREE USING RPART
rpart_deathrate <- rpart(formula = Survived ~ Pclass + Sex + Age +
SibSp + Parch + Fare +
Embarked + Fsize + Title,
trainx[test_sample,])
printcp(rpart_deathrate)
##
## Classification tree:
## rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked + Fsize + Title, data = trainx[test_sample,
## ])
##
## Variables actually used in tree construction:
## [1] Age Embarked Fsize Pclass Sex Title
##
## Root node error: 138/350 = 0.39429
##
## n= 350
##
## CP nsplit rel error xerror xstd
## 1 0.442029 0 1.00000 1.00000 0.066251
## 2 0.028986 1 0.55797 0.57971 0.056926
## 3 0.021739 2 0.52899 0.60145 0.057661
## 4 0.014493 5 0.45652 0.60145 0.057661
## 5 0.010000 7 0.42754 0.61594 0.058133
plotcp(rpart_deathrate)
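The cp table shows the cross-validated error (xerror) bottoms out after the first split (0.58), suggesting the deeper tree overfits slightly. The standard rpart idiom for pruning back to that point is shown below (my addition; the results that follow use the unpruned tree):
# Prune at the CP value that minimizes cross-validated error
best_cp <- rpart_deathrate$cptable[which.min(rpart_deathrate$cptable[, "xerror"]), "CP"]
rpart_pruned <- prune(rpart_deathrate, cp = best_cp)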
predict_rpart <- predict(rpart_deathrate,trainx[-test_sample,], type="prob")[,2]
## Warning: contrasts dropped from factor Embarked
predict_rpart_class <- predict(rpart_deathrate,trainx[-test_sample,], type="class")
## Warning: contrasts dropped from factor Embarked
t <- table(predict_rpart_class,trainx[-test_sample,]$Survived)
sum(diag(t))/sum(t)
## [1] 0.8151571
The classification accuracy is 81.5%.
Next I use the same method to generate the ROC curve, for comparison across the different methods.
roc_rpart <- prediction(predict_rpart,trainx[-test_sample,]$Survived)
rpart_perform <- performance(roc_rpart,measure = "tpr",
x.measure = "fpr")
plot(rpart_perform)
auc_rpart <- performance(roc_rpart,measure="auc")
auc_rpart <- auc_rpart@y.values[[1]]
auc_rpart
## [1] 0.8013033
The AUC score is 0.801, which is lower than that of plain logistic regression.
One of the most famous classification methods is the random forest. In its simplest form:
simplerf.deathrate <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data = trainx[test_sample,],
ntree=500,
mtry=3)
simplerf.deathrate
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Fsize + Title, data = trainx[test_sample, ], ntree = 500, mtry = 3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.86%
## Confusion matrix:
## 0 1 class.error
## 0 192 20 0.09433962
## 1 46 92 0.33333333
The randomForest package exposes many tunable parameters, but I will focus on two:
- ntree: the number of trees to grow.
- mtry: the number of variables randomly sampled as candidates at each split.
TUNE MTRY USING THE TUNERF FUNCTION
tune <- tuneRF (trainx[test_sample,c("Pclass","Sex","Age","SibSp","Parch",
"Fare","Embarked","Fsize","Title")],
trainx[test_sample,]$Survived,
ntreeTry = 500)
## mtry = 3 OOB error = 19.43%
## Searching left ...
## mtry = 2 OOB error = 20.57%
## -0.05882353 0.05
## Searching right ...
## mtry = 6 OOB error = 22.57%
## -0.1617647 0.05
x <- trainx[test_sample,c("Pclass","Sex","Age","SibSp","Parch",
"Fare","Embarked","Fsize","Title")]
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
control <- trainControl(method="repeatedcv",
number=10,
repeats=3,
search="random")
metric <- "Accuracy"
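The caret objects above (tunegrid, control, metric) set up an alternative cross-validated search; to run it, one could call train() like this (a sketch only, not evaluated here):
# Sketch: caret-based random forest tune with the objects defined above
rf_caret <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                    Fare + Embarked + Fsize + Title,
                  data = trainx[test_sample,],
                  method = "rf", metric = metric,
                  tuneGrid = tunegrid, trControl = control)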
Both tuning approaches point to mtry = 3 (which is also the default here, sqrt(9) = 3).
Let's run the model with mtry = 3 and ntree = 1000, tracing the OOB error every 50 trees from 50 up to 1000 to see which value gives the lowest error rate.
rf_deathrate <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data = trainx[test_sample,],
ntree=1000,
mtry=3, do.trace=50)
## ntree OOB 1 2
## 50: 20.00% 9.91% 35.51%
## 100: 19.71% 9.91% 34.78%
## 150: 19.71% 9.91% 34.78%
## 200: 20.57% 10.38% 36.23%
## 250: 19.43% 8.49% 36.23%
## 300: 20.29% 8.96% 37.68%
## 350: 20.29% 9.43% 36.96%
## 400: 20.57% 9.43% 37.68%
## 450: 20.29% 9.43% 36.96%
## 500: 20.86% 10.38% 36.96%
## 550: 20.29% 10.38% 35.51%
## 600: 21.14% 10.85% 36.96%
## 650: 21.14% 10.85% 36.96%
## 700: 20.29% 9.91% 36.23%
## 750: 20.57% 10.38% 36.23%
## 800: 20.57% 10.38% 36.23%
## 850: 20.57% 10.38% 36.23%
## 900: 20.86% 10.85% 36.23%
## 950: 20.00% 9.43% 36.23%
## 1000: 20.00% 9.43% 36.23%
rf_deathrate
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Fsize + Title, data = trainx[test_sample, ], ntree = 1000, mtry = 3, do.trace = 50)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## 0 1 class.error
## 0 192 20 0.09433962
## 1 50 88 0.36231884
The trace shows the OOB error is lowest around ntree = 250 (19.43%); however, the differences across ntree values are small and not meaningfully better than the default ntree = 500.
pred <- predict(rf_deathrate,trainx[-test_sample,])
mean(pred == trainx[-test_sample,]$Survived)
## [1] 0.8373383
predsimp <- predict(simplerf.deathrate,trainx[-test_sample,])
mean(predsimp == trainx[-test_sample,]$Survived)
## [1] 0.8391867
The prediction accuracy of the default model is actually slightly better than the tuned model in this case (83.9% vs 83.7%).
Let's generate the ROC curve.
pred_rf <- predict(rf_deathrate,
trainx[-test_sample,],
type="prob")[,2]
roc_rf <- prediction(pred_rf,trainx[-test_sample,]$Survived)
perform_rf <- performance(roc_rf,measure = "tpr",
x.measure = "fpr")
plot(perform_rf)
auc_rf <- performance(roc_rf,measure="auc")
auc_rf <- auc_rf@y.values[[1]]
auc_rf
## [1] 0.8826511
The AUC score is 0.88, which is clearly better than both logistic regression and the decision tree.
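To see which features drive the forest, we can also inspect variable importance (a quick addition using randomForest's built-in plot; not part of the original run):
# Mean decrease in Gini impurity for each predictor
varImpPlot(rf_deathrate)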
Gradient boosting is a powerful method for improving the accuracy of a predictive model. Details can be found in this source: https://www.analyticsvidhya.com/blog/2015/09/complete-guide-boosting-methods/
#TUNING PARAMETERS FOR GBM
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (1:30)*50,
shrinkage = 0.01,
n.minobsinnode = 20)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## class probabilities not needed for accuracy-based tuning
                           classProbs = FALSE,
                           ## keep the resampled predictions for inspection
                           savePredictions = TRUE)
gbmFit2 <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
data= trainx[test_sample,],
method = "gbm",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaluate:
tuneGrid = gbmGrid)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.3.3
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.3.3
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.3.3
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
## Warning in gbm.fit(x = structure(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, :
## variable 16: Fsize8 has no variation.
gbmFit2
## Stochastic Gradient Boosting
##
## 350 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 315, 316, 315, 315, 314, 315, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7792232 0.5236165
## 1 100 0.7797946 0.5252812
## 1 150 0.7777778 0.5214060
## 1 200 0.7752386 0.5177372
## 1 250 0.7749528 0.5182954
## 1 300 0.7772474 0.5242965
## 1 350 0.7769048 0.5240855
## 1 400 0.7769127 0.5244509
## 1 450 0.7743483 0.5192265
## 1 500 0.7743567 0.5194277
## 1 550 0.7723562 0.5152104
## 1 600 0.7715317 0.5133764
## 1 650 0.7712288 0.5128436
## 1 700 0.7744057 0.5191436
## 1 750 0.7766424 0.5238586
## 1 800 0.7777932 0.5258077
## 1 850 0.7775401 0.5249187
## 1 900 0.7775560 0.5246460
## 1 950 0.7781036 0.5260849
## 1 1000 0.7784057 0.5264777
## 1 1050 0.7789613 0.5274942
## 1 1100 0.7786835 0.5266365
## 1 1150 0.7809869 0.5318383
## 1 1200 0.7803992 0.5299829
## 1 1250 0.7803828 0.5295040
## 1 1300 0.7818114 0.5326164
## 1 1350 0.7803903 0.5295035
## 1 1400 0.7815668 0.5326768
## 1 1450 0.7821218 0.5335887
## 1 1500 0.7807250 0.5310262
## 5 50 0.7743249 0.4740193
## 5 100 0.7641078 0.4627010
## 5 150 0.7735387 0.5015669
## 5 200 0.7830187 0.5262554
## 5 250 0.7843903 0.5298762
## 5 300 0.7857866 0.5328189
## 5 350 0.7877951 0.5375277
## 5 400 0.7877951 0.5380345
## 5 450 0.7880490 0.5393176
## 5 500 0.7891998 0.5422519
## 5 550 0.7914944 0.5479439
## 5 600 0.7906858 0.5454474
## 5 650 0.7923922 0.5495676
## 5 700 0.7955121 0.5566724
## 5 750 0.7949734 0.5556910
## 5 800 0.7972526 0.5603525
## 5 850 0.7972446 0.5611585
## 5 900 0.8000616 0.5676671
## 5 950 0.7981022 0.5633781
## 5 1000 0.7977754 0.5628775
## 5 1050 0.7972199 0.5616146
## 5 1100 0.7980859 0.5637605
## 5 1150 0.7986494 0.5649982
## 5 1200 0.7966732 0.5607047
## 5 1250 0.7980943 0.5642379
## 5 1300 0.7958161 0.5595881
## 5 1350 0.7961270 0.5605021
## 5 1400 0.7958492 0.5598379
## 5 1450 0.7964132 0.5614092
## 5 1500 0.7938487 0.5561551
## 9 50 0.7743249 0.4740193
## 9 100 0.7640831 0.4608702
## 9 150 0.7712530 0.4955789
## 9 200 0.7783641 0.5153469
## 9 250 0.7814669 0.5224519
## 9 300 0.7843576 0.5293234
## 9 350 0.7849692 0.5315500
## 9 400 0.7875173 0.5370467
## 9 450 0.7883828 0.5392792
## 9 500 0.7895504 0.5426462
## 9 550 0.7898361 0.5433487
## 9 600 0.7904160 0.5447184
## 9 650 0.7915023 0.5471743
## 9 700 0.7940999 0.5530091
## 9 750 0.7947208 0.5553146
## 9 800 0.7952596 0.5562399
## 9 850 0.7944024 0.5548046
## 9 900 0.7961172 0.5587507
## 9 950 0.7946881 0.5555475
## 9 1000 0.7969585 0.5607300
## 9 1050 0.7969748 0.5615377
## 9 1100 0.7949664 0.5569394
## 9 1150 0.7943866 0.5564008
## 9 1200 0.7943623 0.5564182
## 9 1250 0.7941013 0.5558120
## 9 1300 0.7938394 0.5556119
## 9 1350 0.7938399 0.5557241
## 9 1400 0.7941013 0.5559923
## 9 1450 0.7935299 0.5550190
## 9 1500 0.7932446 0.5547693
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 900,
## interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 20.
plot(gbmFit2)
gbm_pred1 <- predict(gbmFit2,trainx[-test_sample,])
## Warning: contrasts dropped from factor Embarked
mean(gbm_pred1 == trainx[-test_sample,]$Survived)
## [1] 0.818854
gbm_pred <- predict(gbmFit2,trainx[-test_sample,],type="prob")
## Warning: contrasts dropped from factor Embarked
The classification accuracy is a little lower than I expected: only 81.9%.
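For reference, caret stores the winning hyper-parameter combination in the fitted object:
# Selected tuning values (n.trees = 900, interaction.depth = 5, ...)
gbmFit2$bestTune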
roc_gbm <- prediction(gbm_pred[,2],trainx[-test_sample,]$Survived)
perform_gbm <- performance(roc_gbm,measure = "tpr",
x.measure = "fpr")
plot(perform_gbm)
auc_gbm <- performance(roc_gbm,measure="auc")
auc_gbm<- auc_gbm@y.values[[1]]
auc_gbm
## [1] 0.8646797
The AUC score is decent compared to the other methods, but it still can't outperform the random forest's 0.88.
Overview:
SVMs are a group of classifiers considered among the best "out of the box" methods. There are several types of SVMs. The simplest is the _maximal margin classifier_ (MMC). Though the MMC is elegant and simple, it cannot be applied to most data sets, since it requires that the classes be separable by a linear boundary. The support vector classifier extends the maximal margin classifier and can be applied in a broader range of cases. The support vector machine is a further extension that accommodates non-linear class boundaries.
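The svmRadial model tuned below has two knobs: the cost C, which controls how heavily margin violations are penalized, and the width sigma of the radial basis function kernel used by kernlab:
K(x, x') = exp(-sigma * ||x - x'||^2)
Larger sigma gives a more flexible decision boundary, and larger C fits the training data more tightly; the grid below searches a small neighborhood of candidate values for both.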
# caret requires factor labels (e.g. "died"/"survived") rather than 0/1 for the dependent variable when class probabilities are computed
trainx$Survived <- factor(trainx$Survived, levels = c(0,1), labels = c("died","survived"))
# Setup for cross validation
set.seed(123)
ctrl <- trainControl(method="cv",
number = 2,
summaryFunction=twoClassSummary,
classProbs=TRUE)
# Grid search to fine tune SVM
grid <- expand.grid(sigma = c(.01, .015, 0.2),
C = c(0.75, 0.9, 1, 1.1, 1.25))
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
##
## alpha
## The following object is masked from 'package:ggplot2':
##
## alpha
# Note: despite the object name, this fits a radial-kernel SVM (method = "svmRadial")
svm_Linear <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Fsize + Title,
trainx[test_sample,],
method = "svmRadial",
metric="ROC",
trControl=ctrl,
tuneGrid=grid
)
svm_Linear
## Support Vector Machines with Radial Basis Function Kernel
##
## 350 samples
## 9 predictor
## 2 classes: 'died', 'survived'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 175, 175
## Resampling results across tuning parameters:
##
## sigma C ROC Sens Spec
## 0.010 0.75 0.8292316 0.9009434 0.6376812
## 0.010 0.90 0.8308039 0.9009434 0.6376812
## 0.010 1.00 0.8311457 0.9009434 0.6304348
## 0.010 1.10 0.8305305 0.8962264 0.6594203
## 0.010 1.25 0.8283429 0.9009434 0.6376812
## 0.015 0.75 0.8258819 0.9009434 0.6376812
## 0.015 0.90 0.8251982 0.9009434 0.6449275
## 0.015 1.00 0.8244463 0.8915094 0.6449275
## 0.015 1.10 0.8228056 0.8915094 0.6521739
## 0.015 1.25 0.8226689 0.8915094 0.6521739
## 0.200 0.75 0.8205496 0.8396226 0.6811594
## 0.200 0.90 0.8202078 0.8537736 0.6521739
## 0.200 1.00 0.8229423 0.8396226 0.6666667
## 0.200 1.10 0.8240361 0.8726415 0.6376812
## 0.200 1.25 0.8219169 0.8726415 0.6231884
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 1.
# Generate prediction and calculate accuracy
pred_svm <- predict( svm_Linear,trainx[-test_sample,])
## Warning: contrasts dropped from factor Embarked
mean(pred_svm == trainx[-test_sample,]$Survived)
## [1] 0.8280961
# ROC curve and AUC
pred_svm2 <- predict(svm_Linear,
trainx[-test_sample,],
type="prob")[,2]
## Warning: contrasts dropped from factor Embarked
roc_svm <- prediction(pred_svm2,trainx[-test_sample,]$Survived)
perform_svm <- performance(roc_svm,measure = "tpr",
x.measure = "fpr")
plot(perform_svm)
auc_svm <- performance(roc_svm,measure="auc")
auc_svm <- auc_svm@y.values[[1]]
auc_svm
## [1] 0.8676689
Summary of results:
- Logistic regression: accuracy 81.71%, AUC 0.8432
- Classification tree: accuracy 81.51%, AUC 0.8013
- Random forest: accuracy 83.91%, AUC 0.8826
- Gradient boosting: accuracy 81.89%, AUC 0.8647
- Support vector machine: accuracy 82.81%, AUC 0.8677
The random forest has the highest AUC score and the highest prediction accuracy. Gradient boosting also has a decent AUC, but a comparatively low accuracy. In this situation, the random forest is therefore the best performer.
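Finally, the winning random forest can be turned into a Kaggle submission. A sketch (my addition; it retrains on all 891 training rows and writes the PassengerId/Survived format the competition expects):
# Retrain on the full training data and predict the 418 test rows
final_rf <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch +
                           Fare + Embarked + Fsize + Title,
                         data = trainx, ntree = 500, mtry = 3)
# trainx$Survived was relabeled "died"/"survived" above; map back to 0/1
submission <- data.frame(PassengerId = testx$PassengerId,
                         Survived = ifelse(predict(final_rf, testx) == "survived", 1, 0))
write_csv(submission, "titanic_submission.csv")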