In this exercise, I would like to practice Titanic survival prediction using the caret package.
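The setup chunk is not shown in this write-up; a minimal sketch of it, consistent with the packages and calls used later (caret, dplyr, ggplot2, here), might look like this:
library(caret)   # model training, resampling and data splitting
library(dplyr)   # data wrangling
library(ggplot2) # plotting
library(here)    # project-relative file paths
dataset <- read.csv(here("data","train.csv"), na.strings="")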
dim(dataset) # rows and cols of dataset
## [1] 891 12
# The first 10 rows
head(dataset,10)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## 7 7 0 1
## 8 8 0 3
## 9 9 1 3
## 10 10 1 2
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 8 Palsson, Master. Gosta Leonard male 2 3 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 <NA> S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 <NA> S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 <NA> S
## 6 330877 8.4583 <NA> Q
## 7 17463 51.8625 E46 S
## 8 349909 21.0750 <NA> S
## 9 347742 11.1333 <NA> S
## 10 237736 30.0708 <NA> C
# attributes of dataset
sapply(dataset,class)
## PassengerId Survived Pclass Name Sex Age
## "integer" "integer" "integer" "character" "character" "numeric"
## SibSp Parch Ticket Fare Cabin Embarked
## "integer" "integer" "character" "numeric" "character" "character"
We may not need PassengerId, Ticket, or Cabin, so we will remove them. We will also convert Survived, Pclass, Sex, and Embarked into factors before viewing the summary.
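The conversion chunk itself is omitted; a minimal sketch of it, assuming the same dplyr idiom as the later wrangling step:
dataset <- dataset %>%
  select(-c('PassengerId','Ticket','Cabin')) %>%
  mutate_at(.vars=c('Survived','Pclass','Sex','Embarked'), .funs=as.factor)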
summary(dataset)
## Survived Pclass Name Sex Age
## 0:549 1:216 Length:891 female:314 Min. : 0.42
## 1:342 2:184 Class :character male :577 1st Qu.:20.12
## 3:491 Mode :character Median :28.00
## Mean :29.70
## 3rd Qu.:38.00
## Max. :80.00
## NA's :177
## SibSp Parch Fare Embarked
## Min. :0.000 Min. :0.0000 Min. : 0.00 C :168
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91 Q : 77
## Median :0.000 Median :0.0000 Median : 14.45 S :644
## Mean :0.523 Mean :0.3816 Mean : 32.20 NA's: 2
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 Max. :512.33
##
We can see we have 177 NA values for the Age attribute and 2 NA values for Embarked. This suggests we may need to remove these records, or impute their values, for some analysis and modeling techniques.
In a classification problem, you must know the proportion of instances that belong to each class label. This is important because it may highlight an imbalance in the data which, if severe, may need to be addressed with rebalancing techniques. In a multi-class classification problem, it may also expose a class with few or zero instances that could be a candidate for removal from the dataset.
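The chunk that produced the breakdown below is omitted; it was presumably something like this sketch:
# Class distribution: counts and percentages of Survived
percentage <- prop.table(table(dataset$Survived)) * 100
cbind(freq=table(dataset$Survived), percentage=percentage)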
## freq percentage
## 0 549 61.61616
## 1 342 38.38384
This is roughly a 62% vs 38% split for the class values, which is imbalanced, but not so much that we need to think about rebalancing, at least not yet.
Let's look at the correlation between the attributes. We have to exclude the rows with NA values (incomplete cases) when calculating the correlations.
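A sketch of that calculation (the original chunk is omitted):
# Pearson correlations among the numeric attributes, complete cases only
cor(dataset[complete.cases(dataset), c('Age','SibSp','Parch','Fare')])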
## Age SibSp Parch Fare
## Age 1.00000000 -0.3073509 -0.1878965 0.09314252
## SibSp -0.30735094 1.0000000 0.3833375 0.13986049
## Parch -0.18789649 0.3833375 1.0000000 0.20662367
## Fare 0.09314252 0.1398605 0.2066237 1.00000000
There is no strong correlation among the numeric attributes; the largest, between SibSp and Parch, is only 0.38.
Except for Age, we can see that the other distributions have an exponential shape. We may benefit from log transforms or other power transforms later on.
Let's use density plots to get a smoother look at the distributions.
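One way to draw them is with caret's featurePlot (a sketch; the original plotting chunk is not reproduced here):
# Density of each numeric attribute, grouped by the Survived class
featurePlot(x=dataset[,c('Age','SibSp','Parch','Fare')], y=dataset$Survived,
            plot="density",
            scales=list(x=list(relation="free"), y=list(relation="free")))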
These plots add more support to our initial ideas: we can see exponential-looking distributions, and Age does not appear normally distributed. Let's perform a Shapiro-Wilk or Kolmogorov-Smirnov test to confirm this.
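A sketch of that check (the original chunk and its output are omitted):
# Shapiro-Wilk normality test for Age; shapiro.test tolerates NAs
shapiro.test(dataset$Age)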
Box-and-whisker plots (not reproduced here) also point out the skew in many distributions, so much so that some data looks like outliers (i.e. beyond the whiskers of the plots).
Correlation Plots for numeric variables
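A sketch of one way to produce such a plot, assuming the corrplot package (the original chunk is not shown):
library(corrplot)
correlations <- cor(dataset[complete.cases(dataset), c('Age','SibSp','Parch','Fare')])
corrplot(correlations, method="circle")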
We re-read the training dataset before wrangling it:
dataset <- read.csv(here("data","train.csv"),na.strings="")
We will create a function to convert Name to Title, which we will apply to both the train and test datasets:
NameToTitle <- function(data){
  # Extract the title (e.g. Mr, Mrs, Miss) from the Name column
  data$Title <- gsub('(.*, )|(\\..*)', '', data$Name)
  # Titles with low counts are combined into a "RareTitle" category
  rare_title <- c('Capt','Col','Don','Dona','Dr','Jonkheer','Lady',
                  'Major','Rev','Sir','the Countess')
  # Reassign Mlle, Ms, and Mme to their common equivalents
  data$Title[data$Title == 'Mlle'] <- 'Miss'
  data$Title[data$Title == 'Ms']   <- 'Miss'
  data$Title[data$Title == 'Mme']  <- 'Mrs'
  data$Title[data$Title %in% rare_title] <- 'RareTitle'
  return(data)
}
Then we create Title from Name for the train dataset:
dataset <- NameToTitle(dataset)
# Check Title count by Sex
table(dataset$Sex,dataset$Title)
##
## Master Miss Mr Mrs RareTitle
## female 0 185 0 126 3
## male 40 0 517 0 20
# Create a family-size feature: siblings/spouses + parents/children + the passenger
dataset$Fsize <- dataset$SibSp + dataset$Parch + 1
# Checking the family size and survival
ggplot(dataset, aes(x = Fsize, fill = as.factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  xlab("Family members") +
  scale_fill_discrete(name = "Survived") +
  ggtitle("Survivors by Number of Family members")
For data wrangling, we convert the remaining categorical columns to factors and drop redundant or unneeded columns:
dataset <- dataset %>%
  mutate_at(.vars=c('Survived','Pclass','Embarked'), .funs=as.factor) %>%
  select(-c('Sex','SibSp','Parch')) %>%               # collinear with Title and Fsize
  select(-c('PassengerId','Name','Ticket','Cabin'))   # identifiers, not useful as predictors
head(dataset)
## Survived Pclass Age Fare Embarked Title Fsize
## 1 0 3 22 7.2500 S Mr 2
## 2 1 1 38 71.2833 C Mrs 2
## 3 1 3 26 7.9250 S Miss 1
## 4 1 1 35 53.1000 S Mrs 2
## 5 0 3 35 8.0500 S Mr 1
## 6 0 3 NA 8.4583 Q Mr 1
# Check for missing value
colSums(is.na(dataset))
## Survived Pclass Age Fare Embarked Title Fsize
## 0 0 177 0 2 0 0
We found missing values in Age (177) and Embarked (2); the test (submission) set will also turn out to have missing Fare values, so we define a strategy for Fare as well. The strategy for treating missing values is as follows:
For Embarked, since it's a factor, the 2 missing values can be replaced with its mode.
For Age, we can replace missing values with the mean if there are no outliers, and with the median otherwise.
The same applies to missing values in Fare.
# Check for Embarked mode
library(tracerer) # for calc_mode function
Embarked_mode <- calc_mode(dataset$Embarked) # checking Embarked Mode
Embarked_mode
## [1] S
## Levels: C Q S
We can see most passengers embarked the Titanic at Southampton (S), so we should replace the missing values with 'S', the mode of Embarked.
dataset$Embarked[is.na(dataset$Embarked)] <- "S"
levels(dataset$Embarked)
## [1] "C" "Q" "S"
We found outliers in the Age feature, so we had better replace the missing values with the median rather than the mean.
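A quick numeric check (a sketch; the boxplots themselves are not reproduced here):
# Count values beyond the whiskers (1.5 * IQR rule) in Age and Fare
length(boxplot.stats(na.omit(dataset$Age))$out)
length(boxplot.stats(na.omit(dataset$Fare))$out)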
Age_median <- median(dataset$Age, na.rm=T) # get the Median of Age
Fare_median <- median(dataset$Fare, na.rm=T) # get the Median of Fare
dataset$Age[is.na(dataset$Age)] <- Age_median
dataset$Fare[is.na(dataset$Fare)] <- Fare_median
# check again if any missing values
dataset %>% anyNA()
## [1] FALSE
Missing values in Age (and Fare, had there been any) were replaced with the median. There are no NAs left in our dataset.
We will finalize the dataset by splitting it into validation and training/testing sets.
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
validationIndex <- createDataPartition(dataset$Survived, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validationIndex,]
# use the remaining 80% of the data to train and test the models
train <- dataset[validationIndex,]
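A quick sanity check on the split sizes (a sketch; the model summaries below confirm 714 training rows):
nrow(train)      # roughly 80% of the 891 rows
nrow(validation) # the remaining ~20%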
We don't know beforehand which algorithms will perform well on this data. We have to spot-check various methods, see what looks good, and then double down on those. We will evaluate:
Logistic Regression (LG)
Linear Discriminant Analysis (LDA)
Regularized Logistic Regression (GLMNET)
k-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Naive Bayes (NB)
Support Vector Machines with a Radial Basis Function kernel (SVM)
We have a good amount of data, so we will use 10-fold cross-validation with 3 repeats. This is a good standard test-harness configuration. It is a binary classification problem. For simplicity, we will use the Accuracy and Kappa metrics. We could have gone with the Area Under the ROC Curve (AUC) and looked at sensitivity and specificity to select the best algorithms.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG - Logistic Regression
set.seed(7)
fit.glm <- train(Survived~., data=train, method="glm",
                 metric=metric, trControl=trainControl)
# LDA - Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(Survived~., data=train, method="lda",
                 metric=metric, trControl=trainControl)
# GLMNET - Regularized Logistic Regression
set.seed(7)
fit.glmnet <- train(Survived~., data=train, method="glmnet",
                    metric=metric, trControl=trainControl)
# KNN - k-Nearest Neighbors
set.seed(7)
fit.knn <- train(Survived~., data=train, method="knn",
                 metric=metric, trControl=trainControl)
# CART - Classification and Regression Trees
set.seed(7)
fit.cart <- train(Survived~., data=train, method="rpart",
                  metric=metric, trControl=trainControl)
# NB - Naive Bayes
set.seed(7)
Grid <- expand.grid(usekernel=TRUE, adjust=1, fL=c(0.2,0.5,0.8))
fit.nb <- train(Survived~., data=train, method="nb",
                metric=metric, trControl=trainControl,
                tuneGrid=Grid)
# SVM - Support Vector Machines with Radial Basis Function kernel
set.seed(7)
fit.svm <- train(Survived~., data=train, method="svmRadial",
                 metric=metric, trControl=trainControl)
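The chunk that collected the resampling results for the comparison below is omitted; it presumably mirrored the ensemble comparison later on:
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet,
                          KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)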
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.7464789 0.7805164 0.8181729 0.8174426 0.8466843 0.8888889 0
## LDA 0.7323944 0.7887324 0.8194444 0.8160146 0.8450704 0.9027778 0
## GLMNET 0.7323944 0.7805164 0.8194444 0.8169666 0.8333333 0.9027778 0
## KNN 0.5915493 0.6944444 0.7202660 0.7260563 0.7605634 0.8169014 0
## CART 0.7323944 0.7777778 0.8028169 0.8029734 0.8421362 0.8888889 0
## NB 0.6619718 0.7333236 0.7500000 0.7586724 0.7909331 0.8309859 0
## SVM 0.7042254 0.8028169 0.8309859 0.8244588 0.8591549 0.9014085 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.4379947 0.5285212 0.6159694 0.6086985 0.6803949 0.7692308 0
## LDA 0.3931624 0.5348390 0.6125828 0.6036667 0.6686419 0.7967742 0
## GLMNET 0.4197849 0.5302207 0.6151149 0.6055088 0.6473665 0.7967742 0
## KNN 0.1012658 0.3571429 0.4053942 0.4093096 0.5009117 0.5971192 0
## CART 0.4197849 0.5200000 0.5816498 0.5773519 0.6626215 0.7692308 0
## NB 0.1630648 0.3568799 0.4087276 0.4272276 0.5122415 0.6196429 0
## SVM 0.3292848 0.5628848 0.6253298 0.6114487 0.6957238 0.7923109 0
The highest accuracy comes from SVM, with a mean of 82.45%.
5. Modeling: Ensembles
Let's look at some boosting and bagging ensemble algorithms on the dataset. We will try four ensemble methods:
Bagging: Bagged CART (BAG) and Random Forest (RF).
Boosting: Stochastic Gradient Boosting (GBM) and C5.0 (C50).
# Bagged CART
set.seed(7)
fit.treebag <- train(Survived~., data=train, method="treebag",
                     metric=metric, trControl=trainControl)
# RF - Random Forest
set.seed(7)
fit.rf <- train(Survived~., data=train, method="rf",
                metric=metric, trControl=trainControl)
# GBM - Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(Survived~., data=train, method="gbm",
                 metric=metric, trControl=trainControl, verbose=FALSE)
# C5.0
set.seed(7)
fit.c50 <- train(Survived~., data=train, method="C5.0",
                 metric=metric, trControl=trainControl)
# Compare results
ensembleResults <- resamples(list(BAG=fit.treebag,RF=fit.rf,GBM=fit.gbm,C50=fit.c50))
summary(ensembleResults)
##
## Call:
## summary.resamples(object = ensembleResults)
##
## Models: BAG, RF, GBM, C50
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BAG 0.7323944 0.7887324 0.8169014 0.8113524 0.8333333 0.9014085 0
## RF 0.7323944 0.7944542 0.8181729 0.8253521 0.8561718 0.9305556 0
## GBM 0.7605634 0.7887324 0.8169014 0.8216028 0.8333333 0.9154930 0
## C50 0.7464789 0.8028169 0.8392019 0.8337963 0.8606221 0.9014085 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BAG 0.4294643 0.5318712 0.6080575 0.5968252 0.6470130 0.7893175 0
## RF 0.4362725 0.5669682 0.6115121 0.6245121 0.6917479 0.8529412 0
## GBM 0.4570400 0.5368370 0.6058716 0.6143691 0.6508810 0.8207071 0
## C50 0.4462738 0.5693241 0.6516580 0.6409762 0.7030321 0.7893175 0
dotplot(ensembleResults)
Interestingly, C5.0 is now the algorithm with the highest mean accuracy (83.38%), followed by RF (82.54%) and SVM (82.45%). We will select these as our final models for prediction on the validation dataset.
The three algorithms with the highest accuracy are selected for prediction: C5.0, SVM, and RF.
# train a model and summarize model
set.seed(7)
finalModel.c50 <- train(Survived~., data=train, method="C5.0",
                        metric=metric, trControl=trainControl)
print(finalModel.c50)
## C5.0
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8137259 0.5959738
## rules FALSE 10 0.8151148 0.5999225
## rules FALSE 20 0.8202530 0.6108416
## rules TRUE 1 0.8095201 0.5846711
## rules TRUE 10 0.8067358 0.5833702
## rules TRUE 20 0.8090832 0.5878150
## tree FALSE 1 0.8132694 0.5949868
## tree FALSE 10 0.8207029 0.6113957
## tree FALSE 20 0.8337963 0.6409762
## tree TRUE 1 0.8072053 0.5799236
## tree TRUE 10 0.8072444 0.5827135
## tree TRUE 20 0.8136998 0.5976517
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.c50, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 17
## 1 6 51
##
## Accuracy : 0.8701
## 95% CI : (0.8114, 0.9158)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 5.999e-14
##
## Kappa : 0.7168
##
## Mcnemar's Test P-Value : 0.03706
##
## Sensitivity : 0.9450
## Specificity : 0.7500
## Pos Pred Value : 0.8583
## Neg Pred Value : 0.8947
## Prevalence : 0.6158
## Detection Rate : 0.5819
## Detection Prevalence : 0.6780
## Balanced Accuracy : 0.8475
##
## 'Positive' Class : 0
##
We can see that the estimated accuracy on the training dataset was 83.38%. Applying the final model to the validation dataset, the accuracy was 87.01%. That is a good prediction on unseen data.
# train a model and summarize model
set.seed(7)
finalModel.svm <- train(Survived~., data=train, method="svmRadial",
                        metric=metric, trControl=trainControl)
print(finalModel.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8132433 0.5863264
## 0.50 0.8197835 0.6017074
## 1.00 0.8244588 0.6114487
##
## Tuning parameter 'sigma' was held constant at a value of 0.1120186
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1120186 and C = 1.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.svm, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 18
## 1 8 50
##
## Accuracy : 0.8531
## 95% CI : (0.7922, 0.9017)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.506e-12
##
## Kappa : 0.6807
##
## Mcnemar's Test P-Value : 0.07756
##
## Sensitivity : 0.9266
## Specificity : 0.7353
## Pos Pred Value : 0.8487
## Neg Pred Value : 0.8621
## Prevalence : 0.6158
## Detection Rate : 0.5706
## Detection Prevalence : 0.6723
## Balanced Accuracy : 0.8309
##
## 'Positive' Class : 0
##
With SVM, we can see that the accuracy goes from an estimated 82.45% on the training dataset to 85.31% on the validation dataset.
# train a model and summarize model
set.seed(7)
finalModel.rf <- train(Survived~., data=train, method="rf",
                       metric=metric, trControl=trainControl)
print(finalModel.rf)
## Random Forest
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8235198 0.6123327
## 6 0.8253521 0.6245121
## 11 0.8136346 0.6029773
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.rf, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 102 18
## 1 7 50
##
## Accuracy : 0.8588
## 95% CI : (0.7986, 0.9065)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 9.459e-13
##
## Kappa : 0.6921
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9358
## Specificity : 0.7353
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8772
## Prevalence : 0.6158
## Detection Rate : 0.5763
## Detection Prevalence : 0.6780
## Balanced Accuracy : 0.8355
##
## 'Positive' Class : 0
##
We can see that the estimated accuracy on the training dataset is 82.54%, and the accuracy on the validation dataset was 85.88%.
Although the three algorithms show small differences in estimated accuracy on the training/testing data, they have similar accuracy when predicting on the validation dataset: 87.01% (C5.0), 85.31% (SVM), and 85.88% (RF). The Kappa values (inter-rater reliability) are also similar: 71.68%, 68.07%, and 69.21% respectively.
So we can use any of these models for predicting new data.
# Save the final model to disk
saveRDS(finalModel.c50, here("output","model","finalModel.c50.rds"))
saveRDS(finalModel.rf, here("output","model","finalModel.rf.rds"))
saveRDS(finalModel.svm,here("output","model","finalModel.svm.rds"))
First we need to read the submission (test) data and wrangle it.
test <- read.csv(here("data","test.csv"))
# convert Pclass, Sex and Embarked to factors
test <- test %>%
  mutate_at(.vars=c("Pclass","Sex","Embarked"), .funs=as.factor)
# Create Title from Name
test <- NameToTitle(test)
table(test$Sex, test$Title) # Check Title count again
##
## Master Miss Mr Mrs RareTitle
## female 0 79 0 72 1
## male 21 0 240 0 5
# Create Fsize
test$Fsize <- test$SibSp + test$Parch + 1
# Handling missing Age, Fare
test_Age_median <- median(test$Age,na.rm=T)
test_Fare_median <- median(test$Fare,na.rm=T)
test$Age[is.na(test$Age)] <- test_Age_median
test$Fare[is.na(test$Fare)] <- test_Fare_median
# load the model C50
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.c50.rds"))
print(superModel)
## C5.0
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8137259 0.5959738
## rules FALSE 10 0.8151148 0.5999225
## rules FALSE 20 0.8202530 0.6108416
## rules TRUE 1 0.8095201 0.5846711
## rules TRUE 10 0.8067358 0.5833702
## rules TRUE 20 0.8090832 0.5878150
## tree FALSE 1 0.8132694 0.5949868
## tree FALSE 10 0.8207029 0.6113957
## tree FALSE 20 0.8337963 0.6409762
## tree TRUE 1 0.8072053 0.5799236
## tree TRUE 10 0.8072444 0.5827135
## tree TRUE 20 0.8136998 0.5976517
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
# make predictions on "new data" using the final model
prediction.c50 <- predict(superModel, test)
summary(prediction.c50)
## 0 1
## 276 142
# load the model svm
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.svm.rds"))
print(superModel)
## Support Vector Machines with Radial Basis Function Kernel
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8132433 0.5863264
## 0.50 0.8197835 0.6017074
## 1.00 0.8244588 0.6114487
##
## Tuning parameter 'sigma' was held constant at a value of 0.1120186
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1120186 and C = 1.
# make predictions on "new data" using the final model
prediction.svm <- predict(superModel, test)
summary(prediction.svm)
## 0 1
## 275 143
# load the model rf
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.rf.rds"))
print(superModel)
## Random Forest
##
## 714 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8235198 0.6123327
## 6 0.8253521 0.6245121
## 11 0.8136346 0.6029773
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# make predictions on "new data" using the final model
prediction.rf <- predict(superModel, test)
summary(prediction.rf)
## 0 1
## 266 152
With the C5.0 algorithm, we predict 142 survivors and 276 non-survivors. We then generate the submission listing each passenger with their predicted survival status.
my_submission <- data.frame(PassengerId=test$PassengerId,
                            Survived=as.integer(as.character(prediction.c50)))
write.csv(my_submission,
          here("output","data","my_titanic_01.csv"),
          row.names=FALSE, quote=FALSE)