1. Problem Definition

In this exercise, I would like to practice Titanic survival prediction using the caret package.

Load packages
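A minimal sketch of the packages this analysis relies on (caret, dplyr, ggplot2 and here are all used below; tracerer is loaded later, where it is needed):

# Packages used throughout this analysis (sketch; based on the calls below)
library(caret)    # modeling and resampling framework
library(dplyr)    # data wrangling: %>%, mutate_at, select
library(ggplot2)  # visualization
library(here)     # project-relative file paths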

Load Data
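A sketch of the data load, matching the re-read shown in Section 3 (empty strings read as NA):

# Read the Kaggle Titanic training data
dataset <- read.csv(here("data","train.csv"), na.strings="")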

2. Summarize Data

Descriptive statistics

Dimension
dim(dataset) # rows and cols of dataset
## [1] 891  12
Data
# The first 10 rows
head(dataset,10)
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                   Name    Sex Age SibSp Parch
## 1                              Braund, Mr. Owen Harris   male  22     1     0
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                             Allen, Mr. William Henry   male  35     0     0
## 6                                     Moran, Mr. James   male  NA     0     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 8                       Palsson, Master. Gosta Leonard   male   2     3     1
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##              Ticket    Fare Cabin Embarked
## 1         A/5 21171  7.2500  <NA>        S
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250  <NA>        S
## 4            113803 53.1000  C123        S
## 5            373450  8.0500  <NA>        S
## 6            330877  8.4583  <NA>        Q
## 7             17463 51.8625   E46        S
## 8            349909 21.0750  <NA>        S
## 9            347742 11.1333  <NA>        S
## 10           237736 30.0708  <NA>        C
Attribute
# attributes of dataset
sapply(dataset,class)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "integer"   "integer"   "integer" "character" "character"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "integer"   "integer" "character"   "numeric" "character" "character"
Summary

We may not need PassengerId, Ticket, or Cabin, so we will remove them. We will also convert Survived, Pclass, Sex and Embarked into factors before viewing the summary.
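A sketch of that preprocessing step, consistent with the summary output below:

# Drop identifier-like columns and convert categorical columns to factors
dataset <- dataset %>%
  select(-c('PassengerId','Ticket','Cabin')) %>%
  mutate_at(.vars=c('Survived','Pclass','Sex','Embarked'), .funs=as.factor)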

summary(dataset)
##  Survived Pclass      Name               Sex           Age       
##  0:549    1:216   Length:891         female:314   Min.   : 0.42  
##  1:342    2:184   Class :character   male  :577   1st Qu.:20.12  
##           3:491   Mode  :character                Median :28.00  
##                                                   Mean   :29.70  
##                                                   3rd Qu.:38.00  
##                                                   Max.   :80.00  
##                                                   NA's   :177    
##      SibSp           Parch             Fare        Embarked  
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00   C   :168  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   Q   : 77  
##  Median :0.000   Median :0.0000   Median : 14.45   S   :644  
##  Mean   :0.523   Mean   :0.3816   Mean   : 32.20   NA's:  2  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00             
##  Max.   :8.000   Max.   :6.0000   Max.   :512.33             
## 

We can see we have 177 NA values for the Age attribute and 2 NA values for Embarked. This suggests we may need to remove the records with NA values (or impute them) for some analysis and modeling techniques.

Distribution and Correlation check

Class distribution

In a classification problem you must know the proportion of instances that belong to each class label. This is important because it may highlight an imbalance in the data that, if severe, may need to be addressed with rebalancing techniques. In a multi-class classification problem it may also expose a class with few or zero instances that may be a candidate for removal from the dataset.
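One way to produce the class-distribution table below (a sketch; the original chunk was not shown):

# Tabulate class counts and percentages for Survived
y <- dataset$Survived
cbind(freq=table(y), percentage=prop.table(table(y))*100)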

##   freq percentage
## 0  549   61.61616
## 1  342   38.38384

There is a 61.6% vs 38.4% split for the class values, which is imbalanced, but not so severely that we need to think about rebalancing, at least not yet.

Correlation check

Let's look at the correlations between the numeric attributes. We have to exclude rows with NA values (incomplete cases) when calculating the correlations.
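A sketch of that calculation:

# Pearson correlations among the numeric attributes, complete cases only
num_cols <- c("Age","SibSp","Parch","Fare")
cor(dataset[complete.cases(dataset[,num_cols]), num_cols])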

##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3073509 -0.1878965 0.09314252
## SibSp -0.30735094  1.0000000  0.3833375 0.13986049
## Parch -0.18789649  0.3833375  1.0000000 0.20662367
## Fare   0.09314252  0.1398605  0.2066237 1.00000000

There is no strong correlation among the numeric attributes; the largest is 0.38, between SibSp and Parch.

Data visualizations

Histogram
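A base-graphics sketch of the histograms (the original plot code was not shown):

# Histograms of the numeric attributes
num_cols <- c("Age","SibSp","Parch","Fare")
par(mfrow=c(2,2))
for (col in num_cols) {
  hist(dataset[[col]], main=col, xlab=col)
}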

Except for Age, we can see that the other distributions have an exponential shape. We may benefit from log transforms or other power transforms later on.

Density

Let’s use density plots to get a more smoothed look at the distributions.
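Again a base-graphics sketch, reusing num_cols from above:

# Density plots for a smoother view of each distribution
par(mfrow=c(2,2))
for (col in num_cols) {
  plot(density(dataset[[col]], na.rm=TRUE), main=col)
}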

These plots add more support to our initial ideas: we can see exponential-looking distributions, and the Age attribute does not look normally distributed. We could perform a Shapiro-Wilk or Kolmogorov-Smirnov test to confirm this.
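For example, a sketch of the Shapiro-Wilk check (not run in the original):

# Shapiro-Wilk normality test on the non-missing Age values;
# a small p-value rejects the hypothesis of normality
shapiro.test(dataset$Age[!is.na(dataset$Age)])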

Box-plot
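A sketch of the omitted plotting code:

# Box-and-whisker plots to expose skew and outliers
par(mfrow=c(2,2))
for (col in num_cols) {
  boxplot(dataset[[col]], main=col)
}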

This helps point out the skew in many distributions, so much so that many points fall beyond the whiskers of the plots and look like outliers.

Correlation Plot

Correlation Plots for numeric variables
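A sketch, assuming the corrplot package:

# Correlation plot of the numeric variables
library(corrplot)
corrplot(cor(dataset[complete.cases(dataset[,num_cols]), num_cols]))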

3. Prepare Data

We re-read the training dataset before wrangling, to restore the columns removed for the summary above.

dataset <- read.csv(here("data","train.csv"),na.strings="")

Feature Engineering

We will:

  1. Extract Title from Name
  2. Create a family size variable
Extract Title from Name

Create a function that extracts Title from Name; we will apply it to both the train and test datasets.

NameToTitle <- function(data){
  # Strip everything before ", " and after "." to leave only the title
  data$Title <- gsub('(.*, )|(\\..*)', '', data$Name)
  
  # Titles with low counts are combined into a "rare" title
  rare_title <- c('Capt','Col','Don','Dona','Dr','Jonkheer','Lady','Major','Rev','Sir','the Countess')
  # Reassign Mlle, Ms, and Mme accordingly
  data$Title[data$Title == 'Mlle']        <- 'Miss'
  data$Title[data$Title == 'Ms']          <- 'Miss'
  data$Title[data$Title == 'Mme']         <- 'Mrs'
  data$Title[data$Title %in% rare_title]  <- 'RareTitle'
  
  return(data)
}

Then create Title from Name for the train dataset.

dataset <- NameToTitle(dataset)
# Check Title count by Sex
table(dataset$Sex,dataset$Title)
##         
##          Master Miss  Mr Mrs RareTitle
##   female      0  185   0 126         3
##   male       40    0 517   0        20
Create Family Size variable
dataset$Fsize <- dataset$SibSp + dataset$Parch + 1

# Checking the family size and survival
ggplot(dataset, aes(x = Fsize, fill = as.factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  xlab("Family members") + scale_fill_discrete(name = "Survived") + ggtitle("Survivors by Number of Family members")

Data Wrangling

For data wrangling, we need to:

  1. Change columns to the correct data types: Survived, Pclass, Sex and Embarked need to be factors
  2. Drop columns that are not useful for prediction
  3. Remove collinear features: Sex vs Title, Fsize vs SibSp+Parch
dataset <- dataset %>%
  mutate_at(.vars=c('Survived','Pclass','Embarked'),.funs=as.factor) %>%
  select(-c('Sex','SibSp','Parch')) %>% # remove collinear columns (information kept in Title and Fsize)
  select(-c('PassengerId','Name','Ticket','Cabin')) # remove columns not useful for prediction
  
head(dataset)
##   Survived Pclass Age    Fare Embarked Title Fsize
## 1        0      3  22  7.2500        S    Mr     2
## 2        1      1  38 71.2833        C   Mrs     2
## 3        1      3  26  7.9250        S  Miss     1
## 4        1      1  35 53.1000        S   Mrs     2
## 5        0      3  35  8.0500        S    Mr     1
## 6        0      3  NA  8.4583        Q    Mr     1

Handling Missing Values

Check missing values
# Check for missing value
colSums(is.na(dataset))
## Survived   Pclass      Age     Fare Embarked    Title    Fsize 
##        0        0      177        0        2        0        0

We found missing values in Age and Embarked; Fare is complete in the training set but has a missing value in the test set, handled later. The strategy for treating missing values is as follows:

  • For Embarked, since it is a factor, the 2 missing values can be replaced with its mode.

  • For Age, we can replace missing values with either the mean or the median: the mean if there are no outliers, otherwise the median.

  • The same applies to any missing values in Fare.

Replace with Mode
# Check for Embarked mode
library(tracerer) # for calc_mode function
Embarked_mode <- calc_mode(dataset$Embarked) # checking Embarked Mode
Embarked_mode
## [1] S
## Levels: C Q S

We can see that most passengers embarked the Titanic from Southampton (S), so we replace the missing values with 'S', the mode of Embarked.

dataset$Embarked[is.na(dataset$Embarked)] <- "S"
levels(dataset$Embarked)
## [1] "C" "Q" "S"
Replace with Mean/Median

Since the box plots showed outliers in the Age feature, we replace the missing values with the median rather than the mean.

Age_median <- median(dataset$Age, na.rm=T) # get the Median of Age
Fare_median <- median(dataset$Fare, na.rm=T) # get the Median of Fare

dataset$Age[is.na(dataset$Age)] <- Age_median
dataset$Fare[is.na(dataset$Fare)] <- Fare_median

# check again if any missing values
dataset %>% anyNA()
## [1] FALSE

Missing values in Age (and Fare, if any) were replaced with the median. There are no NAs left in our dataset.

Data transformation

Finalize data for modeling

We will finalize the dataset by splitting it into a validation set and a training/testing set.

# create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
validationIndex <- createDataPartition(dataset$Survived, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validationIndex,]
# use the remaining 80% of data to training and testing the models
train <- dataset[validationIndex,]

4. Modeling: Baseline

List of Algorithms

We don't know beforehand which algorithms will perform well on this data. We have to spot-check various methods, see what looks good, and then double down on those methods.

  • Linear Algorithms:
    • Logistic Regression (LG)

    • Linear Discriminant Analysis (LDA)

    • Regularized Logistic Regression (GLMNET)

  • Non-Linear Algorithms:
    • k-Nearest Neighbors (KNN)

    • Classification and Regression Trees (CART)

    • Naive Bayes (NB)

    • Support Vector Machines with Radial Basis Functions (SVM)

Test options and evaluation metric

We have a good amount of data so we will use 10-fold cross validation with 3 repeats. This is a good standard test harness configuration. It is a binary classification problem. For simplicity, we will use Accuracy and Kappa metrics. We could have gone with the Area Under ROC Curve (AUC) and looked at the sensitivity and specificity to select the best algorithms.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"

Build models

# LG - Logistic Regression
set.seed(7)
fit.glm <- train(Survived~., data=train, method="glm",
                 metric=metric,trControl=trainControl)
# LDA - Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(Survived~., data=train, method="lda",
                 metric=metric,trControl=trainControl)

# GLMNET - Regularized Logistic Regression
set.seed(7)
fit.glmnet <- train(Survived~., data=train, method="glmnet",
                 metric=metric,trControl=trainControl)

# KNN - k-Nearest Neighbors 
set.seed(7)
fit.knn <- train(Survived~., data=train, method="knn",
                 metric=metric,trControl=trainControl)

# CART - Classification and Regression Trees (CART), 
set.seed(7)
fit.cart <- train(Survived~., data=train, method="rpart",
                 metric=metric,trControl=trainControl)

# NB - Naive Bayes (NB) 
set.seed(7)
Grid = expand.grid(usekernel=TRUE,adjust=1,fL=c(0.2,0.5,0.8))
fit.nb <- train(Survived~., data=train, method="nb",
                 metric=metric,trControl=trainControl,
                tuneGrid=Grid)

# SVM - Support Vector Machines with Radial Basis Functions (SVM).
set.seed(7)
fit.svm <- train(Survived~., data=train, method="svmRadial",
                 metric=metric,trControl=trainControl)

Results
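The comparison code for the baseline models is a sketch mirroring the ensemble section below; the output's Call confirms the object name results:

# Collect resampling results from the seven baseline models and compare
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet,
                          KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)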

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LG, LDA, GLMNET, KNN, CART, NB, SVM 
## Number of resamples: 30 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LG     0.7464789 0.7805164 0.8181729 0.8174426 0.8466843 0.8888889    0
## LDA    0.7323944 0.7887324 0.8194444 0.8160146 0.8450704 0.9027778    0
## GLMNET 0.7323944 0.7805164 0.8194444 0.8169666 0.8333333 0.9027778    0
## KNN    0.5915493 0.6944444 0.7202660 0.7260563 0.7605634 0.8169014    0
## CART   0.7323944 0.7777778 0.8028169 0.8029734 0.8421362 0.8888889    0
## NB     0.6619718 0.7333236 0.7500000 0.7586724 0.7909331 0.8309859    0
## SVM    0.7042254 0.8028169 0.8309859 0.8244588 0.8591549 0.9014085    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LG     0.4379947 0.5285212 0.6159694 0.6086985 0.6803949 0.7692308    0
## LDA    0.3931624 0.5348390 0.6125828 0.6036667 0.6686419 0.7967742    0
## GLMNET 0.4197849 0.5302207 0.6151149 0.6055088 0.6473665 0.7967742    0
## KNN    0.1012658 0.3571429 0.4053942 0.4093096 0.5009117 0.5971192    0
## CART   0.4197849 0.5200000 0.5816498 0.5773519 0.6626215 0.7692308    0
## NB     0.1630648 0.3568799 0.4087276 0.4272276 0.5122415 0.6196429    0
## SVM    0.3292848 0.5628848 0.6253298 0.6114487 0.6957238 0.7923109    0

The highest mean accuracy comes from SVM, at 82.45%.

5. Modeling: Ensembles

Ensembles

Let's look at some boosting and bagging ensemble algorithms on the dataset. We will evaluate 4 ensemble methods:

  • Bagging: Bagged CART (BAG) and Random Forest (RF).

  • Boosting: Stochastic Gradient Boosting (GBM) and C5.0 (C50).

Build models

# Bagged CART
set.seed(7)
fit.treebag <- train(Survived~., data=train, method="treebag", metric=metric,trControl=trainControl)
# RF
set.seed(7)
fit.rf <- train(Survived~., data=train, method="rf", metric=metric,trControl=trainControl)
# GBM - Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(Survived~., data=train, method="gbm",metric=metric,trControl=trainControl, verbose=FALSE)
# C5.0
set.seed(7)
fit.c50 <- train(Survived~., data=train, method="C5.0", metric=metric,trControl=trainControl)

Results

# Compare results
ensembleResults <- resamples(list(BAG=fit.treebag,RF=fit.rf,GBM=fit.gbm,C50=fit.c50))
summary(ensembleResults)
## 
## Call:
## summary.resamples(object = ensembleResults)
## 
## Models: BAG, RF, GBM, C50 
## Number of resamples: 30 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## BAG 0.7323944 0.7887324 0.8169014 0.8113524 0.8333333 0.9014085    0
## RF  0.7323944 0.7944542 0.8181729 0.8253521 0.8561718 0.9305556    0
## GBM 0.7605634 0.7887324 0.8169014 0.8216028 0.8333333 0.9154930    0
## C50 0.7464789 0.8028169 0.8392019 0.8337963 0.8606221 0.9014085    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## BAG 0.4294643 0.5318712 0.6080575 0.5968252 0.6470130 0.7893175    0
## RF  0.4362725 0.5669682 0.6115121 0.6245121 0.6917479 0.8529412    0
## GBM 0.4570400 0.5368370 0.6058716 0.6143691 0.6508810 0.8207071    0
## C50 0.4462738 0.5693241 0.6516580 0.6409762 0.7030321 0.7893175    0
dotplot(ensembleResults)

Interestingly, C5.0 is now the algorithm with the highest mean accuracy (83.38%), followed by RF (82.54%) and SVM (82.45%). We will select these three as our final models for prediction on the validation dataset.

6. Finalize Model

The three algorithms with the highest accuracy will be selected for prediction: C5.0, SVM and RF.

Prediction on validation dataset

C5.0
# train a model and summarize model
set.seed(7)
finalModel.c50 <- train(Survived~., data=train, method="C5.0",
                 metric=metric,trControl=trainControl)
print(finalModel.c50)
## C5.0 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8137259  0.5959738
##   rules  FALSE   10      0.8151148  0.5999225
##   rules  FALSE   20      0.8202530  0.6108416
##   rules   TRUE    1      0.8095201  0.5846711
##   rules   TRUE   10      0.8067358  0.5833702
##   rules   TRUE   20      0.8090832  0.5878150
##   tree   FALSE    1      0.8132694  0.5949868
##   tree   FALSE   10      0.8207029  0.6113957
##   tree   FALSE   20      0.8337963  0.6409762
##   tree    TRUE    1      0.8072053  0.5799236
##   tree    TRUE   10      0.8072444  0.5827135
##   tree    TRUE   20      0.8136998  0.5976517
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.c50, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103  17
##          1   6  51
##                                           
##                Accuracy : 0.8701          
##                  95% CI : (0.8114, 0.9158)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 5.999e-14       
##                                           
##                   Kappa : 0.7168          
##                                           
##  Mcnemar's Test P-Value : 0.03706         
##                                           
##             Sensitivity : 0.9450          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8583          
##          Neg Pred Value : 0.8947          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5819          
##    Detection Prevalence : 0.6780          
##       Balanced Accuracy : 0.8475          
##                                           
##        'Positive' Class : 0               
## 

We can see that the estimated accuracy on the training dataset was 83.38%. Applying the final model to the validation dataset, the accuracy was 87.01%. This is a good result on unseen data.

SVM
# train a model and summarize model
set.seed(7)
finalModel.svm <- train(Survived~., data=train, method="svmRadial",
                 metric=metric,trControl=trainControl)
print(finalModel.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8132433  0.5863264
##   0.50  0.8197835  0.6017074
##   1.00  0.8244588  0.6114487
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1120186
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1120186 and C = 1.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.svm, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 101  18
##          1   8  50
##                                           
##                Accuracy : 0.8531          
##                  95% CI : (0.7922, 0.9017)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.506e-12       
##                                           
##                   Kappa : 0.6807          
##                                           
##  Mcnemar's Test P-Value : 0.07756         
##                                           
##             Sensitivity : 0.9266          
##             Specificity : 0.7353          
##          Pos Pred Value : 0.8487          
##          Neg Pred Value : 0.8621          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5706          
##    Detection Prevalence : 0.6723          
##       Balanced Accuracy : 0.8309          
##                                           
##        'Positive' Class : 0               
## 

With SVM, we can see that the accuracy changes from 82.45% on the training dataset to 85.31% on the validation dataset.

RF
# train a model and summarize model
set.seed(7)
finalModel.rf <- train(Survived~., data=train, method="rf",
                metric=metric,trControl=trainControl)
print(finalModel.rf)
## Random Forest 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8235198  0.6123327
##    6    0.8253521  0.6245121
##   11    0.8136346  0.6029773
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# estimate on validation dataset
set.seed(7)
predictions <- predict(finalModel.rf, newdata = validation)
confusionMatrix(predictions,validation$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 102  18
##          1   7  50
##                                           
##                Accuracy : 0.8588          
##                  95% CI : (0.7986, 0.9065)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 9.459e-13       
##                                           
##                   Kappa : 0.6921          
##                                           
##  Mcnemar's Test P-Value : 0.0455          
##                                           
##             Sensitivity : 0.9358          
##             Specificity : 0.7353          
##          Pos Pred Value : 0.8500          
##          Neg Pred Value : 0.8772          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5763          
##    Detection Prevalence : 0.6780          
##       Balanced Accuracy : 0.8355          
##                                           
##        'Positive' Class : 0               
## 

We can see that the estimated accuracy on the train dataset is 82.54%, and the prediction accuracy on the validation dataset was 85.88%.

Summary

Although the three algorithms differ slightly in estimated accuracy on the training/testing dataset, they all achieve similarly high accuracy when predicting the validation dataset: 87.01% (C5.0), 85.88% (RF) and 85.31% (SVM). The Kappa values (inter-rater reliability) are also similar: 71.68%, 69.21% and 68.07% respectively.

So we can use any of these models to predict new data.

Save the final models

# Save the final model to disk
saveRDS(finalModel.c50, here("output","model","finalModel.c50.rds"))
saveRDS(finalModel.rf, here("output","model","finalModel.rf.rds"))
saveRDS(finalModel.svm,here("output","model","finalModel.svm.rds"))

Apply final model for new data

Read & Clean

First we need to read the submission data and wrangle it, applying the same feature engineering and imputation steps used for the training data.

test <- read.csv(here("data","test.csv"))
# convert Pclass, Sex and Embarked to factors
test <- test %>%
  mutate_at(.vars=c("Pclass","Sex","Embarked"), .funs=as.factor)
# Create Title from Name
test <- NameToTitle(test) 
table(test$Sex, test$Title) # Check Title count again
##         
##          Master Miss  Mr Mrs RareTitle
##   female      0   79   0  72         1
##   male       21    0 240   0         5
# Create Fsize
test$Fsize <- test$SibSp + test$Parch + 1
# Handling missing Age, Fare
test_Age_median <- median(test$Age, na.rm=T)
test_Fare_median <- median(test$Fare, na.rm=T)

test$Age[is.na(test$Age)] <- test_Age_median
test$Fare[is.na(test$Fare)] <- test_Fare_median
Prediction by C5.0
# load the model C50
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.c50.rds"))
print(superModel)
## C5.0 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8137259  0.5959738
##   rules  FALSE   10      0.8151148  0.5999225
##   rules  FALSE   20      0.8202530  0.6108416
##   rules   TRUE    1      0.8095201  0.5846711
##   rules   TRUE   10      0.8067358  0.5833702
##   rules   TRUE   20      0.8090832  0.5878150
##   tree   FALSE    1      0.8132694  0.5949868
##   tree   FALSE   10      0.8207029  0.6113957
##   tree   FALSE   20      0.8337963  0.6409762
##   tree    TRUE    1      0.8072053  0.5799236
##   tree    TRUE   10      0.8072444  0.5827135
##   tree    TRUE   20      0.8136998  0.5976517
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.
# make a predictions on "new data" using the final model
prediction.c50 <- predict(superModel, test)
summary(prediction.c50)
##   0   1 
## 276 142
Prediction by SVM
# load the model svm
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.svm.rds"))
print(superModel)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8132433  0.5863264
##   0.50  0.8197835  0.6017074
##   1.00  0.8244588  0.6114487
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1120186
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1120186 and C = 1.
# make a predictions on "new data" using the final model
prediction.svm <- predict(superModel, test)
summary(prediction.svm)
##   0   1 
## 275 143
Prediction by RF
# load the model rf
set.seed(7)
superModel <- readRDS(here("output","model","finalModel.rf.rds"))
print(superModel)
## Random Forest 
## 
## 714 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 643, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8235198  0.6123327
##    6    0.8253521  0.6245121
##   11    0.8136346  0.6029773
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# make a predictions on "new data" using the final model
prediction.rf <- predict(superModel, test)
summary(prediction.rf)
##   0   1 
## 266 152

Submission data with C5.0 Prediction

With the C5.0 algorithm, we predict 142 passengers survived and 276 did not. We then generate the list of passengers with their predicted survival status.

my_submission <- data.frame(PassengerId=test$PassengerId,
                    Survived=as.integer(as.character(prediction.c50)))
write.csv(my_submission,
          here("output","data","my_titanic_01.csv"),
          row.names=FALSE, quote = FALSE)
