Case : From data titanic passenger, we want to know does passenger still alive or not. Data can be downloaded here https://www.kaggle.com/c/titanic

Import Data

titanic_train <- read.csv('train.csv')
titanic_test <- read.csv('test.csv')
titanic_gender <- read.csv('gender_submission.csv')

Data Wrangling

To make easier, combine data test and gender into 1 dataframe

library(dplyr)
titanic_test <- cbind(titanic_test, Survived = titanic_gender$Survived)
  • Check missing value
# data train
colSums(is.na(titanic_train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

There’s missing in Age column. Instead of delete it, we change the value with mean of age because we want our model learn more.

titanic_train <- titanic_train %>% mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))

colSums(is.na(titanic_train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
# data test
colSums(is.na(titanic_test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0
titanic_test <- titanic_test %>% na.omit()
anyNA(titanic_test)
## [1] FALSE
  • Check structure of data
str(titanic_train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Here are some information about columns in wholesale data * PassengerId: Id number of passenger * Survived : Passenger Survival or Not * Pclass : A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower * Name : Name of Passenger * Sex : Sex of Passenger * Age : Age of Passenger * SibSp : Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored) * Parch : Parent = mother, father; Child = daughter, son, stepdaughter, stepson; Some children travelled only with a nanny, therefore parch=0 for them. * Ticket : Ticket number * Fare : Passenger fare * Cabin : Cabin number * Embarked : Port of Embarkation

  • Change column into right data type and delete some column
titanic_train <- titanic_train %>% select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         SibSp = as.factor(SibSp),
         Parch = as.factor(Parch),
         Embarked = as.factor(Embarked))

titanic_test <- titanic_test %>% select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         SibSp = as.factor(SibSp),
         Parch = as.factor(Parch),
         Embarked = as.factor(Embarked))

Data Pre-processing

  • Check class proportion
prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

Our data still balance but we can make balance proportion with downsampling

RNGkind(sample.kind = "Rounding")
set.seed(555)
library(caret)
titanic_train <- downSample(x = titanic_train %>% 
                           select(-Survived),
                          y = titanic_train$Survived,
                         yname = "Survived") 
prop.table(table(titanic_train$Survived))
## 
##   0   1 
## 0.5 0.5

Now our data is balance.

Naive Bayes Model

  • Build model with naiveBayes() function from library e1071
library(e1071)
titanic_nb <- naiveBayes(Survived ~., titanic_train, laplace = 1)
titanic_nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1391304 0.1652174 0.6956522
##   1 0.3971014 0.2550725 0.3478261
## 
##    Sex
## Y      female      male
##   0 0.1627907 0.8372093
##   1 0.6802326 0.3197674
## 
##    Age
## Y       [,1]     [,2]
##   0 30.68284 12.76727
##   1 28.54978 13.77250
## 
##    SibSp
## Y            0          1          2          3          4          5
##   0 0.71060172 0.17478510 0.03151862 0.02578797 0.02865330 0.01432665
##   1 0.60458453 0.32378223 0.04011461 0.01432665 0.01146132 0.00286533
##    SibSp
## Y            8
##   0 0.01432665
##   1 0.00286533
## 
##    Parch
## Y             0           1           2           3           4           5
##   0 0.787965616 0.100286533 0.077363897 0.005730659 0.011461318 0.011461318
##   1 0.670487106 0.189111748 0.117478510 0.011461318 0.002865330 0.005730659
##    Parch
## Y             6
##   0 0.005730659
##   1 0.002865330
## 
##    Fare
## Y       [,1]     [,2]
##   0 21.34029 30.42235
##   1 48.39541 66.59700
## 
##    Embarked
## Y                         C           Q           S
##   0 0.002890173 0.127167630 0.080924855 0.789017341
##   1 0.008670520 0.271676301 0.089595376 0.630057803
prop.table(table(titanic_train$Survived, titanic_train$Sex), margin = 2)
##    
##        female      male
##   0 0.1909722 0.7247475
##   1 0.8090278 0.2752525

From the class proportion above, female passenger that survive is 81% meanwhile male passenger 27%. Female passenger prioritized.

  • Model evaluation
titanic_test$pred <- predict(titanic_nb, titanic_test, type = 'class')
  • Evaluation with confusionMatrix from library caret
library(caret)
confusionMatrix(titanic_test$pred, reference = titanic_test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 173  31
##          1  31  96
##                                           
##                Accuracy : 0.8127          
##                  95% CI : (0.7664, 0.8533)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : 8.728e-15       
##                                           
##                   Kappa : 0.6039          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8480          
##             Specificity : 0.7559          
##          Pos Pred Value : 0.8480          
##          Neg Pred Value : 0.7559          
##              Prevalence : 0.6163          
##          Detection Rate : 0.5227          
##    Detection Prevalence : 0.6163          
##       Balanced Accuracy : 0.8020          
##                                           
##        'Positive' Class : 0               
## 
  • ROC model naiveBayes

ROC (Receiver Operating Curve) describe relation between True Positif Rate(TPR) or Recall/Sensitivity and False Positif Rate(FPR) or (1-specificity). Ideal model has TPR > FPR Build dataframe ROC, assume positive class is 1(survived).

titanic_test$pred <- predict(titanic_nb, newdata = titanic_test, type = "raw")
titanic_test$actual <- ifelse(titanic_test$Survived == 1, 'Alive', 'Death')

Build ROC

library(ROCR)

# objek prediction
roc_pred <- prediction(predictions = titanic_test$pred[,1], labels = titanic_test$actual)

# ROC curve
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))
abline(0,1,lty = 8)

  • AUC model naivebayes

After we search for ROC, next we search value under ROC plot with AUC(Area Under Curve).

titanic_auc <- performance(prediction.obj = roc_pred, measure = 'auc')
titanic_auc@y.values
## [[1]]
## [1] 0.8926972

our Naive Bayes model is good enough to separate positive(alive) and negative(dead) class.

Decision Tree

  • Check class proportion
prop.table(table(titanic_train$Survived))
## 
##   0   1 
## 0.5 0.5
  • Build model with ctree() from library partykit
library(partykit)
titanic_tree <- ctree(Survived ~., titanic_train)
titanic_tree
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 169, err = 4.7%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Fare <= 23.25: 1 (n = 99, err = 30.3%)
## |   |   |   [6] Fare > 23.25: 0 (n = 20, err = 15.0%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 1 (n = 89, err = 49.4%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 12
## |   |   |   |   [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## |   |   |   |   [12] SibSp in 3, 4, 5: 0 (n = 8, err = 12.5%)
## |   |   |   [13] Age > 12: 0 (n = 282, err = 16.3%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

For better information, visualize it

plot(titanic_tree, type = 'simple')

From tree structure above, we can see classification is based on sex, Pclass, Fare, and Age. Information of tree :

  • Root node (highest level of tree) : Sex, means that gender rescuers consider gender first. This node is very important to determine target.

  • Interior node : Pclass, Fare, Age, SibSp. Second branch used if root is not enough to determine target.

  • Leaf node : [3], [5], [6], [8], [11], [12], [13]. Predict target value.

  • Model evaluation

# predict data test
titanic_test_pred <- predict(object = titanic_tree, newdata = titanic_test, type = "response")

# confusion matrix data test
confusionMatrix(data = titanic_test_pred, reference = titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 145   2
##          1  59 125
##                                          
##                Accuracy : 0.8157         
##                  95% CI : (0.7697, 0.856)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 3.183e-15      
##                                          
##                   Kappa : 0.6408         
##                                          
##  Mcnemar's Test P-Value : 7.496e-13      
##                                          
##             Sensitivity : 0.9843         
##             Specificity : 0.7108         
##          Pos Pred Value : 0.6793         
##          Neg Pred Value : 0.9864         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3776         
##    Detection Prevalence : 0.5559         
##       Balanced Accuracy : 0.8475         
##                                          
##        'Positive' Class : 1              
## 

Random Forest

For speed up computation, we can delete non-informative variable.

# check non-informative var
nearZeroVar(titanic_train)
## integer(0)

There’s no non-informative so we can use Random Forest for building model.

  • Build model with k-fold validation
library(randomForest)
set.seed(572)
ctrl <- trainControl(method = "repeatedcv",
                    number = 5, # k-fold
                     repeats = 3) # repeat
titanic_forest <- train(Survived ~., data = titanic_train, method = 'rf', trControl = ctrl)
titanic_forest
## Random Forest 
## 
## 684 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 547, 548, 548, 547, 546, 547, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7767414  0.5535650
##   11    0.8080179  0.6160646
##   20    0.7894903  0.5790283
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

Optimum k splitting at each tree node is 11.

  • Check the important variable using varImp()
varImp(titanic_forest)
## rf variable importance
## 
##             Overall
## Fare      100.00000
## Sexmale    99.58037
## Age        91.01716
## Pclass3    27.98087
## SibSp1      8.78775
## EmbarkedS   7.66161
## Parch1      6.52128
## EmbarkedC   5.68503
## Pclass2     4.98466
## Parch2      3.26094
## EmbarkedQ   2.38751
## SibSp2      1.97904
## SibSp3      1.97675
## SibSp4      1.81017
## SibSp8      1.17351
## SibSp5      0.71195
## Parch4      0.68498
## Parch5      0.65456
## Parch3      0.04813
## Parch6      0.00000
titanic_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 17.4%
## Confusion matrix:
##     0   1 class.error
## 0 283  59   0.1725146
## 1  60 282   0.1754386
  • Predict random forest model
titanic_forest_pred <- predict(titanic_forest, newdata = titanic_test, type = "raw")
  • Model evaluation
confusionMatrix(data = titanic_forest_pred, reference = titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 161  18
##          1  43 109
##                                          
##                Accuracy : 0.8157         
##                  95% CI : (0.7697, 0.856)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 3.183e-15      
##                                          
##                   Kappa : 0.6243         
##                                          
##  Mcnemar's Test P-Value : 0.00212        
##                                          
##             Sensitivity : 0.8583         
##             Specificity : 0.7892         
##          Pos Pred Value : 0.7171         
##          Neg Pred Value : 0.8994         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3293         
##    Detection Prevalence : 0.4592         
##       Balanced Accuracy : 0.8237         
##                                          
##        'Positive' Class : 1              
## 

Conclusion

Naive Bayes model has

  • 81% accuracy
  • 84% sensitivity
  • 75% specificity
  • 84% precision

Decision Tree model has

  • 81% accuracy
  • 98% sensitivity
  • 71% specificity
  • 67% precision

Random Forest model has

  • 81% accuracy
  • 85% sensitivity
  • 78% specificity
  • 71% precision

In this case, we focused on Sensitivity. We want model can predict positive class from the actual is positive. So, Decision Tree is the best model which has highest sensitivity.