Case: from the Titanic passenger data, we want to predict whether a passenger survived or not. The data can be downloaded here: https://www.kaggle.com/c/titanic
titanic_train <- read.csv('train.csv')
titanic_test <- read.csv('test.csv')
titanic_gender <- read.csv('gender_submission.csv')
To make things easier, combine the test data and the gender data into one dataframe.
library(dplyr)
titanic_test <- cbind(titanic_test, Survived = titanic_gender$Survived)
# data train
colSums(is.na(titanic_train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are missing values in the Age column. Instead of deleting those rows, we replace the missing values with the mean age, so our model has more data to learn from.
titanic_train <- titanic_train %>% mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
colSums(is.na(titanic_train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
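The mean is sensitive to outliers; a hypothetical alternative sketch (not used above) is to impute with the median instead. It is written against a fresh copy here so the pipeline above is unchanged:
# hypothetical: median imputation on a fresh copy of the training data
titanic_train_med <- read.csv('train.csv') %>%
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age))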
# data test
colSums(is.na(titanic_test))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 0 0 0
titanic_test <- titanic_test %>% na.omit()
anyNA(titanic_test)
## [1] FALSE
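Note that na.omit() drops 87 of the 418 test rows (86 missing Age, 1 missing Fare, per the colSums output above). A hypothetical alternative sketch, again on a fresh copy so the steps above are unaffected, imputes instead of deleting:
# hypothetical: impute the test set the same way as the training set
# (the Survived column from gender_submission would still need to be cbind-ed on)
titanic_test_imp <- read.csv('test.csv') %>%
  mutate(Age  = if_else(is.na(Age),  mean(Age,  na.rm = TRUE), Age),
         Fare = if_else(is.na(Fare), mean(Fare, na.rm = TRUE), Fare))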
str(titanic_train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Here is some information about the columns in the data:
* PassengerId: ID number of the passenger
* Survived: whether the passenger survived or not
* Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
* Name: name of the passenger
* Sex: sex of the passenger
* Age: age of the passenger
* SibSp: number of siblings/spouses aboard; Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
* Parch: number of parents/children aboard; Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so Parch = 0 for them
* Ticket: ticket number
* Fare: passenger fare
* Cabin: cabin number
* Embarked: port of embarkation
titanic_train <- titanic_train %>% select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
titanic_test <- titanic_test %>% select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.6161616 0.3838384
The classes are not severely imbalanced, but we can make the proportions exactly equal with downsampling.
RNGkind(sample.kind = "Rounding")
set.seed(555)
library(caret)
titanic_train <- downSample(x = titanic_train %>% select(-Survived),
                            y = titanic_train$Survived,
                            yname = "Survived")
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.5 0.5
Now our data is balanced.
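Downsampling discards majority-class rows; caret also provides upSample(), which balances by replicating minority-class rows instead. A hypothetical alternative that would replace the downSample() call above:
# upSample() is the mirror image of downSample(): it replicates minority-class
# rows instead of dropping majority-class ones
titanic_up <- upSample(x = titanic_train %>% select(-Survived),
                       y = titanic_train$Survived,
                       yname = "Survived")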
We train the model with the naiveBayes() function from the e1071 library; laplace = 1 applies Laplace smoothing so that categories unseen within a class do not get zero probability.
library(e1071)
titanic_nb <- naiveBayes(Survived ~., titanic_train, laplace = 1)
titanic_nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## Pclass
## Y 1 2 3
## 0 0.1391304 0.1652174 0.6956522
## 1 0.3971014 0.2550725 0.3478261
##
## Sex
## Y female male
## 0 0.1627907 0.8372093
## 1 0.6802326 0.3197674
##
## Age
## Y [,1] [,2]
## 0 30.68284 12.76727
## 1 28.54978 13.77250
##
## SibSp
## Y 0 1 2 3 4 5
## 0 0.71060172 0.17478510 0.03151862 0.02578797 0.02865330 0.01432665
## 1 0.60458453 0.32378223 0.04011461 0.01432665 0.01146132 0.00286533
## SibSp
## Y 8
## 0 0.01432665
## 1 0.00286533
##
## Parch
## Y 0 1 2 3 4 5
## 0 0.787965616 0.100286533 0.077363897 0.005730659 0.011461318 0.011461318
## 1 0.670487106 0.189111748 0.117478510 0.011461318 0.002865330 0.005730659
## Parch
## Y 6
## 0 0.005730659
## 1 0.002865330
##
## Fare
## Y [,1] [,2]
## 0 21.34029 30.42235
## 1 48.39541 66.59700
##
## Embarked
## Y C Q S
## 0 0.002890173 0.127167630 0.080924855 0.789017341
## 1 0.008670520 0.271676301 0.089595376 0.630057803
prop.table(table(titanic_train$Survived, titanic_train$Sex), margin = 2)
##
## female male
## 0 0.1909722 0.7247475
## 1 0.8090278 0.2752525
From the class proportions above, about 81% of female passengers survived, while only about 27% of male passengers did. Female passengers were prioritized in the rescue.
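As a sanity check, this proportion can be nearly reproduced by hand from the model tables printed above, applying Bayes' theorem with Sex as the only evidence (the small gap comes from the Laplace smoothing in the model's conditional table):
# P(Survived = 1 | female) = P(female|1)P(1) / (P(female|1)P(1) + P(female|0)P(0))
p1 <- 0.6802326 * 0.5   # P(female | 1) * P(1), from the tables above
p0 <- 0.1627907 * 0.5   # P(female | 0) * P(0)
p1 / (p1 + p0)          # ~0.807, close to the 0.809 proportion above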
titanic_test$pred <- predict(titanic_nb, titanic_test, type = 'class')
library(caret)
confusionMatrix(titanic_test$pred, reference = titanic_test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 173 31
## 1 31 96
##
## Accuracy : 0.8127
## 95% CI : (0.7664, 0.8533)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 8.728e-15
##
## Kappa : 0.6039
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8480
## Specificity : 0.7559
## Pos Pred Value : 0.8480
## Neg Pred Value : 0.7559
## Prevalence : 0.6163
## Detection Rate : 0.5227
## Detection Prevalence : 0.6163
## Balanced Accuracy : 0.8020
##
## 'Positive' Class : 0
##
The ROC (Receiver Operating Characteristic) curve describes the relation between the True Positive Rate (TPR), also called Recall or Sensitivity, and the False Positive Rate (FPR), which equals 1 - Specificity. An ideal model has TPR well above FPR. We now build the inputs for the ROC, taking the positive class to be 1 (survived).
titanic_test$pred <- predict(titanic_nb, newdata = titanic_test, type = "raw")
titanic_test$actual <- ifelse(titanic_test$Survived == 1, 'Alive', 'Death')
Build the ROC curve:
library(ROCR)
# prediction object: column 2 of pred holds P(Survived = 1), and label.ordering
# makes 'Alive' the positive class, matching our stated positive class above
roc_pred <- prediction(predictions = titanic_test$pred[,2], labels = titanic_test$actual, label.ordering = c('Death', 'Alive'))
# ROC curve
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))
abline(0, 1, lty = 8)
After plotting the ROC curve, we compute the area under it, the AUC (Area Under the Curve).
titanic_auc <- performance(prediction.obj = roc_pred, measure = 'auc')
titanic_auc@y.values
## [[1]]
## [1] 0.8926972
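If the AUC is needed as a plain number (for example, to compare models programmatically), it can be pulled out of the list slot:
# the AUC as a single numeric value
titanic_auc@y.values[[1]]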
With an AUC of about 0.89, our Naive Bayes model is good enough at separating the positive (alive) and negative (dead) classes.
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.5 0.5
Next, we build a decision tree with the ctree() function from the partykit library.
library(partykit)
titanic_tree <- ctree(Survived ~., titanic_train)
titanic_tree
##
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 169, err = 4.7%)
## | | [4] Pclass in 3
## | | | [5] Fare <= 23.25: 1 (n = 99, err = 30.3%)
## | | | [6] Fare > 23.25: 0 (n = 20, err = 15.0%)
## | [7] Sex in male
## | | [8] Pclass in 1: 1 (n = 89, err = 49.4%)
## | | [9] Pclass in 2, 3
## | | | [10] Age <= 12
## | | | | [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## | | | | [12] SibSp in 3, 4, 5: 0 (n = 8, err = 12.5%)
## | | | [13] Age > 12: 0 (n = 282, err = 16.3%)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
For easier interpretation, visualize the tree:
plot(titanic_tree, type = 'simple')
From the tree structure above, we can see that the classification is based on Sex, Pclass, Fare, Age, and SibSp. Information about the tree:
Root node (the highest level of the tree): Sex. The rescuers considered gender first, so this node matters most for determining the target.
Interior nodes: Pclass, Fare, Age, SibSp. These branches are used when the root alone is not enough to determine the target.
Leaf nodes: [3], [5], [6], [8], [11], [12], [13]. These predict the target value; a worked example of routing one passenger to a leaf follows below.
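To make the leaf nodes concrete, here is a hypothetical passenger routed through the tree: a third-class female paying a low fare follows nodes [2] -> [4] -> [5] (Sex = female, Pclass = 3, Fare <= 23.25) and is predicted to survive.
# hypothetical passenger, built with the same factor levels as the training data
new_passenger <- data.frame(
  Pclass   = factor("3",      levels = levels(titanic_train$Pclass)),
  Sex      = factor("female", levels = levels(titanic_train$Sex)),
  Age      = 30,
  SibSp    = factor("0",      levels = levels(titanic_train$SibSp)),
  Parch    = factor("0",      levels = levels(titanic_train$Parch)),
  Fare     = 10,
  Embarked = factor("S",      levels = levels(titanic_train$Embarked))
)
predict(titanic_tree, newdata = new_passenger, type = "response")  # lands in leaf [5]: 1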
Model evaluation
# predict data test
titanic_test_pred <- predict(object = titanic_tree, newdata = titanic_test, type = "response")
# confusion matrix data test
confusionMatrix(data = titanic_test_pred, reference = titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 145 2
## 1 59 125
##
## Accuracy : 0.8157
## 95% CI : (0.7697, 0.856)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 3.183e-15
##
## Kappa : 0.6408
##
## Mcnemar's Test P-Value : 7.496e-13
##
## Sensitivity : 0.9843
## Specificity : 0.7108
## Pos Pred Value : 0.6793
## Neg Pred Value : 0.9864
## Prevalence : 0.3837
## Detection Rate : 0.3776
## Detection Prevalence : 0.5559
## Balanced Accuracy : 0.8475
##
## 'Positive' Class : 1
##
To speed up computation, we can remove non-informative variables.
# check non-informative var
nearZeroVar(titanic_train)
## integer(0)
There are no non-informative variables, so we keep them all and proceed to build a Random Forest model.
library(randomForest)
set.seed(572)
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,   # number of folds (k)
                     repeats = 3)  # number of repeats
titanic_forest <- train(Survived ~., data = titanic_train, method = 'rf', trControl = ctrl)
titanic_forest
## Random Forest
##
## 684 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 547, 548, 548, 547, 546, 547, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7767414 0.5535650
## 11 0.8080179 0.6160646
## 20 0.7894903 0.5790283
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
The optimal number of predictors randomly sampled at each split (mtry) is 11.
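caret searched only three mtry values above (2, 11, and 20). To skip the search and fit with a single fixed value, a one-row tuneGrid can be supplied; a hypothetical variant reusing the ctrl object:
# fit with mtry fixed at the value found above, skipping the grid search
titanic_forest_fixed <- train(Survived ~ ., data = titanic_train,
                              method = "rf", trControl = ctrl,
                              tuneGrid = data.frame(mtry = 11))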
varImp(titanic_forest)
## rf variable importance
##
## Overall
## Fare 100.00000
## Sexmale 99.58037
## Age 91.01716
## Pclass3 27.98087
## SibSp1 8.78775
## EmbarkedS 7.66161
## Parch1 6.52128
## EmbarkedC 5.68503
## Pclass2 4.98466
## Parch2 3.26094
## EmbarkedQ 2.38751
## SibSp2 1.97904
## SibSp3 1.97675
## SibSp4 1.81017
## SibSp8 1.17351
## SibSp5 0.71195
## Parch4 0.68498
## Parch5 0.65456
## Parch3 0.04813
## Parch6 0.00000
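The importance object also has a plot method if a visual ranking is easier to read:
# dot plot of the variable importances listed above
plot(varImp(titanic_forest))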
titanic_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 17.4%
## Confusion matrix:
## 0 1 class.error
## 0 283 59 0.1725146
## 1 60 282 0.1754386
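As an optional diagnostic, plotting the underlying randomForest object shows how the OOB error settles as trees are added:
# OOB and per-class error versus the number of trees
plot(titanic_forest$finalModel)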
titanic_forest_pred <- predict(titanic_forest, newdata = titanic_test, type = "raw")
confusionMatrix(data = titanic_forest_pred, reference = titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 161 18
## 1 43 109
##
## Accuracy : 0.8157
## 95% CI : (0.7697, 0.856)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 3.183e-15
##
## Kappa : 0.6243
##
## Mcnemar's Test P-Value : 0.00212
##
## Sensitivity : 0.8583
## Specificity : 0.7892
## Pos Pred Value : 0.7171
## Neg Pred Value : 0.8994
## Prevalence : 0.3837
## Detection Rate : 0.3293
## Detection Prevalence : 0.4592
## Balanced Accuracy : 0.8237
##
## 'Positive' Class : 1
##
The Naive Bayes model has an accuracy of 0.8127 and a sensitivity of 0.7559 for class 1 (its confusion matrix above used 0 as the positive class, so this figure is the specificity reported there).
The Decision Tree model has an accuracy of 0.8157 and a sensitivity of 0.9843.
The Random Forest model has an accuracy of 0.8157 and a sensitivity of 0.8583.
In this case, we focus on sensitivity: we want the model to predict the positive class correctly among passengers who actually survived. So the Decision Tree is the best model, since it has the highest sensitivity.
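As a compact recap, a sketch that simply re-types the figures from the confusion matrices above (the Naive Bayes sensitivity is the class-1 value noted above):
# side-by-side recap of the three models
data.frame(model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
           accuracy    = c(0.8127, 0.8157, 0.8157),
           sensitivity = c(0.7559, 0.9843, 0.8583))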