Machine Learning: Classification II

About Project

RMS Titanic was a British passenger liner, operated by the White Star Line, which sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, to New York City, United States. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it the deadliest sinking of a single ship up to that time. It remains the deadliest peacetime sinking of a superliner or cruise ship. The disaster drew public attention, provided foundational material for the disaster film genre, and has inspired many artistic works.

RMS Titanic was the largest ship afloat at the time she entered service and the second of three Olympic-class ocean liners operated by the White Star Line. She was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, the chief naval architect of the shipyard, died in the disaster. Titanic was under the command of Captain Edward Smith, who went down with the ship. The ocean liner carried some of the wealthiest people in the world, as well as hundreds of emigrants from Great Britain and Ireland, Scandinavia, and elsewhere throughout Europe, who were seeking a new life in the United States and Canada.

The first-class accommodation was designed to be the pinnacle of comfort and luxury, with a gymnasium, swimming pool, libraries, high-class restaurants, and opulent cabins. A high-powered radiotelegraph transmitter was available for sending passenger “marconigrams” and for the ship’s operational use. The Titanic had advanced safety features, such as watertight compartments and remotely activated watertight doors, contributing to its reputation as “unsinkable”.

Titanic was equipped with 16 lifeboat davits, each capable of lowering three lifeboats, for a total of 48 boats; she carried only 20 lifeboats, four of which were collapsible and proved hard to launch while she was sinking. Together, the 20 lifeboats could hold 1,178 people, about half the number of passengers on board and one third of the number of passengers the ship could have carried at full capacity (consistent with the maritime safety regulations of the era). When the ship sank, many of the lifeboats that had been lowered were only about half full. (Source: Wikipedia)

Data Description

The Titanic dataset contains the following information about each passenger.

  • Survived: Survival, 0 = No, 1 = Yes
  • Pclass: Ticket Class, 1 = 1st/Upper, 2 = 2nd/Middle, 3 = 3rd/Lower
  • Name: Passenger Name
  • Sex: Passenger Sex
  • Age: Passenger age in years
  • SibSp: Number of siblings/spouses aboard
  • Parch: Number of parents/children aboard
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

Import Libraries

library(dplyr)
library(caret)
library(rsample)
library(e1071)
library(randomForest)
library(partykit)
library(imputeTS)

Import Dataset

titanic_train <- read.csv("train.csv")
titanic_test <- read.csv("test.csv")

Data Wrangling and Cleaning

After importing the dataset, we must clean the data.

Check Data

glimpse(titanic_train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
glimpse(titanic_test)
## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", "femal…
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
## $ Cabin       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…

Remove Unused Columns

Drop the columns that are identifiers or free text with too many distinct values to be useful as predictors: PassengerId, Name, Ticket, and Cabin.

titanic_train <- titanic_train %>% 
  select(-c("PassengerId", "Name", "Ticket", "Cabin"))

titanic_test <- titanic_test %>% 
  select(-c("PassengerId", "Name", "Ticket", "Cabin"))

Convert Data Type

titanic_train <- titanic_train %>% 
  mutate_at(.vars=c("Survived", "Pclass", "Sex", "Embarked"), .funs=as.factor)

glimpse(titanic_train)
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex      <fct> male, female, female, female, male, male, male, male, female,…
## $ Age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
titanic_test <- titanic_test %>% 
  mutate_at(.vars=c("Pclass", "Sex", "Embarked"), .funs=as.factor)

glimpse(titanic_test)
## Rows: 418
## Columns: 7
## $ Pclass   <fct> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3, 1…
## $ Sex      <fct> male, female, male, male, female, male, female, male, female,…
## $ Age      <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0, N…
## $ SibSp    <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1…
## $ Parch    <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fare     <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 29.0…
## $ Embarked <fct> Q, S, Q, S, S, S, Q, S, C, S, S, S, S, S, S, C, Q, C, S, C, C…

Check Missing Values

If there are missing values, we must fill them in before modelling.

colSums(is.na(titanic_train))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        0
colSums(is.na(titanic_test))
##   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0       86        0        0        1        0
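
Note that is.na() only counts true NA values. Blank strings, like the "" entries visible in Cabin in the glimpse() output above, are not counted. A minimal sketch of how blanks in a remaining column such as Embarked could be counted, using a small made-up data frame (`toy` and its values are illustrative, not taken from train.csv):

```r
# Made-up example data; the values are illustrative only
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Embarked = c("S", "", "C", ""),
  stringsAsFactors = FALSE
)

# is.na() sees only the real NA in Age
na_counts <- colSums(is.na(toy))

# Count empty strings per column (NA entries are treated as "not blank")
blank_counts <- sapply(toy, function(col) {
  sum(!is.na(col) & as.character(col) == "")
})

# Optionally recode blanks as NA so they show up in the NA check
toy$Embarked[toy$Embarked == ""] <- NA
na_after <- colSums(is.na(toy))

na_counts
blank_counts
na_after
```

If blanks are left in place, as.factor() keeps "" as a separate level, which may or may not be what is wanted.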

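Replace Missing Values with the Mean

Most of the missing values are in the Age column, so we fill them with the mean Age of the training data. The same training means are reused for the test data.

# Fill only the Age column; no other training column has missing values
titanic_train$Age <- titanic_train$Age %>% 
  na_replace(fill = mean(titanic_train$Age, na.rm = TRUE))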

colSums(is.na(titanic_train))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
titanic_test$Age  <- titanic_test$Age %>% 
  na_replace(fill = mean(unlist(titanic_train$Age), na.rm=TRUE))

titanic_test$Fare  <- titanic_test$Fare %>% 
  na_replace(fill = mean(unlist(titanic_train$Fare), na.rm=TRUE))

colSums(is.na(titanic_test))
##   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0
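
The imputation above takes the fill values from the training data and reuses them for the test data, so the test set does not influence preprocessing. A minimal base R sketch of the same idea, assuming a hypothetical helper `impute_mean()` and made-up ages:

```r
# Hypothetical helper: replace NA in x with a fill value computed elsewhere
impute_mean <- function(x, fill) {
  x[is.na(x)] <- fill
  x
}

# Made-up ages for illustration only
train_age <- c(20, 30, NA, 40)
test_age  <- c(NA, 50)

# The fill value comes from the training data only
age_fill <- mean(train_age, na.rm = TRUE)

train_age_filled <- impute_mean(train_age, age_fill)
test_age_filled  <- impute_mean(test_age, age_fill)

train_age_filled
test_age_filled
```

Filling with the mean leaves the column mean unchanged, which is why computing mean(titanic_train$Age) after the training data has been imputed still gives the same fill value for the test data.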

Check Data Proportion

prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The classes are split roughly 62% to 38%, which is balanced enough to proceed to the next step.
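
This proportion also sets the baseline that any model has to beat: always predicting the majority class (0) would already be right about 62% of the time. A small sketch, using the class counts that follow from the proportions above and the 891 training rows (549 non-survivors, 342 survivors):

```r
# Class counts in the training data (0 = did not survive, 1 = survived)
class_counts <- c("0" = 549, "1" = 342)

class_prop        <- class_counts / sum(class_counts)
majority_class    <- names(which.max(class_counts))
baseline_accuracy <- max(class_prop)

round(class_prop, 4)
majority_class
round(baseline_accuracy, 4)
```

caret reports this same baseline as the No Information Rate in the confusionMatrix() output.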

Naive Bayes Model

The first model is a Naive Bayes model.

Model

model_naive <- naiveBayes(Survived ~ ., data = titanic_train)

Model Evaluation on the Training Data

preds_naive_train <- predict(model_naive, newdata = titanic_train)

confusionMatrix(preds_naive_train, titanic_train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 502 142
##          1  47 200
##                                           
##                Accuracy : 0.7879          
##                  95% CI : (0.7595, 0.8143)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5268          
##                                           
##  Mcnemar's Test P-Value : 8.059e-12       
##                                           
##             Sensitivity : 0.9144          
##             Specificity : 0.5848          
##          Pos Pred Value : 0.7795          
##          Neg Pred Value : 0.8097          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5634          
##    Detection Prevalence : 0.7228          
##       Balanced Accuracy : 0.7496          
##                                           
##        'Positive' Class : 0               
## 
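
Because the positive class here is 0, Sensitivity measures how well the model identifies passengers who did not survive, and Specificity how well it identifies survivors. The sketch below recomputes the main figures from the four counts in the matrix above; the variable names are illustrative.

```r
# Counts from the Naive Bayes confusion matrix
# (rows = prediction, columns = reference)
cm <- matrix(c(502, 47, 142, 200), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true class 1 predicted as 1
pos_pred    <- cm["0", "0"] / sum(cm["0", ])  # predicted 0 that are truly 0

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred = pos_pred), 4)
```

To report the metrics from the survivors' point of view instead, confusionMatrix() accepts a positive argument, for example positive = "1".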

Predict

preds_naive <- predict(model_naive, newdata = titanic_test)

table(preds_naive)
## preds_naive
##   0   1 
## 295 123

Decision Tree Model

The second model is a decision tree, fitted as a conditional inference tree with ctree() from partykit.

Model

model_dt <- ctree(Survived ~ ., data = titanic_train)

plot(model_dt, type="simple")

Model Evaluation on the Training Data

preds_dt_train <- predict(model_dt, newdata = titanic_train)

confusionMatrix(preds_dt_train, titanic_train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 492  96
##          1  57 246
##                                           
##                Accuracy : 0.8283          
##                  95% CI : (0.8019, 0.8525)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.629           
##                                           
##  Mcnemar's Test P-Value : 0.002125        
##                                           
##             Sensitivity : 0.8962          
##             Specificity : 0.7193          
##          Pos Pred Value : 0.8367          
##          Neg Pred Value : 0.8119          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5522          
##    Detection Prevalence : 0.6599          
##       Balanced Accuracy : 0.8077          
##                                           
##        'Positive' Class : 0               
## 
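
Kappa compares the observed accuracy with the agreement expected by chance from the row and column totals. As a sketch, the value for the decision tree can be reproduced from the matrix above (variable names are illustrative):

```r
# Counts from the decision tree confusion matrix
# (rows = prediction, columns = reference)
cm <- matrix(c(492, 57, 96, 246), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n        <- sum(cm)
observed <- sum(diag(cm)) / n                      # plain accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
kappa_dt <- (observed - expected) / (1 - expected)

round(c(observed = observed, expected = expected, kappa = kappa_dt), 4)
```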

Predict

preds_dt <- predict(model_dt, newdata = titanic_test)

table(preds_dt)
## preds_dt
##   0   1 
## 264 154

On the test data, model_dt predicts more survivors (154) than model_naive (123).

Random Forest Model

The third model is a Random Forest model.

library(animation)
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l") 
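
The animation illustrates the idea; in code, k-fold cross-validation amounts to assigning every row to one of k folds and holding each fold out once. A base R sketch for 891 rows and 4 folds, repeated 4 times (caret builds its own folds internally, so its exact assignments will differ from this illustration):

```r
set.seed(382)
n_rows  <- 891
k       <- 4
repeats <- 4

# For each repeat, shuffle a balanced vector of fold labels 1..k;
# the result is a matrix with one column of fold labels per repeat
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n_rows)))

# Rows held out per fold in the first repeat, and the resulting training sizes
holdout_sizes <- table(fold_ids[, 1])
train_sizes   <- n_rows - as.integer(holdout_sizes)

holdout_sizes
train_sizes
```

With 891 rows and 4 folds, each model is trained on 668 or 669 rows and evaluated on the remaining 222 or 223.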

Model

set.seed(382)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # 4-fold cross-validation, repeated 4 times
model_forest <- train(Survived ~ ., data=titanic_train, method="rf", trControl = ctrl)

model_forest
## Random Forest 
## 
## 891 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 668, 669, 668, 668, 668, 668, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8271669  0.6165041
##    6    0.8294041  0.6331556
##   10    0.8148137  0.6044607
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
varImp(model_forest)
## rf variable importance
## 
##           Overall
## Sexmale   100.000
## Fare       83.710
## Age        76.710
## Pclass3    24.243
## SibSp      13.248
## Parch       8.110
## EmbarkedS   3.311
## Pclass2     2.081
## EmbarkedC   1.799
## EmbarkedQ   0.000

Sexmale is the most important variable for the prediction results, followed by Fare and Age.
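
Names such as Sexmale and Pclass3 appear because the formula interface expands each factor into dummy columns, one per level except the first. A sketch with base R's model.matrix() on a made-up data frame whose factor levels mirror the training data (the rows are illustrative, and the blank Embarked level is an assumption inferred from the three Embarked rows in the importance table):

```r
# Made-up rows; only the column types and factor levels matter here
toy <- data.frame(
  Pclass   = factor(c(1, 2, 3, 3)),
  Sex      = factor(c("male", "female", "female", "male")),
  Age      = c(22, 38, 26, 35),
  SibSp    = c(1L, 1L, 0L, 1L),
  Parch    = c(0L, 0L, 0L, 0L),
  Fare     = c(7.25, 71.28, 7.93, 53.10),
  Embarked = factor(c("S", "C", "", "Q"), levels = c("", "C", "Q", "S"))
)

# Expand factors into dummy columns and drop the intercept column
X <- model.matrix(~ ., data = toy)[, -1]

colnames(X)
ncol(X)
```

The seven predictors become ten columns, which is consistent with the mtry values of up to 10 in the tuning results.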

model_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 17.51%
## Confusion matrix:
##     0   1 class.error
## 0 486  63   0.1147541
## 1  93 249   0.2719298
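
The OOB error rate and the per-class errors follow directly from the out-of-bag confusion matrix, in which rows are the true classes. A short sketch that recomputes them (variable names are illustrative):

```r
# Out-of-bag confusion matrix from the final model
# (rows = actual class, columns = predicted class)
oob <- matrix(c(486, 93, 63, 249), nrow = 2,
              dimnames = list(Actual    = c("0", "1"),
                              Predicted = c("0", "1")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # share of misclassified rows
class_error <- 1 - diag(oob) / rowSums(oob)    # error within each true class

round(oob_error, 4)
round(class_error, 4)
```

An OOB accuracy of about 82.5% is close to the cross-validated accuracy of 0.8294 reported for mtry = 6. The higher error for class 1 shows that the forest misses survivors more often than non-survivors.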

Model Evaluation on the Training Data

preds_rf_train <- predict(model_forest, newdata = titanic_train)

confusionMatrix(preds_rf_train, titanic_train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 543  19
##          1   6 323
##                                           
##                Accuracy : 0.9719          
##                  95% CI : (0.9589, 0.9818)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9403          
##                                           
##  Mcnemar's Test P-Value : 0.0164          
##                                           
##             Sensitivity : 0.9891          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9662          
##          Neg Pred Value : 0.9818          
##              Prevalence : 0.6162          
##          Detection Rate : 0.6094          
##    Detection Prevalence : 0.6308          
##       Balanced Accuracy : 0.9668          
##                                           
##        'Positive' Class : 0               
## 
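
The 97% accuracy above is measured on the same rows the forest was trained on, so it is much higher than the cross-validated accuracy (0.8294) and the OOB estimate. One common way to get a less optimistic figure is to hold out part of the labelled data before training. A minimal base R sketch of an 80/20 split, using a made-up stand-in for titanic_train (the split ratio and variable names are illustrative choices, not part of the original analysis):

```r
set.seed(382)

# Made-up stand-in for the labelled training data
labelled <- data.frame(
  Survived = factor(rep(c(0, 1), times = c(549, 342))),
  Fare     = runif(891)
)

# Randomly pick 80% of the row indices for training
train_idx  <- sample(nrow(labelled), size = floor(0.8 * nrow(labelled)))
train_part <- labelled[train_idx, ]
valid_part <- labelled[-train_idx, ]

c(train = nrow(train_part), valid = nrow(valid_part))
```

A model fitted on train_part could then be scored with confusionMatrix() on valid_part, which it has not seen during training.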

Predict

preds_rf <- predict(model_forest, newdata = titanic_test)

table(preds_rf)
## preds_rf
##   0   1 
## 269 149
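
To compare the three sets of test predictions side by side, the counts from the tables above can be turned into predicted survival rates. A small sketch (the numbers are copied from the prediction tables; the variable names are illustrative):

```r
# Predicted survivors (class 1) on the 418 test rows, per model
predicted_survivors <- c(model_naive = 123, model_dt = 154, model_forest = 149)
n_test <- 418

predicted_rate <- predicted_survivors / n_test
dt_minus_rf <- predicted_survivors[["model_dt"]] -
  predicted_survivors[["model_forest"]]

round(predicted_rate, 3)
dt_minus_rf
```

All three predicted rates are below the 38.4% survival rate observed in the training data, with model_naive the most conservative.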

Conclusion

On the training data, model_dt (accuracy 0.8283) performs better than model_naive (0.7879), and it also predicts more survivors on the test set (154 in preds_dt versus 123 in preds_naive). model_forest scores higher still on the training data (0.9719), although its cross-validated accuracy of 0.8294 is much closer to model_dt, which suggests that part of the gap comes from the forest fitting the training rows closely. On the test set the two models predict almost the same number of survivors: model_dt predicts 154, which is 5 more than the 149 predicted by model_forest.