Machine Learning: Classification II
About Project
RMS Titanic was a British passenger liner, operated by the White Star Line, which sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, to New York City, United States. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it the deadliest sinking of a single ship up to that time. It remains the deadliest peacetime sinking of a superliner or cruise ship. The disaster drew public attention, provided foundational material for the disaster film genre, and has inspired many artistic works.
RMS Titanic was the largest ship afloat at the time she entered service and the second of three Olympic-class ocean liners operated by the White Star Line. She was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, the chief naval architect of the shipyard, died in the disaster. Titanic was under the command of Captain Edward Smith, who went down with the ship. The ocean liner carried some of the wealthiest people in the world, as well as hundreds of emigrants from Great Britain and Ireland, Scandinavia, and elsewhere throughout Europe, who were seeking a new life in the United States and Canada.
The first-class accommodation was designed to be the pinnacle of comfort and luxury, with a gymnasium, swimming pool, libraries, high-class restaurants, and opulent cabins. A high-powered radiotelegraph transmitter was available for sending passenger “marconigrams” and for the ship’s operational use. The Titanic had advanced safety features, such as watertight compartments and remotely activated watertight doors, contributing to its reputation as “unsinkable”.
Titanic was equipped with 16 lifeboat davits, each capable of lowering three lifeboats, for a total of 48 boats; she carried only 20 lifeboats, four of which were collapsible and proved hard to launch while she was sinking. Together, the 20 lifeboats could hold 1,178 people—about half the number of passengers on board, and one third of the number of passengers the ship could have carried at full capacity (consistent with the maritime safety regulations of the era). When the ship sank, many of the lifeboats that had been lowered were only about half full. (source: wikipedia)
Data Description
The Titanic dataset contains the following information about each passenger:
- Survived: Survival, 0 = No, 1 = Yes
- Pclass: Ticket Class, 1 = 1st/Upper, 2 = 2nd/Middle, 3 = 3rd/Lower
- Name: Passenger Name
- Sex: Passenger Sex
- Age: Passenger age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
Import Libraries
library(dplyr)
library(caret)
library(rsample)
library(e1071)
library(randomForest)
library(partykit)
library(imputeTS)
Import Dataset
titanic_train <- read.csv("train.csv")
titanic_test <- read.csv("test.csv")
Data Wrangling and Cleaning
After importing the datasets, we must clean the data.
Check Data
glimpse(titanic_train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
glimpse(titanic_test)
## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
## $ Sex <chr> "male", "female", "male", "male", "female", "male", "femal…
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
## $ Cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
## $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
Remove Unused Columns
From the dataset, drop the columns that are mostly unique character values (identifiers and free text), since they are not useful as predictors.
titanic_train <- titanic_train %>%
select(-c("PassengerId", "Name", "Ticket", "Cabin"))
titanic_test <- titanic_test %>%
select(-c("PassengerId", "Name", "Ticket", "Cabin"))
Convert Data Type
titanic_train <- titanic_train %>%
mutate_at(.vars=c("Survived", "Pclass", "Sex", "Embarked"), .funs=as.factor)
glimpse(titanic_train)
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex <fct> male, female, female, female, male, male, male, male, female,…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
titanic_test <- titanic_test %>%
mutate_at(.vars=c("Pclass", "Sex", "Embarked"), .funs=as.factor)
glimpse(titanic_test)
## Rows: 418
## Columns: 7
## $ Pclass <fct> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3, 1…
## $ Sex <fct> male, female, male, male, female, male, female, male, female,…
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0, N…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 29.0…
## $ Embarked <fct> Q, S, Q, S, S, S, Q, S, C, S, S, S, S, S, S, C, Q, C, S, C, C…
Check Missing Values
If there are missing values, we must fill them in.
colSums(is.na(titanic_train))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
colSums(is.na(titanic_test))
## Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 86 0 0 1 0
Replace Missing Values with the Mean
Most of the missing values are in the Age column, so fill them with the mean Age from the training data.
titanic_train <- titanic_train %>%
na_replace(fill = mean(unlist(titanic_train$Age), na.rm=TRUE))
colSums(is.na(titanic_train))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
titanic_test$Age <- titanic_test$Age %>%
na_replace(fill = mean(unlist(titanic_train$Age), na.rm=TRUE))
titanic_test$Fare <- titanic_test$Fare %>%
na_replace(fill = mean(unlist(titanic_train$Fare), na.rm=TRUE))
colSums(is.na(titanic_test))
## Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0
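Note that na_replace() above fills every NA in the data frame with the Age mean; that only works because Age is the sole column with missing values in the training set. A minimal base-R sketch that targets the columns explicitly (same result, no imputeTS needed):
# Impute Age (train and test) and Fare (test) with the training-set means
age_mean <- mean(titanic_train$Age, na.rm = TRUE)
fare_mean <- mean(titanic_train$Fare, na.rm = TRUE)
titanic_train$Age[is.na(titanic_train$Age)] <- age_mean
titanic_test$Age[is.na(titanic_test$Age)] <- age_mean
titanic_test$Fare[is.na(titanic_test$Fare)] <- fare_mean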
Check Data Proportion
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.6161616 0.3838384
A 62% / 38% split is balanced enough to proceed to the next step.
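If the classes were strongly imbalanced, caret's downSample()/upSample() helpers could rebalance the training set. An illustrative sketch, not needed for this roughly 62/38 split:
# Illustrative only: downsample the majority class to balance Survived
balanced <- downSample(x = select(titanic_train, -Survived),
                       y = titanic_train$Survived,
                       yname = "Survived")
table(balanced$Survived)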
Naive Bayes Model
The first model is a Naive Bayes model.
Model
model_naive <- naiveBayes(Survived ~ ., data = titanic_train)
Model Evaluation on Training Data
preds_naive_train <- predict(model_naive, newdata = titanic_train)
confusionMatrix(preds_naive_train, titanic_train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 502 142
## 1 47 200
##
## Accuracy : 0.7879
## 95% CI : (0.7595, 0.8143)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5268
##
## Mcnemar's Test P-Value : 8.059e-12
##
## Sensitivity : 0.9144
## Specificity : 0.5848
## Pos Pred Value : 0.7795
## Neg Pred Value : 0.8097
## Prevalence : 0.6162
## Detection Rate : 0.5634
## Detection Prevalence : 0.7228
## Balanced Accuracy : 0.7496
##
## 'Positive' Class : 0
##
Predict
preds_naive <- predict(model_naive, newdata = titanic_test)
table(preds_naive)
## preds_naive
## 0 1
## 295 123
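predict() on a naiveBayes model returns hard class labels by default; posterior probabilities can be requested with type = "raw", which is useful for checking how confident the model is. A short sketch:
# Posterior class probabilities instead of hard labels
probs_naive <- predict(model_naive, newdata = titanic_test, type = "raw")
head(probs_naive)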
Decision Tree Model
The second model is a Decision Tree model.
Model
model_dt <- ctree(Survived ~ ., data = titanic_train)
plot(model_dt, type = "simple")
Model Evaluation on Training Data
preds_dt_train <- predict(model_dt, newdata = titanic_train)
confusionMatrix(preds_dt_train, titanic_train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 492 96
## 1 57 246
##
## Accuracy : 0.8283
## 95% CI : (0.8019, 0.8525)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.629
##
## Mcnemar's Test P-Value : 0.002125
##
## Sensitivity : 0.8962
## Specificity : 0.7193
## Pos Pred Value : 0.8367
## Neg Pred Value : 0.8119
## Prevalence : 0.6162
## Detection Rate : 0.5522
## Detection Prevalence : 0.6599
## Balanced Accuracy : 0.8077
##
## 'Positive' Class : 0
##
Predict
preds_dt <- predict(model_dt, newdata = titanic_test)
table(preds_dt)
## preds_dt
## 0 1
## 264 154
From the predictions of model_dt, the total number of predicted survivors (154) is greater than the number predicted by model_naive (123).
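The complexity of the conditional inference tree can also be controlled through ctree_control(); the values below are illustrative and were not used for model_dt:
# Illustrative: a shallower, more conservative tree
model_dt_tuned <- ctree(Survived ~ ., data = titanic_train,
                        control = ctree_control(mincriterion = 0.99, # stricter split test
                                                maxdepth = 4))       # limit tree depth
plot(model_dt_tuned, type = "simple")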
Random Forest Model
The third model is a Random Forest model.
library(animation)
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
Model
set.seed(382)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # k-fold cross validation
model_forest <- train(Survived ~ ., data=titanic_train, method="rf", trControl = ctrl)
model_forest
## Random Forest
##
## 891 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 668, 669, 668, 668, 668, 668, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8271669 0.6165041
## 6 0.8294041 0.6331556
## 10 0.8148137 0.6044607
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
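caret chose mtry from its default grid; the candidate values can also be set explicitly with tuneGrid. A sketch using the same repeated-CV control (grid values are illustrative, not what was run above):
# Illustrative: tune mtry over a hand-picked grid
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8))
model_forest_grid <- train(Survived ~ ., data = titanic_train,
                           method = "rf", trControl = ctrl, tuneGrid = rf_grid)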
varImp(model_forest)
## rf variable importance
##
## Overall
## Sexmale 100.000
## Fare 83.710
## Age 76.710
## Pclass3 24.243
## SibSp 13.248
## Parch 8.110
## EmbarkedS 3.311
## Pclass2 2.081
## EmbarkedC 1.799
## EmbarkedQ 0.000
Sexmale has the highest influence on the predictions, followed by Fare and Age.
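The same importances can be plotted for a quicker read:
# Dot plot of the scaled variable importances
plot(varImp(model_forest))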
model_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 17.51%
## Confusion matrix:
## 0 1 class.error
## 0 486 63 0.1147541
## 1 93 249 0.2719298
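For reference, the OOB confusion matrix above implies an out-of-bag accuracy of about 1 - 0.1751 ≈ 0.82, which is lower than the training-set accuracy reported below, as expected. It can be checked directly from the matrix:
# OOB accuracy from the confusion matrix printed above
(486 + 249) / (486 + 63 + 93 + 249)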
Model Evaluation on Training Data
preds_rf_train <- predict(model_forest, newdata = titanic_train)
confusionMatrix(preds_rf_train, titanic_train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 543 19
## 1 6 323
##
## Accuracy : 0.9719
## 95% CI : (0.9589, 0.9818)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9403
##
## Mcnemar's Test P-Value : 0.0164
##
## Sensitivity : 0.9891
## Specificity : 0.9444
## Pos Pred Value : 0.9662
## Neg Pred Value : 0.9818
## Prevalence : 0.6162
## Detection Rate : 0.6094
## Detection Prevalence : 0.6308
## Balanced Accuracy : 0.9668
##
## 'Positive' Class : 0
##
Predict
preds_rf <- predict(model_forest, newdata = titanic_test)
table(preds_rf)
## preds_rf
## 0 1
## 269 149
Conclusion
Comparing the training-data evaluations of model_naive and model_dt, model_dt is the better model for predicting survivors, and it also predicts more survivors on the test set than model_naive (154 vs. 123). The training-data evaluation of model_forest is better again than model_dt, but the two models predict almost the same number of survivors on the test set, with model_dt predicting 5 more survivors than model_forest (154 vs. 149).
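If these predictions are to be submitted in Kaggle's format, one possible sketch is to re-read test.csv to recover the PassengerId column that was dropped during cleaning (the output file name is illustrative):
# Rebuild PassengerId and write the random forest predictions out
submission <- data.frame(PassengerId = read.csv("test.csv")$PassengerId,
                         Survived = preds_rf)
write.csv(submission, "submission_rf.csv", row.names = FALSE)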