This is a learn by building project to predict the chance of survive of Titanic’s passsenger using Naive Bayes, Decision Tree & Random Forest Analysis method.
RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of April 15, 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters.
The analysis will use dataset from https://www.kaggle.com/c/titanic.
The data will be splitted into training and testing dataset.
Training dataset will be used to build machine learning models.
Testing dataset will be used to see how well machine learning models perform on unseen data.
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Warning: package 'plotly' was built under R version 3.6.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Warning: package 'GGally' was built under R version 3.6.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
## Warning: package 'funModeling' was built under R version 3.6.2
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.2
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
##
## subplot
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
##
## Attaching package: 'funModeling'
## The following object is masked from 'package:GGally':
##
## range01
## Warning: package 'lmtest' was built under R version 3.6.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Warning: package 'car' was built under R version 3.6.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Warning: package 'MLmetrics' was built under R version 3.6.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
## Warning: package 'e1071' was built under R version 3.6.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
## Warning: package 'rsample' was built under R version 3.6.2
## Warning: package 'caret' was built under R version 3.6.2
##
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
##
## MAE, RMSE
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
## Warning: package 'ROCR' was built under R version 3.6.2
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.6.2
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## Warning: package 'partykit' was built under R version 3.6.2
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 3.6.2
## Loading required package: mvtnorm
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The name of variables are as follows:
1. PassengerId : Id number
2. Survived : Survival (0 = No, 1 = Yes)
3. Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
4. Name : Name
5. Sex : Sex
6. Age : Age in years
7. SibSp : # of siblings/spouses aboard the Titanic
8. Parch : # of parents/children aboard the Titanic
9. Ticket : Ticket number
10.Fare : Passenger fare
11.Cabin : Cabin number
12.Embarked : Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
Sibsp: The dataset defines family relations in this way
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: The dataset defines family relations in this way
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
titanic <- titanic %>%
select(-PassengerId, -Name, -Ticket, -Fare) %>%
mutate(Pclass = as.factor(Pclass),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch))
str(titanic)## 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The variables PassengerId, Name, Ticket, Fare are excluded from predictor variable due to no relationship with target variable Survived.
## Survived Pclass Sex Age SibSp Parch Cabin Embarked
## 0 0 0 177 0 0 0 0
There is 177 na or missing values in variable Age of dataset.
We will make a mean imputation of missing values, which is simply calculate the mean of the observed values for that variable for all individuals who are non-missing.
## 'data.frame': 891 obs. of 9 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ Age_1 : num 22 38 26 35 35 ...
## Survived Pclass Sex Age SibSp Parch Cabin Embarked
## 0 0 0 177 0 0 0 0
## Age_1
## 0
## Survived Pclass Sex SibSp Parch Cabin Embarked Age_1
## 0 0 0 0 0 0 0 0
We will divide variable Age_1 into 9 (nine) age groups and convert into factor data type.
age_group <- function(x){
if (x < 6) {
"Toddler"
}
else if (x > 5 & x < 12) {
"Child"
}
else if (x > 12 & x < 17) {
"Early Teenager"
}
else if (x > 17 & x < 26) {
"Late Teenager"
}
else if (x > 26 & x < 36) {
"Early Adult"
}
else if (x > 36 & x < 46) {
"Late Adult"
}
else if (x > 46 & x < 56) {
"Early Elderly"
}
else if (x > 56 & x < 66) {
"Late Elderly"
}
else
"Old Man"
} ## 'data.frame': 891 obs. of 10 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ Age_1 : num 22 38 26 35 35 ...
## $ Age_2 : Factor w/ 9 levels "Child","Early Adult",..: 7 5 8 2 2 2 3 9 2 4 ...
titanic_data <- titanic_new2 %>%
select(-Age, -Age_1, -Cabin) %>%
mutate(Survived = as.factor(Survived))
str(titanic_data)## 'data.frame': 891 obs. of 7 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ Age_2 : Factor w/ 9 levels "Child","Early Adult",..: 7 5 8 2 2 2 3 9 2 4 ...
The model for solving business problem is to predict the chance of Titanic’s passengers getting survived.
The model will be developed using variables as follows:
Target variable: Survived
Predictor variable: Pclass, Sex, SibSp, Parch, Cabin, Embarked, Age_2
## Survived Pclass Sex SibSp Parch Embarked Age_2
## 0:549 1:216 female:314 0:608 0:678 : 2 Early Adult :355
## 1:342 2:184 male :577 1:209 1:118 C:168 Late Teenager:188
## 3:491 2: 28 2: 80 Q: 77 Late Adult : 94
## 3: 16 3: 5 S:644 Old Man : 69
## 4: 18 4: 4 Early Elderly: 59
## 5: 5 5: 5 Toddler : 44
## 8: 7 6: 1 (Other) : 82
##
## 1 2 3
## 0 0.09 0.11 0.42
## 1 0.15 0.10 0.13
Most of passengers were in the 3rd class (55%).
##
## female male
## 0 0.09 0.53
## 1 0.26 0.12
Most of passengers were male (65%).
##
## 0 1 2 3 4 5 8
## 0 0.45 0.11 0.02 0.01 0.02 0.01 0.01
## 1 0.24 0.13 0.01 0.00 0.00 0.00 0.00
None of passengers were with siblings/spouses (69%).
##
## 0 1 2 3 4 5 6
## 0 0.50 0.06 0.04 0.00 0.00 0.00 0.00
## 1 0.26 0.07 0.04 0.00 0.00 0.00 0.00
None of passengers were with parents/child (76%).
##
## Child Early Adult Early Elderly Early Teenager Late Adult Late Elderly
## 0 0.02 0.25 0.04 0.02 0.07 0.02
## 1 0.01 0.14 0.03 0.02 0.04 0.01
##
## Late Teenager Old Man Toddler
## 0 0.14 0.05 0.01
## 1 0.07 0.03 0.03
Most of passengers were early and late adult (50%).
##
## 0 1
## 0.62 0.38
Based on the proportion of the target variable above, we can conclude that our target variable can be considered to be imbalance. Hence we will have to balance the same before using it for our models.
# Splitting data
set.seed(100)
split_nb <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_nb <- training(split_nb)
test_nb <- testing(split_nb)##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.6158192 0.3841808
# Upsampling data
set.seed(100)
train_nb_up <- upSample(x= train_nb[,-1],y = train_nb$Survived, yname = "Survived")
prop.table(table(train_nb_up$Survived))##
## 0 1
## 0.5 0.5
##
## 0 1
## 440 274
##
## 0 1
## 440 440
Now that we have the train (already upsampled) and test datasets. We will store the ground truth labels as well. They are necessary for the training of the classifer as well as simply model evaluation later.
# Splitting data
set.seed(100)
split_dt <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_dt <- training(split_dt)
test_dt <- testing(split_dt)##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.6158192 0.3841808
# Upsampling data
set.seed(100)
train_dt_up <- upSample(x= train_dt[,-1],y = train_dt$Survived, yname = "Survived")
prop.table(table(train_dt_up$Survived))##
## 0 1
## 0.5 0.5
##
## 0 1
## 440 274
##
## 0 1
## 440 440
Now that we have the train (already upsampled) and test datasets. We will store the ground truth labels as well. They are necessary for the training of the classifer as well as simply model evaluation later.
# Splitting data
set.seed(100)
split_rf <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_rf <- training(split_rf)
test_rf <- testing(split_rf)##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.6158192 0.3841808
# Upsampling data
set.seed(100)
train_rf_up <- upSample(x= train_rf[,-1],y = train_rf$Survived, yname = "Survived")
prop.table(table(train_rf_up$Survived))##
## 0 1
## 0.5 0.5
##
## 0 1
## 440 274
##
## 0 1
## 440 440
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## Pclass
## Y 1 2 3
## 0 0.1512415 0.1805869 0.6681716
## 1 0.3724605 0.2641084 0.3634312
##
## Sex
## Y female male
## 0 0.1425339 0.8574661
## 1 0.6764706 0.3235294
##
## SibSp
## Y 0 1 2 3 4 5
## 0 0.711409396 0.176733781 0.031319911 0.022371365 0.029082774 0.013422819
## 1 0.592841163 0.333333333 0.046979866 0.013422819 0.008948546 0.002237136
## SibSp
## Y 8
## 0 0.015659955
## 1 0.002237136
##
## Parch
## Y 0 1 2 3 4 5
## 0 0.791946309 0.100671141 0.078299776 0.006711409 0.008948546 0.008948546
## 1 0.666666667 0.205816555 0.102908277 0.013422819 0.002237136 0.006711409
## Parch
## Y 6
## 0 0.004474273
## 1 0.002237136
##
## Embarked
## Y C Q S
## 0 0.002252252 0.135135135 0.096846847 0.765765766
## 1 0.006756757 0.283783784 0.085585586 0.623873874
##
## Age_2
## Y Child Early Adult Early Elderly Early Teenager Late Adult Late Elderly
## 0 0.03118040 0.39643653 0.05790646 0.03118040 0.11358575 0.03340757
## 1 0.02227171 0.35634744 0.08463252 0.05345212 0.11135857 0.02449889
## Age_2
## Y Late Teenager Old Man Toddler
## 0 0.22271715 0.08240535 0.03118040
## 1 0.18485523 0.06904232 0.09354120
##
## Model formula:
## Survived ~ Pclass + Sex + SibSp + Parch + Embarked + Age_2
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 208, err = 3.4%)
## | | [4] Pclass in 3
## | | | [5] Age_2 in Child, Early Elderly, Late Adult: 0 (n = 18, err = 11.1%)
## | | | [6] Age_2 in Early Adult, Early Teenager, Late Elderly, Late Teenager, Old Man, Toddler
## | | | | [7] Embarked in C, Q: 1 (n = 59, err = 15.3%)
## | | | | [8] Embarked in S: 1 (n = 75, err = 40.0%)
## | [9] Sex in male
## | | [10] Pclass in 1: 0 (n = 120, err = 47.5%)
## | | [11] Pclass in 2, 3
## | | | [12] Age_2 in Child, Early Adult, Early Elderly, Early Teenager, Late Adult, Late Elderly, Late Teenager, Old Man
## | | | | [13] Embarked in C
## | | | | | [14] Parch in 0: 0 (n = 42, err = 26.2%)
## | | | | | [15] Parch in 1: 1 (n = 10, err = 30.0%)
## | | | | [16] Embarked in Q, S: 0 (n = 321, err = 15.0%)
## | | | [17] Age_2 in Toddler
## | | | | [18] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## | | | | [19] SibSp in 3, 4, 5: 0 (n = 10, err = 20.0%)
##
## Number of inner nodes: 9
## Number of terminal nodes: 10
## Warning: `repeats` has no meaning for this resampling method.
## Random Forest
##
## 880 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 704, 704, 704, 704, 704
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7693182 0.5386364
## 14 0.8034091 0.6068182
## 26 0.7965909 0.5931818
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
## [1] 1 0 0 1 1 1
## Levels: 0 1
## 2 14 22 25 29 31
## 1 0 0 0 1 0
## Levels: 0 1
## 0 1
## 2 0.03365385 0.9663462
## 14 0.85046729 0.1495327
## 22 0.85046729 0.1495327
## 25 0.88888889 0.1111111
## 29 0.15254237 0.8474576
## 31 0.52500000 0.4750000
## [1] 1 0 0 0 1 1
## Levels: 0 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 14
## 1 24 54
##
## Accuracy : 0.7853
## 95% CI : (0.7174, 0.8434)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.07e-06
##
## Kappa : 0.5585
##
## Mcnemar's Test P-Value : 0.1443
##
## Sensitivity : 0.7941
## Specificity : 0.7798
## Pos Pred Value : 0.6923
## Neg Pred Value : 0.8586
## Prevalence : 0.3842
## Detection Rate : 0.3051
## Detection Prevalence : 0.4407
## Balanced Accuracy : 0.7870
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 17
## 1 16 51
##
## Accuracy : 0.8136
## 95% CI : (0.7483, 0.8681)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.058e-08
##
## Kappa : 0.6049
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7500
## Specificity : 0.8532
## Pos Pred Value : 0.7612
## Neg Pred Value : 0.8455
## Prevalence : 0.3842
## Detection Rate : 0.2881
## Detection Prevalence : 0.3785
## Balanced Accuracy : 0.8016
##
## 'Positive' Class : 1
##
titan_roc <- ROCR::prediction(pred_dt_prob[,2], test_dt$Survived)
plot(performance(titan_roc, "tpr", "fpr"))## [1] 0.8600243
The AUC number 0.8579331, closed to 1, meaning that the classification model have a good prediction between two classes.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 20
## 1 17 48
##
## Accuracy : 0.791
## 95% CI : (0.7236, 0.8483)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 4.548e-07
##
## Kappa : 0.5545
##
## Mcnemar's Test P-Value : 0.7423
##
## Sensitivity : 0.7059
## Specificity : 0.8440
## Pos Pred Value : 0.7385
## Neg Pred Value : 0.8214
## Prevalence : 0.3842
## Detection Rate : 0.2712
## Detection Prevalence : 0.3672
## Balanced Accuracy : 0.7750
##
## 'Positive' Class : 1
##
The evaluation of confusion matrix of Naive Bayes method from data train are as follows:
The evaluation of confusion matrix of Decision Tree method from data train are as follows:
The evaluation of confusion matrix of Random Forest method from data train are as follows:
From comparison of three confusion matrixes above, the accuracy and precision level of Decision Tree is the highest. Nevertheless, the recall level of Decision Tree is lower than Naive Bayes. The FN (False Negative) of Decision Tree 17, while the FN of Naive Bayes is 14. If we have a point of view that in such condition the probability to survive is small, we will choose Decision Tree method as our best prediction model.