## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
A is a planning officer in a rescue organisation. Upon the news of the Titanic sinking, she was required to determine how many rescuers her organisation should deploy to the site. The number of rescuers is decided from the number of passengers who would likely have survived, estimated from the passenger manifest data.
Before creating the model, A needed to ensure that her data was ready:
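The two outputs below are not echoed as code in this write-up; they were likely produced by something along these lines (a sketch only; the raw data frame name titanic is an assumption):
# check whether any values are missing, then count missing values per column
anyNA(titanic)
colSums(is.na(titanic))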
## [1] TRUE
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
A found that there were 177 missing values in the Age variable. Since Age was deemed relevant to the target variable, she needed to impute the missing rows in Age with its mean or median. A would need to explore the Age variable first.
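The histogram itself is not reproduced here; a minimal sketch of the exploration step, again assuming the raw data frame is named titanic, would be:
# plot the distribution of Age (hist() drops the missing values)
hist(titanic$Age, breaks = 20, main = "Distribution of Age", xlab = "Age")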
From this histogram, the Age data looks fairly balanced (approximately normally distributed). She then confirmed the exact median and mean of the Age variable with the summary() function:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
The median and mean are close to each other, so either value is acceptable to use. She imputed the missing values in the Age variable with its mean value.
A saved the imputed dataset into the titanic_clean object.
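The imputation chunk is not echoed in this write-up; a minimal sketch, assuming the raw data frame is named titanic, would be:
# replace missing ages with the mean of the observed ages
titanic_clean <- titanic %>%
  mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age))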
To ensure the imputation worked, she checked the data summary again:
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:22.00
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :29.70
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:35.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
A checked the updated data structure with the str() function:
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Variable explanation:
- Survived: passenger survival status
- Pclass: passenger ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: passenger sex
- Age: passenger age in years
- SibSp: number of siblings/spouses aboard the Titanic
- Parch: number of parents/children aboard the Titanic
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Target variable: Survived, where "1" = survived and "0" = did not survive.
A identified five variables that carry no useful information for deciding whether a passenger survived:
- PassengerId
- Ticket
- Name
- Cabin
- Fare (its information is already represented by the Pclass variable)
So she removed them from the dataset to speed up the modelling process. A also noticed that some variables needed to be converted to the factor data type:
titanic_clean <- titanic_clean %>%
  select(-PassengerId, -Ticket, -Name, -Cabin, -Fare) %>%
  mutate(Pclass = as.factor(Pclass)) %>%
  mutate(Survived = as.factor(Survived)) %>%
  mutate(Embarked = as.factor(Embarked)) %>%
  mutate(SibSp = as.factor(SibSp)) %>%
  mutate(Parch = as.factor(Parch))
Based on A's business knowledge, age is a significant predictor of whether a passenger survived. To include Age in the model generation, she binned the Age variable by writing a new function, agefunct:
# bin a single numeric age into a labelled age group
agefunct <- function(x){
  if (x < 10) {"<10"}
  else if (x < 20) {"10-19"}
  else if (x < 30) {"20-29"}
  else if (x < 40) {"30-39"}
  else if (x < 50) {"40-49"}
  else if (x < 60) {"50-59"}
  else {"elder"}
}
She then applied the function and created a new column named Ageconv:
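The chunk that applies the function is not echoed; a minimal sketch, assuming sapply() over the imputed Age column, would be:
# apply the binning function to each passenger's age and store the labels in a new column
titanic_clean$Ageconv <- sapply(titanic_clean$Age, agefunct)
str(titanic_clean)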
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Embarked: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ageconv : chr "20-29" "30-39" "20-29" "30-39" ...
Since she now had the Ageconv column, the original Age column was no longer needed. The Ageconv column also had to be converted to a factor. A assigned this final data frame to the titanic_use object.
titanic_use <- titanic_clean %>%
  select(-Age) %>%
  mutate(Ageconv = as.factor(Ageconv)) %>%
  mutate(SibSp = as.factor(SibSp)) %>%
  mutate(Parch = as.factor(Parch))
## 'data.frame': 891 obs. of 7 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Embarked: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ageconv : Factor w/ 7 levels "<10","10-19",..: 3 4 3 4 4 3 6 1 3 2 ...
Before proceeding further, she checked whether the target variable in the source data was balanced enough to generate a model:
##
## 0 1
## 0.6161616 0.3838384
A detected no severe class imbalance in the target variable (roughly 62% did not survive versus 38% survived), so she confirmed that she could use this dataset as-is.
A then split her dataset into training and testing data so that she could check whether her model overfits. She decided to use 80% of randomly selected rows for generating the model and the remaining 20% for testing it.
library(rsample)
set.seed(921)
splitting <- initial_split(data = titanic_use, prop = 0.8, strata = Survived)
titanic_train <- training(splitting)
titanic_test <- testing(splitting)
prop.table(table(titanic_use$Survived))
##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.6158192 0.3841808
The first method A used was Naive Bayes. She had already converted every variable to the factor data type, so she could use them directly to generate a prediction model. A assigned the generated model to the titanic_nb object.
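The fitting chunk is not echoed in this write-up; a minimal sketch, assuming the naiveBayes() function from the e1071 package, would be:
library(e1071)
# fit a Naive Bayes classifier that predicts Survived from all remaining variables
titanic_nb <- naiveBayes(Survived ~ ., data = titanic_train)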
After generating the titanic_nb model, A applied it to her testing dataset using the predict() function and assigned the result to the nb_pred object.
A evaluated her model by comparing how many predictions were correct against the actual outcomes, using the confusionMatrix() function:
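Only the resulting confusion matrix is shown below; a minimal sketch of the prediction and evaluation step, with "1" (survived) as the positive class, would be:
# predict survival classes for the testing data and compare them with the actual labels
nb_pred <- predict(titanic_nb, newdata = titanic_test)
caret::confusionMatrix(data = nb_pred, reference = titanic_test$Survived, positive = "1")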
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 95 21
## 1 14 47
##
## Accuracy : 0.8023
## 95% CI : (0.7359, 0.8582)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 7.432e-08
##
## Kappa : 0.5738
##
## Mcnemar's Test P-Value : 0.3105
##
## Sensitivity : 0.6912
## Specificity : 0.8716
## Pos Pred Value : 0.7705
## Neg Pred Value : 0.8190
## Prevalence : 0.3842
## Detection Rate : 0.2655
## Detection Prevalence : 0.3446
## Balanced Accuracy : 0.7814
##
## 'Positive' Class : 1
##
As a rescue planner, she needed to deploy as few rescuers as possible while still keeping the rescue operation effective. She therefore focused on her model's precision: she did not want passengers who did not survive to be classified into the survived category, since every predicted survivor triggers a rescuer deployment.
In the summary above, her model's precision (Pos Pred Value) is only 77.05% when classifying surviving passengers.
A also re-evaluated her Naive Bayes model using the ROC curve and the AUC value, expecting the AUC of a useful model to be much closer to 1 than to 0.5.
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## 0 1
## [1,] 0.07534514 0.92465486
## [2,] 0.94823511 0.05176489
## [3,] 0.69927932 0.30072068
## [4,] 0.58979035 0.41020965
## [5,] 0.03792225 0.96207775
## [6,] 0.60088792 0.39911208
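The probabilities above and the roc_df object used below come from a chunk that is not echoed; a minimal sketch, assuming class probabilities from the e1071 model, would be:
# predicted class probabilities for the testing data (one column per class)
nb_prob <- predict(titanic_nb, newdata = titanic_test, type = "raw")
# pair the probability of the positive class with the true labels for ROCR
roc_df <- data.frame(prediction = nb_prob[, "1"], trueclass = titanic_test$Survived)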
titanic_roc <- ROCR::prediction(roc_df$prediction, roc_df$trueclass)
plot(performance(titanic_roc, "tpr", "fpr"))
## [1] 0.8668376
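The AUC value printed above is computed by a chunk that is not echoed; a minimal sketch using ROCR would be:
# extract the area under the ROC curve from a ROCR performance object
auc_perf <- ROCR::performance(titanic_roc, "auc")
auc_perf@y.values[[1]]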
With the ROC/AUC evaluation, her titanic_nb model achieved an AUC of about 0.867, which is much closer to 1 than to 0.5.
The second algorithm A used was Decision Tree. To keep the comparison consistent, A reused the previously split data, then assigned the generated model to the titanic_dt object.
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
A plotted the model to get a better visualization, which showed the proportion of classification error in each terminal bucket of her prediction.
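The fitting and plotting chunk is not echoed; judging from the packages loaded above (grid, libcoin, mvtnorm), the tree was most likely a conditional inference tree, so a sketch assuming partykit::ctree() would be:
library(partykit)
# grow a conditional inference tree on the training data and plot it
titanic_dt <- ctree(Survived ~ ., data = titanic_train)
plot(titanic_dt)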
After plotting, A noticed that the decision tree might not be the best predictor either, since it made several errors when classifying the survival status into its buckets. Nevertheless, she continued to confirm the model's performance with a confusion matrix.
After generating the titanic_dt model, A applied it to her testing dataset using the predict() function and assigned the result to the dt_pred object.
A compared her predictions with the actual outcomes using confusionMatrix():
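As with the Naive Bayes model, only the output is shown; a minimal sketch of the prediction and evaluation step would be:
# predict survival classes with the decision tree and evaluate them against the actual labels
dt_pred <- predict(titanic_dt, newdata = titanic_test)
caret::confusionMatrix(data = dt_pred, reference = titanic_test$Survived, positive = "1")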
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 15
## 1 16 53
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6309
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7794
## Specificity : 0.8532
## Pos Pred Value : 0.7681
## Neg Pred Value : 0.8611
## Prevalence : 0.3842
## Detection Rate : 0.2994
## Detection Prevalence : 0.3898
## Balanced Accuracy : 0.8163
##
## 'Positive' Class : 1
##
Applying the same criterion as before, she focused on the model's precision: passengers predicted to survive should, as far as possible, actually be survivors, so that no rescuers are deployed unnecessarily.
In the summary above, her second model's precision (Pos Pred Value) is only 76.81% when classifying surviving passengers.
A compared the performance of her models as follows:
In this case, the Naive Bayes algorithm generated the more precise model (Pos Pred Value of 77.05% versus 76.81% for the Decision Tree), so it better fits her deployment criterion, even though the Decision Tree achieved the higher overall accuracy (82.49% versus 80.23%).