library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

Modelling Context & Purpose

A is a planning officer in a rescuer organisation. Upon the news of Titanic sinking, she was required to determine how many rescuers her organisation should deploy to the site. The number of rescuers is decided from the number of passenger who would likely survived, according to their manifestation data.

Data Preparation

Before creating the model, A need to ensure that her data is ready:

titanic <- read.csv("data_input/train.csv")

Data Inspection

Checking Missing Value

anyNA(titanic)

## [1] TRUE

colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

A found that there were 177 missing value in Age variable. Since Age was deemed relevant to target variable, she needed to imputate the blank rows in Age variable with mean/median. A would need to explore the Age variable first

Age data exploration

hist(titanic$Age)

From this histogram, the data is pretty much balanced (normally distributed). Then, she confirmed the exact number of the median and mean from the Age variable by summary() function

summary(titanic$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

Both median and mean is within close range, so both values are acceptable to use. She will impute the missing value in Age variable with its mean value

Data Imputation

A saved the imputed dataset into titanic_clean object

titanic_clean <- titanic %>% 
  mutate(Age = 
            replace_na(Age, 
                       replace = mean(Age, 
                                      na.rm = T)))

To ensure the imputation works, she checked the data summary again.

summary(titanic_clean)

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:22.00  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :29.70  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:35.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                               
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

Ensuring that the variables are already in proper data type

A checked the newer data structure with str function

str(titanic_clean)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Variable explanation

survived : passenger survival status pclass : passenger ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) sex : passenger sex
Age : passenger age in years sibsp : the number of siblings / spouses of passengers aboarding the Titanic parch : the number of parents / children of passengers aboarding the Titanic ticket : Ticket number
fare : Passenger fare
cabin : Cabin number embarked : Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Target Value determined: Survived “1” = Survived Survived “0” = Not Survived

Removing irrelevant variables

A identified that there are 4 variables that have zero information to decide whether a passenger should survive or not: - PassengerId - Ticket - Name - Cabin - Fare (as it already represented in Pclass variable)

So she removed them from the variable to speed up the modelling process. A also noticed that some variables do need to be converted to factor data type:

titanic_clean <- titanic_clean %>% 
  select(-PassengerId, -Ticket, -Name, -Cabin, -Fare) %>% 
  mutate(Pclass = as.factor(Pclass)) %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  mutate(Embarked = as.factor(Embarked)) %>% 
  mutate(Embarked = as.factor(SibSp)) %>% 
  mutate(Embarked = as.factor(Parch))

Based on A’s business knowledge, age is a signicant predictor to determine whether passenger survived or not. To include Age to model generation, she classify the Ages variable by making new function: agefunct

agefunct <- function(x){
    if (x < 10) {x <- "<10"}
    else if (x >= 10 & x < 20) {x <- "10-19"}
  else if (x >= 20 & x < 30) {x <- "20-29"}
  else if (x >= 30 & x < 40) {x <- "30-39"}
  else if (x >= 40 & x < 50) {x <- "40-49"}
  else if (x >= 50 & x < 60) {x <- "50-59"}
  else {x <- "elder"}
}

She later run the function and make a new column named Ageconv

titanic_clean$Ageconv <- sapply(X = titanic_clean$Age,FUN =  agefunct)

str(titanic_clean)

## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Embarked: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ageconv : chr  "20-29" "30-39" "20-29" "30-39" ...

Since she already have Age conv column, Age variable column wasn’t any longer needed. Also, the Age conv column also had to be converted to factor. A assign this final dataframe to titanic_use object.

titanic_use <- titanic_clean %>% 
  select(-Age) %>% 
  mutate(Ageconv = as.factor(Ageconv)) %>% 
  mutate(SibSp = as.factor(SibSp)) %>% 
  mutate(Parch = as.factor(Parch))

str(titanic_use)

## 'data.frame':    891 obs. of  7 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Embarked: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ageconv : Factor w/ 7 levels "<10","10-19",..: 3 4 3 4 4 3 6 1 3 2 ...

Exploratory Data Analysis

Checking Class Imbalance

Before we pursue to further process, we will check whether the target variable in source data is balanced enough to generate model

prop.table(table(titanic_use$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

A detected no class imbalance among the target variables, so it’s confirmed that she can use this dataset.

Model Generation & Evaluation

Splitting data into train & test

A will split her dataset into train and test data, to ensure that her model would not be overfitted. She decided to use 80% of the randomly selected rows for generating the model, and the other 20% for testing her model.

library(rsample)
set.seed(921)
splitting <- initial_split(data = titanic_use, prop = 0.8,strata = Survived)
titanic_train <- training(splitting)
titanic_test <- testing(splitting)

prop.table(table(titanic_use$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(titanic_train$Survived))

## 
##         0         1 
## 0.6162465 0.3837535

prop.table(table(titanic_test$Survived))

## 
##         0         1 
## 0.6158192 0.3841808

First Method: Naive Bayes

The first methodology A use is Naive Bayes. A already convert all variable data type to factor, so she can use it to generate a prediction model. A assign the generated model to titanic_nb object

library(e1071)
titanic_nb <- naiveBayes(formula = Survived ~ . , data = titanic_use, laplace = 1)

Running the Model

After generating titanic_nb model, A applied it to her testing dataset. She use predict() function . Later, she assign them to nb_pred object.

nb_pred <- predict(titanic_nb, titanic_test)

Evaluating the Model

Confusion Matrix

A will evaluate her model by comparing how many correct predictions against the actual situation. A compared her prediction with confusionMatrix function

confusionMatrix(nb_pred,titanic_test$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 95 21
##          1 14 47
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7359, 0.8582)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 7.432e-08       
##                                           
##                   Kappa : 0.5738          
##                                           
##  Mcnemar's Test P-Value : 0.3105          
##                                           
##             Sensitivity : 0.6912          
##             Specificity : 0.8716          
##          Pos Pred Value : 0.7705          
##          Neg Pred Value : 0.8190          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2655          
##    Detection Prevalence : 0.3446          
##       Balanced Accuracy : 0.7814          
##                                           
##        'Positive' Class : 1               
##

As a rescuer planner, she need to deploy as little rescuer as possible while still optimizing the rescuing process. She focused more on her model precision, so she doesn’t want to misclassify those surviving into not survived category.

In above summary, her model only 77.05% valid in classifying the surviving passengers.

ROC curve / AUC value

A also re-evaluate her Naive-Bayes model using ROC curve and AUC value. A expected that

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

nb_pred_raw <- predict(titanic_nb, titanic_test, type = "raw")
head(nb_pred_raw)

##               0          1
## [1,] 0.07534514 0.92465486
## [2,] 0.94823511 0.05176489
## [3,] 0.69927932 0.30072068
## [4,] 0.58979035 0.41020965
## [5,] 0.03792225 0.96207775
## [6,] 0.60088792 0.39911208

roc_df <- data.frame("prediction"=nb_pred_raw[,2], "trueclass" = as.numeric(titanic_test$Survived))

titanic_roc <- ROCR::prediction(roc_df$prediction, roc_df$trueclass)  
plot(performance(titanic_roc, "tpr", "fpr"))

auc <- performance(titanic_roc, "auc")
auc@y.values[[1]]

## [1] 0.8668376

With ROC/AUC evaluation, her titanic_nb model generate a 0.865 score, which is slightly closer to 1 than 0.5

Second Method: Decision Tree

The second algorithm A used was Decision Tree. In order to maintain the dataset, A used the previously splitted data, then assigned the generated model to titanic_nb object.

library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

titanic_dt <- ctree(formula = Survived ~ . , data = titanic_train)

*SheA plotted the model for a better visualization that whowed the percentage of error in her prediction.

plot(titanic_dt, type = "simple", control = ctree_control(mincriterion = 0.98))

After plotting, A noticed that decision tree might not be a best predictor either, since they generated several error in classificating the survival status into buckets. Nevertheless, she continue confirming her model performance with confusion matrix.

Running the Model

After generating titanic_dt model, A applied it to her testing dataset. She use predict() function . Later, she assign them to dt_pred object.

dt_pred <- predict(titanic_dt, titanic_test)

Evaluating the Model

A compared her prediction with confusionMatrix

confusionMatrix(dt_pred, titanic_test$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 15
##          1 16 53
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6309          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.8532          
##          Pos Pred Value : 0.7681          
##          Neg Pred Value : 0.8611          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2994          
##    Detection Prevalence : 0.3898          
##       Balanced Accuracy : 0.8163          
##                                           
##        'Positive' Class : 1               
##

In above summary, her 2nd model only 76.81% valid in classifying the surviving passengers.

Summary

A compared the performance of her models as follows:

Naive Bayes, Precision: 77.05%
Decision Tree, Precision: 76.81%

In this case, Naive Bayes algorithm generates more precise model compared to Decision Tree.

Titanic Survival Prediction - 2

Dee

2/29/2020

Modelling Context & Purpose

Data Preparation

Data Inspection

Checking Missing Value

Age data exploration

Data Imputation

Ensuring that the variables are already in proper data type

Removing irrelevant variables

Exploratory Data Analysis

Checking Class Imbalance

Model Generation & Evaluation

Splitting data into train & test

First Method: Naive Bayes

Running the Model

Evaluating the Model

Confusion Matrix

ROC curve / AUC value

Second Method: Decision Tree

Running the Model

Evaluating the Model

Summary