Modelling Context & Purpose

A is a planning officer in a rescue organisation. Upon the news of the Titanic sinking, she is required to determine how many rescuers her organisation should deploy to the site. The number of rescuers is decided by the number of passengers who likely survived, according to the ship's manifest data.

A will use a generalized linear model (GLM) to generate the prediction model.

Data Preparation

Before creating the model, A needs to ensure that her data is ready.

Data Inspection

Checking missing values
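A sketch of the check, assuming the raw manifest has been read into a titanic data frame (the file path is an assumption):

    # Load the manifest and check for missing values
    titanic <- read.csv("titanic.csv", stringsAsFactors = TRUE)

    anyNA(titanic)          # TRUE if any cell is NA
    colSums(is.na(titanic)) # count of NA values per column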

## [1] TRUE
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

A decided that Age is relevant to the target variable, hence she needs to impute the missing rows in the Age variable with the mean or median. A needs to explore the Age data first.

Age data exploration

The histogram of Age shows that the data is fairly balanced (approximately normally distributed). She then confirmed the exact median and mean of the Age variable with the summary() function.
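A sketch of both steps, assuming the titanic object from above:

    # Visualize the Age distribution, then summarize it numerically
    hist(titanic$Age, main = "Distribution of Age", xlab = "Age")
    summary(titanic$Age)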

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

The median and mean are within close range of each other, so either value is acceptable. She will impute the missing values in the Age variable with its mean.

Data Imputation

A saved the imputed dataset into the titanic_clean object.
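A minimal sketch of the imputation, assuming dplyr is available (the author's exact code is not shown):

    # Replace missing Age values with the mean age of the observed passengers
    library(dplyr)

    titanic_clean <- titanic %>%
      mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age))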

To ensure the imputation worked, she checked the data summary again.

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:22.00  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :29.70  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:35.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                               
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

Ensuring that the variables are already in the proper data types

A checked the updated data structure with the str() function.
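The check itself is a single call:

    str(titanic_clean)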

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Variable explanation

Survived : passenger survival status (0 = did not survive, 1 = survived)

Pclass : passenger ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

Sex : passenger sex

Age : passenger age in years

SibSp : the number of siblings/spouses of the passenger aboard the Titanic

Parch : the number of parents/children of the passenger aboard the Titanic

Ticket : ticket number

Fare : passenger fare

Cabin : cabin number

Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Removing irrelevant variables

A identified 4 variables that carry no information for deciding whether a passenger would survive or not:

  • PassengerId

  • Ticket

  • Name

  • Cabin

So she removed them from the dataset to speed up the modelling process.
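A sketch of the removal (dplyr assumed):

    # Drop the identifier-like columns, which carry no predictive signal
    titanic_clean <- titanic_clean %>%
      select(-PassengerId, -Ticket, -Name, -Cabin)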

Exploratory Data Analysis

Checking Class Imbalance
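A sketch of the check:

    # Proportion of each class in the target variable
    prop.table(table(titanic_clean$Survived))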

## 
##         0         1 
## 0.6161616 0.3838384

A detected no severe class imbalance in the target variable (roughly 62:38), so she confirmed that she can use this dataset as-is.

Checking for Perfect Separation among Variables
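A sketch of the cross-tabulations, one table per predictor (the loop is an assumption):

    # Cross-tabulate Survived against each predictor; a row or column of
    # zeros would signal a level that perfectly separates the classes
    for (col in c("Pclass", "Sex", "Age", "SibSp", "Parch")) {
      print(table(titanic_clean$Survived, titanic_clean[[col]]))
    }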

##    
##       1   2   3
##   0  80  97 372
##   1 136  87 119
##    
##     female male
##   0     81  468
##   1    233  109
##    
##     0.42 0.67 0.75 0.83 0.92   1   2   3   4   5   6   7   8   9  10  11  12
##   0    0    0    0    0    0   2   7   1   3   0   1   2   2   6   2   3   0
##   1    1    1    2    2    1   5   3   5   7   4   2   1   2   2   0   1   1
##    
##      13  14 14.5  15  16  17  18  19  20 20.5  21  22  23 23.5  24 24.5  25  26
##   0   0   3    1   1  11   7  17  16  12    1  19  16  10    1  15    1  17  12
##   1   2   3    0   4   6   6   9   9   3    0   5  11   5    0  15    0   6   6
##    
##      27  28 28.5  29 29.6991176470588  30 30.5  31  32 32.5  33  34 34.5  35
##   0   7  18    2  12              125  15    2   9   9    1   9   9    1   7
##   1  11   7    0   8               52  10    0   8   9    1   6   6    0  11
##    
##      36 36.5  37  38  39  40 40.5  41  42  43  44  45 45.5  46  47  48  49  50
##   0  11    1   5   6   9   7    2   4   7   4   6   7    2   3   8   3   2   5
##   1  11    0   1   5   5   6    0   2   6   1   3   5    0   0   1   6   4   5
##    
##      51  52  53  54  55 55.5  56  57  58  59  60  61  62  63  64  65  66  70
##   0   5   3   0   5   1    1   2   2   2   2   2   3   2   0   2   3   1   2
##   1   2   3   1   3   1    0   2   0   3   0   2   0   2   2   0   0   0   0
##    
##     70.5  71  74  80
##   0    1   2   1   0
##   1    0   0   0   1
##    
##       0   1   2   3   4   5   8
##   0 398  97  15  12  15   5   7
##   1 210 112  13   4   3   0   0
##    
##       0   1   2   3   4   5   6
##   0 445  53  40   2   4   4   1
##   1 233  65  40   3   0   1   0

No variable perfectly separates the survival outcome, so A reckons that she can use all of the remaining variables.

Model Generation

Splitting data into train & test

A will split her dataset into train and test data so that she can check her model for overfitting. She decided to use 80% of randomly selected rows for fitting the model, and the remaining 20% for testing it.
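One common way to do the split; the seed and exact sampling method are assumptions (the author's split yields 714 train and 177 test rows, so her method may round slightly differently):

    # 80/20 random split into train and test sets
    set.seed(100)
    idx <- sample(nrow(titanic_clean), size = round(0.8 * nrow(titanic_clean)))
    titanic_train <- titanic_clean[idx, ]
    titanic_test  <- titanic_clean[-idx, ]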

She confirmed that the split datasets have a similar target-variable proportion by comparing the proportion of each class.
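A sketch of the comparison (full data, then train, then test):

    prop.table(table(titanic_clean$Survived))
    prop.table(table(titanic_train$Survived))
    prop.table(table(titanic_test$Survived))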

## 
##         0         1 
## 0.6161616 0.3838384
## 
##         0         1 
## 0.6162465 0.3837535
## 
##         0         1 
## 0.6158192 0.3841808

All of the split datasets indeed have similar target proportions.

Confirming the correlation between variables

All variables have varying degrees of correlation with the target variable. However, A reckons that all of them might be relevant in deciding the survival of the passengers, hence she removes no variable at this point.

Generating the model

A will use the glm() function to fit a binomial (logistic regression) classification model on all available variables.
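The fit, as echoed in the Call line of the output below:

    # Logistic regression on all remaining predictors
    titanic_model <- glm(Survived ~ ., family = "binomial",
                         data = titanic_train)
    summary(titanic_model)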

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5739  -0.6283  -0.4407   0.6738   2.3670  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  16.432864 535.411599   0.031  0.97552    
## Pclass       -0.950849   0.156136  -6.090 1.13e-09 ***
## Sexmale      -2.658113   0.220486 -12.056  < 2e-16 ***
## Age          -0.034028   0.008724  -3.901 9.60e-05 ***
## SibSp        -0.376437   0.122142  -3.082  0.00206 ** 
## Parch        -0.012432   0.133808  -0.093  0.92597    
## Fare          0.002422   0.002569   0.943  0.34576    
## EmbarkedC   -11.615881 535.411299  -0.022  0.98269    
## EmbarkedQ   -11.803506 535.411373  -0.022  0.98241    
## EmbarkedS   -12.104306 535.411273  -0.023  0.98196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 651.82  on 704  degrees of freedom
## AIC: 671.82
## 
## Number of Fisher Scoring iterations: 12

Her first model has an AIC of 671.82, and she wonders if she can improve it by eliminating some variables. She uses backward feature selection in search of a lower (better) AIC.
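A sketch of the selection, assigning the result to the titanic_md_fin object described below:

    # Backward stepwise elimination, minimizing AIC
    titanic_md_fin <- step(titanic_model, direction = "backward")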

## Start:  AIC=671.82
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Parch     1   651.83 669.83
## - Embarked  3   656.04 670.04
## - Fare      1   652.79 670.79
## <none>          651.82 671.82
## - SibSp     1   662.97 680.97
## - Age       1   667.86 685.86
## - Pclass    1   688.89 706.89
## - Sex       1   833.87 851.87
## 
## Step:  AIC=669.83
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  3   656.10 668.10
## - Fare      1   652.80 668.80
## <none>          651.83 669.83
## - SibSp     1   664.42 680.42
## - Age       1   667.87 683.87
## - Pclass    1   689.72 705.72
## - Sex       1   841.28 857.28
## 
## Step:  AIC=668.1
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   657.97 667.97
## <none>        656.10 668.10
## - SibSp   1   670.81 680.81
## - Age     1   672.55 682.55
## - Pclass  1   696.67 706.67
## - Sex     1   853.59 863.59
## 
## Step:  AIC=667.97
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        657.97 667.97
## - SibSp   1   671.12 679.12
## - Age     1   675.21 683.21
## - Pclass  1   731.96 739.96
## - Sex     1   860.28 868.28
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Coefficients:
## (Intercept)       Pclass      Sexmale          Age        SibSp  
##     4.79907     -1.05618     -2.69570     -0.03484     -0.37588  
## 
## Degrees of Freedom: 713 Total (i.e. Null);  709 Residual
## Null Deviance:       950.9 
## Residual Deviance: 658   AIC: 668

The backward feature selection generates a model with an AIC of 667.97. Since this score is slightly lower (better) than the initial model's 671.82, she decides to use this model, assigned to the titanic_md_fin object.

Running the Model

After generating a model with a better AIC, A applied her model to the testing dataset. She used the predict() function with type = "response" to generate the survival probabilities, and assigned them to the pred_survived column.

She added another column, class_survived, to record which class each predicted probability falls into.
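A sketch of both steps (the 0.5 cutoff is an assumption):

    # Predicted survival probability, then a hard classification at 0.5
    titanic_test$pred_survived  <- predict(titanic_md_fin,
                                           newdata = titanic_test,
                                           type = "response")
    titanic_test$class_survived <- ifelse(titanic_test$pred_survived > 0.5, 1, 0)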

Evaluating the Model

A will evaluate her model by comparing the predictions (the class_survived column) against the actual outcomes (the Survived column). Before continuing with the comparison, A converted both columns into the factor data type.

A compared her predictions with confusionMatrix().
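A sketch of the evaluation, assuming caret's confusionMatrix() (the 'Positive' Class : 1 line in the output implies positive = "1"):

    library(caret)

    confusionMatrix(data      = factor(titanic_test$class_survived),
                    reference = factor(titanic_test$Survived),
                    positive  = "1")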

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 92 14
##          1 17 54
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6329          
##                                           
##  Mcnemar's Test P-Value : 0.7194          
##                                           
##             Sensitivity : 0.7941          
##             Specificity : 0.8440          
##          Pos Pred Value : 0.7606          
##          Neg Pred Value : 0.8679          
##              Prevalence : 0.3842          
##          Detection Rate : 0.3051          
##    Detection Prevalence : 0.4011          
##       Balanced Accuracy : 0.8191          
##                                           
##        'Positive' Class : 1               
## 

As a rescue planner, she needs to deploy as few rescuers as possible while still covering the actual survivors. She therefore focuses on her model's precision (Pos Pred Value): of the passengers predicted to survive, how many actually survived. Low precision means rescuers are deployed for passengers who were misclassified as survivors.

In the summary above, her model's precision in classifying the surviving passengers is only 76.06%. Hence, she will tune her model by:

  1. Adding back the variables previously removed during feature selection
  2. Upsampling and/or downsampling her training data
  3. Trying the K-NN method

Tuning the Model

Attempt 1 - Changing the Variables

So as not to disturb the previous model, A assigned her test data to a new object, titanic_test1.

She decided to add back all variables previously removed during feature selection. She used her earlier model titanic_model, which includes all variables (except the identifier columns dropped at the start), and replaced the pred_survived column with the probabilities generated from titanic_model.

The class_survived column is also replaced with the new classifications generated from the above probabilities.
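A sketch of the attempt (0.5 cutoff assumed, as before):

    # Copy the test set, then re-predict with the full model
    titanic_test1 <- titanic_test
    titanic_test1$pred_survived  <- predict(titanic_model,
                                            newdata = titanic_test1,
                                            type = "response")
    titanic_test1$class_survived <- ifelse(titanic_test1$pred_survived > 0.5,
                                           1, 0)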

Re-evaluating the Model

As before, A converted the prediction (class_survived) and actual (Survived) columns into factors, then compared them with confusionMatrix().

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 15
##          1 16 53
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6309          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.8532          
##          Pos Pred Value : 0.7681          
##          Neg Pred Value : 0.8611          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2994          
##    Detection Prevalence : 0.3898          
##       Balanced Accuracy : 0.8163          
##                                           
##        'Positive' Class : 1               
## 

This model generates a slightly better precision score (76.81% compared to 76.06%). However, the improvement is too small, so she decided to upsample her titanic_train dataset.

Attempt 2 - Upsampling the dataset

To ensure that the target variable (the Survived column) is in the correct format (factor), the mutated dataset is assigned to a new object, titanic_train2.

Upsampling the train data
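A sketch using caret::upSample(), which resamples the minority class until both classes are equal in size (the object names follow the surrounding text; the exact call is an assumption):

    # upSample() needs a factor target; keep the original column name
    titanic_train2 <- titanic_train
    titanic_train2$Survived <- as.factor(titanic_train2$Survived)

    titanic_train_up <- upSample(x = subset(titanic_train2, select = -Survived),
                                 y = titanic_train2$Survived,
                                 yname = "Survived")

    # Class proportions before and after upsampling
    prop.table(table(titanic_train2$Survived))
    prop.table(table(titanic_train_up$Survived))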

## 
##         0         1 
## 0.6162465 0.3837535
## 
##   0   1 
## 0.5 0.5

Re-generating the Model using the Upsampled Train Data

A again uses the glm() function to fit a classification model on all available variables, this time on the upsampled train data.
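A sketch of the refit (the model object name is an assumption):

    # Refit the full logistic regression on the balanced train data
    titanic_model_up <- glm(Survived ~ ., family = "binomial",
                            data = titanic_train_up)
    summary(titanic_model_up)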

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7279  -0.6785  -0.0694   0.6893   2.1890  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  16.225169 535.411500   0.030  0.97582    
## Pclass       -0.878612   0.138607  -6.339 2.31e-10 ***
## Sexmale      -2.679176   0.197648 -13.555  < 2e-16 ***
## Age          -0.032568   0.007430  -4.383 1.17e-05 ***
## SibSp        -0.389671   0.109863  -3.547  0.00039 ***
## Parch         0.009659   0.121404   0.080  0.93658    
## Fare          0.002984   0.002707   1.102  0.27035    
## EmbarkedC   -11.097147 535.411274  -0.021  0.98346    
## EmbarkedQ   -11.413284 535.411327  -0.021  0.98299    
## EmbarkedS   -11.667092 535.411249  -0.022  0.98261    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1219.94  on 879  degrees of freedom
## Residual deviance:  829.28  on 870  degrees of freedom
## AIC: 849.28
## 
## Number of Fisher Scoring iterations: 12

Running & Re-evaluating the Model

A applied the new model to her testing dataset, again using predict() with type = "response" to generate the survival probabilities assigned to the pred_survived column. (Note that the new AIC of 849.28 is not directly comparable to the earlier models', since the training data changed.)

As before, A converted the prediction and actual columns into factors, then compared them with confusionMatrix().

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 13
##          1 21 55
##                                           
##                Accuracy : 0.8079          
##                  95% CI : (0.7421, 0.8632)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 2.854e-08       
##                                           
##                   Kappa : 0.6028          
##                                           
##  Mcnemar's Test P-Value : 0.2299          
##                                           
##             Sensitivity : 0.8088          
##             Specificity : 0.8073          
##          Pos Pred Value : 0.7237          
##          Neg Pred Value : 0.8713          
##              Prevalence : 0.3842          
##          Detection Rate : 0.3107          
##    Detection Prevalence : 0.4294          
##       Balanced Accuracy : 0.8081          
##                                           
##        'Positive' Class : 1               
## 

Again, A focuses on her model's precision, so that rescuers are not deployed for passengers misclassified as survivors.

In the summary above, the upsampled model's precision drops to 72.37%, worse than both previous attempts.

Attempt 3 - K-NN method

Data pre-processing

Before classifying the data into two classes (survived and not survived), A needed to scale her data and then split the variables into predictors and target, for both the in-sample (train) and testing data.

Scaling the Dataset

## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Scaling the Predictor Variables

To ensure all variables contribute with a similar weight during classification, A standardized the variables using z-score standardization with the scale() function.
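A sketch of the standardization for the numeric predictors. How the factor columns (Sex, Embarked) were encoded for K-NN is not shown, so they are omitted here; the test set reuses the train set's centre and scale so both live in the same feature space:

    # Z-score scale the train predictors, then apply the same parameters
    # to the test predictors
    num_cols <- c("Pclass", "Age", "SibSp", "Parch", "Fare")
    train_x  <- scale(titanic_train[, num_cols])
    test_x   <- scale(titanic_test[, num_cols],
                      center = attr(train_x, "scaled:center"),
                      scale  = attr(train_x, "scaled:scale"))
    train_y  <- factor(titanic_train$Survived)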

Identifying the k-optimum value
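A common starting point is the square root of the number of training rows:

    sqrt(nrow(titanic_train))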

## [1] 26.72078

Based on the computation above, A picked k = 27, rounding up to the nearest odd number: an odd k avoids tie votes, since the number of target classes is even (2: survived and not survived).

Running the Classifying Process with K-NN
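A sketch of the classification with class::knn(), reusing the hypothetical train_x / test_x / train_y objects from the scaling sketch above:

    library(class)

    # K-NN prediction with k = 27, evaluated with the same confusion matrix
    pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 27)
    confusionMatrix(data = pred_knn,
                    reference = factor(titanic_test$Survived),
                    positive = "1")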

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 15
##          1 16 53
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6309          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.8532          
##          Pos Pred Value : 0.7681          
##          Neg Pred Value : 0.8611          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2994          
##    Detection Prevalence : 0.3898          
##       Balanced Accuracy : 0.8163          
##                                           
##        'Positive' Class : 1               
## 

Summary

A compared the performance of her models as follows:

  • Original model, with variables as follows:

    • Pclass

    • Sex

    • Age

    • SibSp

    Precision: 76.06%

  • Tuning #1: Adding the number of parents/children (Parch), port of embarkation (Embarked), and fare paid (Fare) back into the model. Precision: 76.81%

  • Tuning #2: Upsampling the train (in-sample) data. Precision: 72.37%

  • Tuning #3: K-NN classification. Precision: 76.81%

None of the models provides satisfactory performance. Therefore, A concluded that generalized linear models and K-NN classification might not be suitable for this dataset.