This project analyzes which factors influenced the survival of passengers on the Titanic. The analysis uses three models:
1. Logistic Regression
2. Decision Tree
3. Support Vector Machine (SVM)
The three models are compared by accuracy to see which one best explains passenger survival on the Titanic.

2 Data Cleansing

># Observations: 1,309
># Variables: 12
># $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
># $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
># $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
># $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
># $ Sex         <fct> male, female, female, female, male, male, male, male, f...
># $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
># $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
># $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
># $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
># $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
># $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
># $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...

Variable descriptions:
- Survived = Survival (0 = No, 1 = Yes)
- Pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex = Sex (male, female)
- Age = Age in years
- SibSp = Number of siblings / spouses aboard the Titanic
- Parch = Number of parents / children aboard the Titanic
- Ticket = Ticket number
- Fare = Passenger fare
- Cabin = Cabin number
- Embarked = Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

First, we check each variable for missing values:

># PassengerId    Survived      Pclass        Name         Sex         Age 
>#           0         418           0           0           0         263 
>#       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
>#           0           0           0           1        1014           2

The result shows that several variables have missing values: Survived (418 NA), Age (263 NA), Fare (1 NA), Cabin (1014 NA), and Embarked (2 NA). I will handle them one by one.
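
A count like the one above can be produced in base R with `colSums(is.na(...))`. The snippet below is a minimal sketch on a toy data frame; the object name `df` and its values are made up for illustration and are not the original data.

```r
# Toy data frame standing in for the combined Titanic data (illustrative values).
df <- data.frame(
  Survived = c(0, 1, NA, 1),
  Age      = c(22, NA, 26, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(df))
print(na_counts)
```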

Beginning with the Age variable, I replace the missing ages with the mean age of all passengers on the Titanic, then group the ages into four categories (“0-12”, “13-17”, “18-59”, “>60” years) to simplify the analysis.

># 
>#   >60 00-12 13-17 18-59 
>#    40    94    60  1115
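
The two steps described above (mean imputation, then binning) might look roughly like this in base R. The vector `age` is toy data, and the exact cut points are an assumption: the text does not say which group an age of exactly 60 belongs to, so this sketch places it in ">60".

```r
# Toy ages with missing values (illustrative, not the original data).
age <- c(22, 38, NA, 2, 14, 65, NA, 60)

# Replace each missing age with the mean of the observed ages.
age_filled <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Bin into four groups; intervals are right-closed: (-Inf,12], (12,17], (17,59], (59,Inf).
# The break points are an assumption chosen to match the labels used in the text.
age_group <- cut(
  age_filled,
  breaks = c(-Inf, 12, 17, 59, Inf),
  labels = c("00-12", "13-17", "18-59", ">60")
)

table(age_group)
```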

For the Embarked variable, I replace the missing values with the most frequent value, Southampton (S). The first table below shows the counts before the replacement and the second shows the counts after it.

># 
>#   C   Q   S 
># 270 123 914
># 
>#   C   Q   S 
># 270 123 916
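
Replacing missing values with the most frequent category can be sketched as follows. The vector `embarked` is toy data; only the technique (mode imputation) is taken from the text.

```r
# Toy embarkation codes with two missing values (illustrative).
embarked <- c("S", "C", NA, "S", "Q", NA, "S")

# table() ignores NA by default; which.max() picks the most frequent category.
mode_value <- names(which.max(table(embarked)))

# Fill the missing entries with that category.
embarked_filled <- ifelse(is.na(embarked), mode_value, embarked)

table(embarked_filled)
```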

The Name variable contains both the name and the title of each passenger. I extract only the title into a new Title variable.

># 
>#         Capt          Col          Don         Dona           Dr     Jonkheer 
>#            1            4            1            1            8            1 
>#         Lady        Major       Master         Miss         Mlle          Mme 
>#            1            2           61          260            2            1 
>#           Mr          Mrs           Ms          Rev          Sir the Countess 
>#          757          197            2            8            1            1
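
Because names follow the pattern "Surname, Title. Given names", one way to pull out the title is a regular expression that removes everything up to the comma and everything from the first period onward. This is a sketch of that idea; the pattern is an assumption about how the extraction could be done, not necessarily the code used for the table above.

```r
# Example names in the "Surname, Title. Given names" layout (illustrative sample).
name <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# Remove "Surname, " at the start and ". Given names" at the end, leaving the title.
title <- gsub("(.*, )|(\\..*)", "", name)

print(title)
```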

Because there are too many titles, I merge some of them so that only five categories remain: Master, Miss, Mr, Mrs, and Rare Title.

># 
>#     Master       Miss         Mr        Mrs Rare Title 
>#         61        264        757        198         29

For family size, I create a Familysize variable with three categories (“1”, “2-5”, “>5”) to simplify the analysis.

># 
>#  >5   1 2-5 
>#  82 790 437
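
The text does not say how family size is computed. A common choice, assumed here, is SibSp + Parch + 1 (the passenger plus relatives aboard). The data listing later in this section shows a passenger with SibSp = 3 and Parch = 1 labelled ">5", which suggests that a family of five already falls in the top group, so the cut points in this sketch follow that reading. Both the formula and the cut points are assumptions.

```r
# Toy sibling/spouse and parent/child counts (illustrative).
sibsp <- c(1, 0, 3, 4, 0)
parch <- c(0, 0, 1, 2, 5)

# Assumed definition: the passenger plus all relatives aboard.
family_n <- sibsp + parch + 1

# Assumed cut points: 1 -> "1", 2 to 4 -> "2-5", 5 or more -> ">5".
family_size <- cut(
  family_n,
  breaks = c(0, 1, 4, Inf),
  labels = c("1", "2-5", ">5")
)

table(family_size)
```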

Recheck the missing values:

># PassengerId    Survived      Pclass        Name         Sex         Age 
>#           0         418           0           0           0           0 
>#       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
>#           0           0           0           1        1014           0 
>#   Age Group       Title  Familysize 
>#           0           0           0

Only Survived, Cabin, and Fare (1 NA) still have missing values. The rows with a missing Survived value are the test data used for prediction. The Cabin variable will not be used later because it is not very useful for the analysis.

Now, I will convert some of the variables to factors.

Check the class of each variable again:
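
Converting several columns to factors at once can be done with `lapply`. This is a small sketch on a toy data frame; which columns to convert is chosen here only for illustration.

```r
# Toy data frame (illustrative values).
df <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 3),
  Fare     = c(7.25, 71.2833, 7.925)
)

# Convert the selected columns to factors and leave the others untouched.
factor_cols <- c("Survived", "Pclass")
df[factor_cols] <- lapply(df[factor_cols], factor)

str(df)
```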

># Observations: 1,309
># Variables: 15
># $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
># $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
># $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
># $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
># $ Sex         <fct> male, female, female, female, male, male, male, male, f...
># $ Age         <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.88...
># $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
># $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
># $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
># $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
># $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
># $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...
># $ `Age Group` <fct> 18-59, 18-59, 18-59, 18-59, 18-59, 18-59, 18-59, 00-12,...
># $ Title       <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss,...
># $ Familysize  <fct> 2-5, 2-5, 1, 2-5, 1, 1, 1, >5, 2-5, 2-5, 2-5, 1, 1, >5,...

Now, I will discard the variables that are not used in the analysis and keep only the “Survived”, “Pclass”, “Sex”, “Age”, “Fare”, “Embarked”, “Age Group”, “Title”, and “Familysize” variables.

Then I separate the data into train and test sets again, as they were at the beginning.
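
Since the test rows are the ones with a missing Survived value, the split can be sketched by filtering on `is.na(Survived)`. The data frame `full` below is a toy stand-in, and splitting on missingness is an assumption about how the separation could be done.

```r
# Toy combined data: the last two rows play the role of unlabeled test data.
full <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 0, NA, NA),
  Pclass      = c(3, 1, 3, 1, 3, 2)
)

# Rows with a known outcome form the training set; the rest form the test set.
train <- full[!is.na(full$Survived), ]
test  <- full[is.na(full$Survived), ]

c(train = nrow(train), test = nrow(test))
```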

3 Exploratory Data Analysis

3.1 Alluvial Graph

I will make an alluvial graph to see in general how the variables relate to each other.

Notes :
- Green (passenger survived)
- Red (passenger did not survive)
The graph shows that females are more likely to survive than males.
By Pclass, the higher the ticket class, the higher the survival rate tends to be.
By Age Group, most passengers fall in the “18-59” category, and in that category more passengers appear to have died than survived.
The bar charts below look in more detail at how each variable relates to the survival rate.

3.2 Bar Chart

3.2.1 Class

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph supports the alluvial graph: the higher the passenger class, the higher the chance of surviving.

3.2.2 Sex

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph supports the alluvial graph: females are more likely to survive than males.

3.2.3 Age Group

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph shows that the older the age group, the lower the survival rate.

3.2.4 Embarked

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph shows that passengers who embarked at Cherbourg have a higher survival rate than those who embarked at Queenstown or Southampton.

3.2.5 Title

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph shows that passengers with the Mrs and Miss titles have a higher survival rate than those with the Master, Mr, and Rare Title titles.

3.2.6 Family Size

Note: Click a bar in the graph to see the survival and non-survival rates in more detail.

The graph shows an interesting fact: passengers travelling alone have a lower survival rate than passengers travelling with a family in the “2-5” group.

4 Machine Learning and Modelling

Before modelling the survival rate, we check the proportion of survivors and non-survivors to see whether the two classes are badly imbalanced.

># 
>#         0         1 
># 0.6161616 0.3838384 
># 
># 0 1 
># 

The result shows that about 62% of the passengers did not survive (0) and about 38% survived (1). The classes are not badly imbalanced, so the data is still acceptable to use.
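
Class proportions of this kind come from `prop.table(table(...))`. A minimal sketch on a toy outcome vector (the values are illustrative):

```r
# Toy outcome: 5 non-survivors (0) and 3 survivors (1).
survived <- factor(c(0, 0, 1, 0, 1, 0, 0, 1))

# table() counts each class; prop.table() turns the counts into shares.
class_share <- prop.table(table(survived))

print(class_share)
```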

4.1 Logistic Regression

First, run a stepwise selection to find the best model according to the Akaike Information Criterion (AIC).

># Start:  AIC=1188.66
># Survived ~ 1
># 
>#               Df Deviance     AIC
># + Title        4   886.59  896.59
># + Sex          1   917.80  921.80
># + Pclass       2  1083.11 1089.11
># + Familysize   2  1111.56 1117.56
># + Fare         1  1117.57 1121.57
># + Embarked     2  1161.29 1167.29
># + `Age Group`  3  1171.54 1179.54
># + Age          1  1182.21 1186.21
># <none>            1186.66 1188.66
># 
># Step:  AIC=896.59
># Survived ~ Title
># 
>#               Df Deviance     AIC
># + Pclass       2   784.43  798.43
># + Familysize   2   813.96  827.96
># + Fare         1   856.13  868.13
># + Embarked     2   866.08  880.08
># + Sex          1   879.37  891.37
># <none>             886.59  896.59
># + Age          1   885.61  897.61
># + `Age Group`  3   884.83  900.83
># - Title        4  1186.66 1188.66
># 
># Step:  AIC=798.43
># Survived ~ Title + Pclass
># 
>#               Df Deviance     AIC
># + Familysize   2   731.57  749.57
># + Embarked     2   775.98  793.98
># + Sex          1   778.55  794.55
># + Age          1   779.42  795.42
># <none>             784.43  798.43
># + Fare         1   784.42  800.42
># + `Age Group`  3   780.85  800.85
># - Pclass       2   886.59  896.59
># - Title        4  1083.11 1089.11
># 
># Step:  AIC=749.57
># Survived ~ Title + Pclass + Familysize
># 
>#               Df Deviance     AIC
># + Age          1   724.66  744.66
># + Sex          1   725.84  745.84
># + Fare         1   727.39  747.39
># <none>             731.57  749.57
># + Embarked     2   728.28  750.28
># + `Age Group`  3   727.08  751.08
># - Familysize   2   784.43  798.43
># - Pclass       2   813.96  827.96
># - Title        4  1039.15 1049.15
># 
># Step:  AIC=744.66
># Survived ~ Title + Pclass + Familysize + Age
># 
>#               Df Deviance     AIC
># + Sex          1   719.19  741.19
># + Fare         1   721.17  743.17
># <none>             724.66  744.66
># + Embarked     2   721.50  745.50
># - Age          1   731.57  749.57
># + `Age Group`  3   724.26  750.26
># - Familysize   2   779.42  795.42
># - Pclass       2   813.49  829.49
># - Title        4  1007.55 1019.55
># 
># Step:  AIC=741.19
># Survived ~ Title + Pclass + Familysize + Age + Sex
># 
>#               Df Deviance    AIC
># + Fare         1   715.59 739.59
># <none>             719.19 741.19
># + Embarked     2   715.84 741.84
># - Sex          1   724.66 744.66
># - Age          1   725.84 745.84
># + `Age Group`  3   718.84 746.84
># - Title        4   770.27 784.27
># - Familysize   2   773.78 791.78
># - Pclass       2   806.23 824.23
># 
># Step:  AIC=739.59
># Survived ~ Title + Pclass + Familysize + Age + Sex + Fare
># 
>#               Df Deviance    AIC
># <none>             715.59 739.59
># + Embarked     2   713.17 741.17
># - Fare         1   719.19 741.19
># - Sex          1   721.17 743.17
># - Age          1   721.58 743.58
># + `Age Group`  3   715.12 745.12
># - Pclass       2   757.89 777.89
># - Title        4   768.49 784.49
># - Familysize   2   773.78 793.78

Model :
\[\ln\left(\frac{P_{i}}{1-P_{i}}\right) = \beta_{0} + \beta_{1}Title_{i} + \beta_{2}Pclass_{i} + \beta_{3}Familysize_{i} + \beta_{4}Age_{i} + \beta_{5}Sex_{i} + \beta_{6}Fare_{i}\]
\(P_{i} = \Pr(Survive_{i} = 1)\), probability that passenger \(i\) survives
\(Survive_{i} = 1\), passenger survived
\(Survive_{i} = 0\), passenger did not survive
\(\ln\left(\frac{P_{i}}{1-P_{i}}\right)\), log odds (logit) of survival

Vector of Title :

\[Title_{1}|_{0=\text{otherwise}}^{1=\text{Miss}};\ Title_{2}|_{0=\text{otherwise}}^{1=\text{Mr}};\ Title_{3}|_{0=\text{otherwise}}^{1=\text{Mrs}};\ Title_{4}|_{0=\text{otherwise}}^{1=\text{Rare Title}}\]

Vector of Pclass :

\[Pclass_{1}|_{0=\text{otherwise}}^{1=2};\ Pclass_{2}|_{0=\text{otherwise}}^{1=3}\]

Vector of Family Size :

\[Familysize_{1}|_{0=\text{otherwise}}^{1=\text{1}};\ Familysize_{2}|_{0=\text{otherwise}}^{1=\text{2-5}}\]

Vector of Sex :

\[Sex_{1}|_{0=\text{otherwise}}^{1=\text{Male}}\]

># 
># Call:
># glm(formula = Survived ~ Title + Pclass + Familysize + Age + 
>#     Sex + Fare, family = binomial(link = logit), data = train_fix)
># 
># Deviance Residuals: 
>#     Min       1Q   Median       3Q      Max  
># -2.7087  -0.5234  -0.4035   0.5456   2.3736  
># 
># Coefficients:
>#                   Estimate Std. Error z value       Pr(>|z|)    
># (Intercept)      15.939130 503.790184   0.032       0.974760    
># TitleMiss       -15.897367 503.790126  -0.032       0.974827    
># TitleMr          -3.668417   0.572644  -6.406 0.000000000149 ***
># TitleMrs        -15.309605 503.790184  -0.030       0.975757    
># TitleRare Title  -3.692859   0.804132  -4.592 0.000004382714 ***
># Pclass2          -1.149886   0.319436  -3.600       0.000319 ***
># Pclass3          -2.016565   0.311918  -6.465 0.000000000101 ***
># Familysize1       3.193405   0.509536   6.267 0.000000000367 ***
># Familysize2-5     2.866580   0.475845   6.024 0.000000001700 ***
># Age              -0.022124   0.009188  -2.408       0.016047 *  
># Sexmale         -15.244466 503.789826  -0.030       0.975860    
># Fare              0.004617   0.002654   1.740       0.081934 .  
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
># 
># (Dispersion parameter for binomial family taken to be 1)
># 
>#     Null deviance: 1186.66  on 890  degrees of freedom
># Residual deviance:  715.59  on 879  degrees of freedom
># AIC: 739.59
># 
># Number of Fisher Scoring iterations: 13
># 
># fitting null model for pseudo-r2
>#          llh      llhNull           G2     McFadden         r2ML         r2CU 
># -357.7927263 -593.3275684  471.0696841    0.3969727    0.4106280    0.5579149

Analysis of the model :
1. Several terms are significant at p-value < 0.05: the Title levels Mr and Rare Title, both Pclass levels, both Familysize levels, and Age. The coefficients for TitleMiss, TitleMrs, and Sexmale have very large standard errors (about 503) and are not significant.
2. The pR2 function in the pscl package reports pseudo R-squared measures, including the McFadden R-squared, which plays a role similar to R-squared in linear regression. Here the McFadden value is 0.3970, meaning the model improves the log-likelihood by about 39.70% relative to the intercept-only model.
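
As a check on the pseudo R-squared, the McFadden value can be recomputed by hand from the two log-likelihoods printed above (`llh` and `llhNull`), using 1 - llh / llhNull. The variable names below are chosen for this sketch only.

```r
# Log-likelihoods copied from the pR2 output above.
llh      <- -357.7927263  # fitted model
llh_null <- -593.3275684  # intercept-only model

# McFadden pseudo R-squared.
mcfadden <- 1 - llh / llh_null

# Likelihood-ratio statistic, reported as G2 in the output.
g2 <- -2 * (llh_null - llh)

round(c(McFadden = mcfadden, G2 = g2), 4)
```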

Interpretation of the model :
First, we convert coefficients from the log-odds scale to the probability scale using the inverse logit. I convert only the Title, Pclass, and Familysize terms that have a significant p-value < 0.05.

># [1] 0.01693142
># [1] 0.01384592
># [1] 0.2779434
># [1] 0.1464005
># [1] 0.9572399
># [1] 0.9465708

Reading the converted values (each value comes from a single term, so it leaves out the intercept and the other predictors; it indicates direction and strength relative to the reference category rather than a predicted survival probability for a specific passenger):
1. The value for passengers with the Mr title is 0.0169 (1.69%)
2. The value for passengers with a Rare Title is 0.0138 (1.38%)
3. The value for passengers in the second ticket class is 0.2779 (27.79%)
4. The value for passengers in the third ticket class is 0.1464 (14.64%)
5. The value for passengers travelling alone is 0.9572 (95.72%)
6. The value for passengers travelling with a family in the “2-5” group is 0.9466 (94.66%)
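
To get a predicted survival probability for a particular passenger, the inverse logit has to be applied to the whole linear predictor (intercept plus every relevant term), not to one coefficient. The sketch below uses coefficients copied from the glm summary above; the passenger profile is hypothetical and chosen only to illustrate the calculation.

```r
# Coefficients copied from the glm summary above.
b <- c(
  intercept   = 15.939130,
  TitleMr     = -3.668417,
  Pclass3     = -2.016565,
  Familysize1 =  3.193405,
  Age         = -0.022124,
  Sexmale     = -15.244466,
  Fare        =  0.004617
)

# Hypothetical passenger: a 30-year-old man titled Mr, third class, travelling alone, fare 8.
eta <- b[["intercept"]] + b[["TitleMr"]] + b[["Pclass3"]] + b[["Familysize1"]] +
  b[["Age"]] * 30 + b[["Sexmale"]] + b[["Fare"]] * 8

# plogis() is the inverse logit: 1 / (1 + exp(-eta)).
p_hat <- plogis(eta)

# exp() of a single coefficient is an odds ratio against the reference category
# (here: third class versus first class, other predictors held constant).
or_pclass3 <- exp(b[["Pclass3"]])

round(c(probability = p_hat, odds_ratio_pclass3 = or_pclass3), 4)
```

For this profile the predicted probability is roughly 0.08, and the odds of survival in third class are roughly 0.13 times the odds in first class.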

Now I assess how well the model performs, using the confusion matrix (for accuracy) and the ROC curve.

4.1.1 Confusion Matrix

># Confusion Matrix and Statistics
># 
>#           Reference
># Prediction   0   1
>#          0 486  89
>#          1  63 253
>#                                               
>#                Accuracy : 0.8294              
>#                  95% CI : (0.8031, 0.8535)    
>#     No Information Rate : 0.6162              
>#     P-Value [Acc > NIR] : < 0.0000000000000002
>#                                               
>#                   Kappa : 0.6341              
>#                                               
>#  Mcnemar's Test P-Value : 0.04258             
>#                                               
>#             Sensitivity : 0.7398              
>#             Specificity : 0.8852              
>#          Pos Pred Value : 0.8006              
>#          Neg Pred Value : 0.8452              
>#              Prevalence : 0.3838              
>#          Detection Rate : 0.2840              
>#    Detection Prevalence : 0.3547              
>#       Balanced Accuracy : 0.8125              
>#                                               
>#        'Positive' Class : 1                   
># 

Accuracy = 0.8294
Recall/Sensitivity = 0.7398
Precision = 0.8006
Specificity = 0.8852

Logistic regression performs fairly well, with an accuracy of 0.8294 on the training data.
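
The four summary figures can be recomputed directly from the counts in the confusion matrix, with 1 (survived) as the positive class. This is a base R sketch of the arithmetic; the object names are chosen for illustration.

```r
# Counts copied from the logistic regression confusion matrix above.
# Rows are predictions, columns are reference (true) classes; matrix() fills by column.
cm <- matrix(
  c(486, 63, 89, 253),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  precision   = tp / (tp + fp),
  specificity = tn / (tn + fp)
)

round(metrics, 4)
```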

4.1.2 Receiver Operating Characteristic (ROC) Curve

The ROC curve describes the relationship between Recall/Sensitivity and the False Positive Rate (1 - Specificity) across classification thresholds.

># Formal class 'performance' [package "ROCR"] with 6 slots
>#   ..@ x.name      : chr "None"
>#   ..@ y.name      : chr "Area under the ROC curve"
>#   ..@ alpha.name  : chr "none"
>#   ..@ x.values    : list()
>#   ..@ y.values    :List of 1
>#   .. ..$ : num 0.88
>#   ..@ alpha.values: list()
># [1] 0.8797521

The area under the ROC curve (AUC) is 0.8798, which means that logistic regression distinguishes well between the positive and negative classes.
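
The output above comes from the ROCR package. For intuition, the AUC can also be computed in base R from ranks: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function `auc_rank` and the toy data below are illustrative and not part of the original analysis.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic divided by n_pos * n_neg).
auc_rank <- function(labels, scores) {
  r <- rank(scores)  # tied scores receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one of the four positive-negative pairs is ordered wrongly.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_rank(labels, scores)  # 0.75
```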

4.2 Model Comparison

In this part, I compare Logistic Regression with Decision Tree and Support Vector Machine (SVM) to see which model best explains survival on the Titanic.

4.2.1 Decision Tree

Because the picture is a bit messy, I will try to make it neater.

4.2.1.1 Confusion Matrix

># Confusion Matrix and Statistics
># 
>#           Reference
># Prediction   0   1
>#          0 492  93
>#          1  57 249
>#                                                
>#                Accuracy : 0.8316               
>#                  95% CI : (0.8054, 0.8557)     
>#     No Information Rate : 0.6162               
>#     P-Value [Acc > NIR] : < 0.00000000000000022
>#                                                
>#                   Kappa : 0.6369               
>#                                                
>#  Mcnemar's Test P-Value : 0.004267             
>#                                                
>#             Sensitivity : 0.7281               
>#             Specificity : 0.8962               
>#          Pos Pred Value : 0.8137               
>#          Neg Pred Value : 0.8410               
>#              Prevalence : 0.3838               
>#          Detection Rate : 0.2795               
>#    Detection Prevalence : 0.3434               
>#       Balanced Accuracy : 0.8121               
>#                                                
>#        'Positive' Class : 1                    
># 

Accuracy = 0.8316
Recall/Sensitivity = 0.7281
Precision = 0.8137
Specificity = 0.8962

The decision tree also performs fairly well, with an accuracy of 0.8316, slightly higher than logistic regression.

4.2.1.2 ROC Curve

># Formal class 'performance' [package "ROCR"] with 6 slots
>#   ..@ x.name      : chr "None"
>#   ..@ y.name      : chr "Area under the ROC curve"
>#   ..@ alpha.name  : chr "none"
>#   ..@ x.values    : list()
>#   ..@ y.values    :List of 1
>#   .. ..$ : num 0.873
>#   ..@ alpha.values: list()
># [1] 0.8726685

The area under the ROC curve (AUC) is 0.8727, slightly lower than that of logistic regression.

4.2.2 Support Vector Machine (SVM)

>#  Setting default kernel parameters  
># Support Vector Machine object of class "ksvm" 
># 
># SV type: C-svc  (classification) 
>#  parameter : cost C = 1 
># 
># Linear (vanilla) kernel function. 
># 
># Number of Support Vectors : 371 
># 
># Objective Function Value : -318.5 
># Training error : 0.171717
># [1] 0 1 1 1 0 0
># Levels: 0 1
># Confusion Matrix and Statistics
># 
>#           Reference
># Prediction   0   1
>#          0 492  96
>#          1  57 246
>#                                                
>#                Accuracy : 0.8283               
>#                  95% CI : (0.8019, 0.8525)     
>#     No Information Rate : 0.6162               
>#     P-Value [Acc > NIR] : < 0.00000000000000022
>#                                                
>#                   Kappa : 0.629                
>#                                                
>#  Mcnemar's Test P-Value : 0.002125             
>#                                                
>#             Sensitivity : 0.8962               
>#             Specificity : 0.7193               
>#          Pos Pred Value : 0.8367               
>#          Neg Pred Value : 0.8119               
>#              Prevalence : 0.6162               
>#          Detection Rate : 0.5522               
>#    Detection Prevalence : 0.6599               
>#       Balanced Accuracy : 0.8077               
>#                                                
>#        'Positive' Class : 0                    
># 

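The SVM output above treats 0 as the positive class, so its Sensitivity (0.8962) and Specificity (0.7193) are swapped relative to the logistic regression and decision tree outputs. The values below are restated with 1 (survived) as the positive class so that the three models can be compared directly.

Accuracy = 0.8283
Recall/Sensitivity = 0.7193
Precision = 0.8119
Specificity = 0.8962

SVM has an accuracy of 0.8283, slightly lower than both Logistic Regression and Decision Tree.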

5 Conclusions

LR Accuracy : 0.8294
DT Accuracy : 0.8316
SVM Accuracy : 0.8283

  1. The accuracy values of the three models (Logistic Regression, Decision Tree, and SVM) are very close, and so are the AUC values of Logistic Regression and Decision Tree. All three models fit the data well.
  2. The logistic regression model shows that survival on the Titanic depends strongly on several predictors, such as Title, passenger class, and family size. In particular, passengers from a lower class and passengers with a larger family on board were less likely to survive, holding the other predictors constant.