LBB Logistic Regression & k-NN using Titanic Kaggle Train Dataset

Since the train and test data didn’t have the same variables, I decided to use only train data for Machine Learning Classification LBB. For the Question, I want to know which variables contributes significantly to make passenger survived.

Library & Setup

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ stringr 1.4.0
## ✓ tidyr   1.0.2     ✓ forcats 0.4.0
## ✓ readr   1.3.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Logistic Regression

Data Import

titanic <- read.csv("/Users/dinnah/Documents/ALGORITMA Machine Learning/LBB Machine Learning/titanic/train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Information: PassengerId: Passenger’s Identification Number
Survived: Passenger’s survival, 1 = Yes, 0 = No
Pclass: Socio Economy Status, 1st = Higher, 2nd = Middle, 3rd = Lower
Name: Passenger’s name
Sex: Passenger’s gender
Age: Passenger’s age
SibSp: sibsp: The dataset defines family relations in this way
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: parch: The dataset defines family relations in this way
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them
Ticket: Passenger’s ticket ID
Fare: The fare of the ticket
Cabin: Cabin number
Embarked: Port of emabarkation. C = Cherbourg, Q = Queenstown, S = Southampton

Data Manipulation

# Data Wrangling
titanic_fin <- titanic %>% 
  dplyr::select(-c(PassengerId, Name, Cabin, Ticket)) %>% 
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass)) %>% 
  mutate(Age = replace_na(Age, replace = mean(Age, na.rm = T)))
str(titanic_fin)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
colSums(is.na(titanic_fin))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

Since the PassengerID, Name, Cabin, and Ticket has unique levels and aren’t meaningful, we take out those variables. The Age data represents only around 81% of train data so we fill the missing value with the mean data.

Exploratory Data Analysis

Let see if the proportion of eah class is considered balance for our Classification Model

prop.table(table(titanic_fin$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

I consider it is balance.

Splitting the data

set.seed(777)
intrain <- sample(nrow(titanic_fin), nrow(titanic_fin)*0.8)
titanic_train <- titanic_fin[intrain,]
titanic_test <- titanic_fin[-intrain,]
table(titanic_train$Survived)
## 
##   0   1 
## 443 269

Create Model

Model with all variables

titanic_m_all <- glm(Survived~., titanic_train, family = "binomial")
summary(titanic_m_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3771  -0.5866  -0.4158   0.6094   2.4470  
## 
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  14.931596 535.411331   0.028              0.97775    
## Pclass2      -1.067305   0.340786  -3.132              0.00174 ** 
## Pclass3      -2.181663   0.340487  -6.407       0.000000000148 ***
## Sexmale      -2.740152   0.225756 -12.138 < 0.0000000000000002 ***
## Age          -0.042590   0.009052  -4.705       0.000002536299 ***
## SibSp        -0.354001   0.122614  -2.887              0.00389 ** 
## Parch        -0.083305   0.132741  -0.628              0.53028    
## Fare          0.003161   0.002766   1.143              0.25313    
## EmbarkedC   -10.953789 535.411266  -0.020              0.98368    
## EmbarkedQ   -10.822691 535.411350  -0.020              0.98387    
## EmbarkedS   -11.061172 535.411257  -0.021              0.98352    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 944.09  on 711  degrees of freedom
## Residual deviance: 617.87  on 701  degrees of freedom
## AIC: 639.87
## 
## Number of Fisher Scoring iterations: 12

From this summary when we create model with all variables, the Parch factor as well as Fare and Embarkation are not significant. Let us see with Feature Engineering with Stepwise.

Feature Engineering for create more accurate model.

step(titanic_m_all, direction = "backward")
## Start:  AIC=639.87
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  3   618.52 634.52
## - Parch     1   618.27 638.27
## - Fare      1   619.34 639.34
## <none>          617.87 639.87
## - SibSp     1   627.95 647.95
## - Age       1   642.11 662.11
## - Pclass    2   663.39 681.39
## - Sex       1   803.56 823.56
## 
## Step:  AIC=634.52
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
## 
##          Df Deviance    AIC
## - Parch   1   619.05 633.05
## - Fare    1   620.21 634.21
## <none>        618.52 634.52
## - SibSp   1   629.32 643.32
## - Age     1   642.99 656.99
## - Pclass  2   665.53 677.53
## - Sex     1   812.87 826.87
## 
## Step:  AIC=633.05
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   620.43 632.43
## <none>        619.05 633.05
## - SibSp   1   632.89 644.89
## - Age     1   643.38 655.38
## - Pclass  2   668.62 678.62
## - Sex     1   815.52 827.52
## 
## Step:  AIC=632.43
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        620.43 632.43
## - SibSp   1   633.16 643.16
## - Age     1   645.99 655.99
## - Pclass  2   708.40 716.40
## - Sex     1   821.16 831.16
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
##     4.17317     -1.28645     -2.39060     -2.75570     -0.04331     -0.37014  
## 
## Degrees of Freedom: 711 Total (i.e. Null);  706 Residual
## Null Deviance:       944.1 
## Residual Deviance: 620.4     AIC: 632.4

using step backwards, the optimum AIC (this means, the smallest value of AIC), is using the variables: Pclass, Sex, Age, and SibSp.

titanic_m_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
    data = titanic_train)
summary(titanic_m_back)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3597  -0.5797  -0.4058   0.6045   2.4692  
## 
## Coefficients:
##              Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  4.173168   0.459195   9.088 < 0.0000000000000002 ***
## Pclass2     -1.286446   0.296515  -4.339           0.00001434 ***
## Pclass3     -2.390600   0.277284  -8.621 < 0.0000000000000002 ***
## Sexmale     -2.755700   0.218720 -12.599 < 0.0000000000000002 ***
## Age         -0.043313   0.008982  -4.822           0.00000142 ***
## SibSp       -0.370135   0.116381  -3.180              0.00147 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 944.09  on 711  degrees of freedom
## Residual deviance: 620.43  on 706  degrees of freedom
## AIC: 632.43
## 
## Number of Fisher Scoring iterations: 5
titanic_m_back2 <- stepAIC(titanic_m_all, direction = "backward")
## Start:  AIC=639.87
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  3   618.52 634.52
## - Parch     1   618.27 638.27
## - Fare      1   619.34 639.34
## <none>          617.87 639.87
## - SibSp     1   627.95 647.95
## - Age       1   642.11 662.11
## - Pclass    2   663.39 681.39
## - Sex       1   803.56 823.56
## 
## Step:  AIC=634.52
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
## 
##          Df Deviance    AIC
## - Parch   1   619.05 633.05
## - Fare    1   620.21 634.21
## <none>        618.52 634.52
## - SibSp   1   629.32 643.32
## - Age     1   642.99 656.99
## - Pclass  2   665.53 677.53
## - Sex     1   812.87 826.87
## 
## Step:  AIC=633.05
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   620.43 632.43
## <none>        619.05 633.05
## - SibSp   1   632.89 644.89
## - Age     1   643.38 655.38
## - Pclass  2   668.62 678.62
## - Sex     1   815.52 827.52
## 
## Step:  AIC=632.43
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        620.43 632.43
## - SibSp   1   633.16 643.16
## - Age     1   645.99 655.99
## - Pclass  2   708.40 716.40
## - Sex     1   821.16 831.16

using stepAIC, the optimum AIC (this means, the smallest value of AIC), is using the variables: Pclass, Sex, Age, and SibSp. Same value with the step backwards, so next we only use the step backwards.

Predicting the model

titanic_predict <- predict(titanic_m_back, titanic_test)
class(titanic_predict)
## [1] "numeric"
titanic_test$titanic_prob <- predict(titanic_m_back, titanic_test, type = "response")

Visualize the probability distribution.

ggplot(titanic_test, aes(x=titanic_prob))+geom_density(lwd=0.5)+theme_minimal()

on this plot, the probability on our test data is skewed to 0, means our data distribute to Not Survived.

titanic_test$titanic_predict <- factor(ifelse(titanic_test$titanic_prob > 0.5, "1", "0"))
head(titanic_test[1:10, c("titanic_predict", "Survived")])
##    titanic_predict Survived
## 3                1        1
## 4                1        1
## 20               1        1
## 28               0        0
## 30               0        0
## 40               1        1

Model Evaluation

confusionMatrix(titanic_test$titanic_predict, titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 21
##          1 21 52
##                                           
##                Accuracy : 0.7654          
##                  95% CI : (0.6964, 0.8254)
##     No Information Rate : 0.5922          
##     P-Value [Acc > NIR] : 0.0000008154    
##                                           
##                   Kappa : 0.5142          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7123          
##             Specificity : 0.8019          
##          Pos Pred Value : 0.7123          
##          Neg Pred Value : 0.8019          
##              Prevalence : 0.4078          
##          Detection Rate : 0.2905          
##    Detection Prevalence : 0.4078          
##       Balanced Accuracy : 0.7571          
##                                           
##        'Positive' Class : 1               
## 

Model Interpretaion

Model Interpretation: the ability of the model predicting the Y target is correct, (Accuracy of the model to predict ‘Survived’ and ‘Not Survived’) is at 76.5%. The ability of the model predicting the Actual Positive data (Sensitivity) is correct, is at 71.2%. And the ability of the model predicitng the Actual Negative data (Specificity) is at 80.2%.

k-Nearest Neighbor

Data Manipulation

titanicnum <- titanic %>% 
  dplyr::select(-c(PassengerId, Name, Cabin, Ticket)) %>% 
  mutate(Sex = as.numeric(Sex),
         Embarked = as.numeric(Embarked)) %>% 
  mutate(Age = replace_na(Age, replace = mean(Age, na.rm = T)))

# assigning label for target
titanicnum$Survived <- factor(titanicnum$Survived, levels = c("0", "1"), labels = c("NotSurvived", "Survived"))
str(titanicnum)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "NotSurvived",..: 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : num  2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: num  4 2 4 4 4 3 4 4 4 2 ...
# target class proportion
prop.table(table(titanicnum$Survived))
## 
## NotSurvived    Survived 
##   0.6161616   0.3838384
summary(titanicnum)
##         Survived       Pclass           Sex             Age       
##  NotSurvived:549   Min.   :1.000   Min.   :1.000   Min.   : 0.42  
##  Survived   :342   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:22.00  
##                    Median :3.000   Median :2.000   Median :29.70  
##                    Mean   :2.309   Mean   :1.648   Mean   :29.70  
##                    3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:35.00  
##                    Max.   :3.000   Max.   :2.000   Max.   :80.00  
##      SibSp           Parch             Fare           Embarked   
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   1st Qu.:3.00  
##  Median :0.000   Median :0.0000   Median : 14.45   Median :4.00  
##  Mean   :0.523   Mean   :0.3816   Mean   : 32.20   Mean   :3.53  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00   3rd Qu.:4.00  
##  Max.   :8.000   Max.   :6.0000   Max.   :512.33   Max.   :4.00
colSums(is.na(titanicnum))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

Cross Validation

set.seed(249)
splitting <- initial_split(data = titanicnum, prop = 0.75, strata = "Survived")
titanic_train <- training(splitting)
titanic_test <- testing(splitting)

prop.table(table(titanic_train$Survived))
## 
## NotSurvived    Survived 
##   0.6158445   0.3841555

Data Pre-processing

Scaling using Z-score Standarization

#  for predictor data
titanic_train_p <- scale(titanic_train[,-1])
titanic_test_p <- scale(titanic_test[,-1],
                        center = attr(titanic_train_p, "scaled:center"),
                        scale = attr(titanic_train_p, "scaled:scale"))
# for label (target) data
titanic_train_l <- titanic_train[,1]
titanic_test_l <- titanic_test[,1]

Predicting Data

# find optimum k
round(sqrt(nrow(titanic_train)),0)
## [1] 26

since the result is an even number ‘26’, we will use ‘27’ as the k value.

titanic_knn_predict <- knn(train = titanic_train_p, test = titanic_test_p, cl = titanic_train_l, k = 27)
head(titanic_knn_predict)
## [1] NotSurvived Survived    NotSurvived Survived    NotSurvived NotSurvived
## Levels: NotSurvived Survived

Model Evaluation

class(titanic_knn_predict)
## [1] "factor"
confusionMatrix(data = titanic_knn_predict, reference = titanic_test_l, positive = "Survived")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    NotSurvived Survived
##   NotSurvived         129       33
##   Survived              8       52
##                                           
##                Accuracy : 0.8153          
##                  95% CI : (0.7579, 0.8641)
##     No Information Rate : 0.6171          
##     P-Value [Acc > NIR] : 0.0000000001314 
##                                           
##                   Kappa : 0.5861          
##                                           
##  Mcnemar's Test P-Value : 0.0001781       
##                                           
##             Sensitivity : 0.6118          
##             Specificity : 0.9416          
##          Pos Pred Value : 0.8667          
##          Neg Pred Value : 0.7963          
##              Prevalence : 0.3829          
##          Detection Rate : 0.2342          
##    Detection Prevalence : 0.2703          
##       Balanced Accuracy : 0.7767          
##                                           
##        'Positive' Class : Survived        
## 

Model Interpretaion

Model Interpretation: the ability of the variable data predicting the Y target is correct, (Accuracy of the model to predict ‘Survived’ and ‘Not Survived’) is at 80.2%. The ability of the model predicting the Actual Positive data (Sensitivity) is correct, is at 60.0%. And the ability of the model predicitng the Actual Negative data (Specificity) is at 97.7%.

Summary & Suggestion

From the two method of Logistic Regression and k-Nearest Neighbor, the ability to predict the Actual Positive (True Positive) of Survived person in Titanic Cruise, is with Logistic Regression with predictors: Pclass, Sex, Age, and SibSp. It is better because it has the Recall/Sensitivity value at 71.2%, higher than the Recall/Sensitivity from k-NN at 60%.