Case: Using the Titanic passenger data, we want to predict whether a passenger survived. The data can be downloaded here: https://www.kaggle.com/c/titanic

Import Data

titanic_train <- read.csv('train.csv')
titanic_test <- read.csv('test.csv')
head(titanic_train)
#head(titanic_test)

Data Wrangling

To make wrangling easier, combine the train and test data into one dataframe.

library(dplyr)
# the test set has no Survived labels, so add the column as NA before combining
titanic_test$Survived <- NA

titanic <- rbind(titanic_train, titanic_test) %>% 
    mutate(Survived = as.factor(Survived))
  • Check structure of data
str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Here is some information about the columns in the Titanic data:

  • PassengerId: ID number of the passenger
  • Survived: whether the passenger survived or not
  • Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
  • Name: name of the passenger
  • Sex: sex of the passenger
  • Age: age of the passenger
  • SibSp: number of siblings/spouses aboard; Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch: number of parents/children aboard; Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so Parch = 0 for them
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation

  • Check proportion
prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The class proportion is somewhat imbalanced (about 62% did not survive vs. 38% survived), so we balance it with downsampling.

  • Balancing class proportion with downsampling
# downsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Loading required package: lattice
titanic <- downSample(x = titanic %>% select(-Survived),
                      y = titanic$Survived,
                      yname = "Survived") # name of the target column

head(titanic)
# check the class proportion again
prop.table(table(titanic$Survived))
## 
##   0   1 
## 0.5 0.5

Our data now has a balanced class proportion.
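An alternative is caret's upSample(), which balances the classes by resampling the minority class instead of discarding majority rows; it would be run in place of downSample() above. A minimal sketch:

# sketch: upsampling alternative (run instead of downSample() above)
# titanic <- upSample(x = titanic %>% select(-Survived),
#                     y = titanic$Survived,
#                     yname = "Survived")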

  • Check missing value
anyNA(titanic)
## [1] TRUE
# remove missing value
titanic <- titanic %>% na.omit()
anyNA(titanic)
## [1] FALSE
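Dropping rows is the simplest fix but discards data. A common alternative (not used here) is imputation, for example filling missing Age values with the median; a minimal sketch:

# sketch: impute missing Age with the median instead of dropping rows
# (illustrative only; the analysis above keeps na.omit())
titanic_imputed <- titanic %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))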

Cross Validation

Split the data into 80% train and 20% test. The training proportion is larger than the test proportion because we want the model to learn from more examples.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# index sampling
index <- sample(x = nrow(titanic), 
                size = nrow(titanic)*0.8)

# splitting
titanic_train <- titanic[index,] 
titanic_test <- titanic[-index,]
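Note that sample() ignores the class labels. caret's createDataPartition() offers a stratified alternative that preserves the Survived proportions in both subsets; a hedged sketch (the *_strat names are illustrative):

# sketch: stratified split preserving class proportions
idx_strat <- createDataPartition(titanic$Survived, p = 0.8, list = FALSE)
train_strat <- titanic[idx_strat, ]
test_strat  <- titanic[-idx_strat, ]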
  • Check the training data class proportion
prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.4920635 0.5079365

The class proportion in the training data is balanced enough.

Data Pre-processing

Logistic Regression

  • Build model with predictors Pclass, Sex, Age, SibSp, Parch, and Embarked
model_logistic <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
                   data = titanic,
                   family = "binomial")
  • Model summary
summary(model_logistic)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Embarked, family = "binomial", data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8565  -0.7227   0.1903   0.6342   2.2393  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  18.276310 604.089038   0.030  0.97586    
## Pclass       -1.173481   0.162927  -7.202 5.91e-13 ***
## Sexmale      -2.729552   0.263541 -10.357  < 2e-16 ***
## Age          -0.048796   0.009219  -5.293 1.20e-07 ***
## SibSp        -0.436793   0.146866  -2.974  0.00294 ** 
## Parch        -0.132034   0.137642  -0.959  0.33743    
## EmbarkedC   -11.636185 604.088855  -0.019  0.98463    
## EmbarkedQ   -13.033583 604.089100  -0.022  0.98279    
## EmbarkedS   -12.241519 604.088813  -0.020  0.98383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 763.81  on 551  degrees of freedom
## Residual deviance: 491.01  on 543  degrees of freedom
## AIC: 509.01
## 
## Number of Fisher Scoring iterations: 13

The log-of-odds values from the logistic model are hard to interpret directly, so we convert them to odds with exp().

exp(model_logistic$coefficients)
##  (Intercept)       Pclass      Sexmale          Age        SibSp        Parch 
## 8.655670e+07 3.092885e-01 6.524851e-02 9.523755e-01 6.461050e-01 8.763108e-01 
##    EmbarkedC    EmbarkedQ    EmbarkedS 
## 8.840344e-06 2.185681e-06 4.825871e-06

Each coefficient's odds ratio tells us how the odds of survival change per unit increase in that variable. According to the output above, each one-step increase in Pclass (toward a lower class) multiplies the odds of survival by about 0.31; being male multiplies them by about 0.07 relative to female; each additional year of age multiplies them by about 0.95; and each additional sibling/spouse aboard multiplies them by about 0.65. Parch and Embarked are not statistically significant in this model.
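Odds convert back to probability via p = odds / (1 + odds); equivalently, predict() with type = "response" returns the probability directly. A sketch for a hypothetical passenger (the feature values below are made up for illustration):

# sketch: survival probability for a hypothetical passenger
new_passenger <- data.frame(Pclass = 1, Sex = "female", Age = 30,
                            SibSp = 0, Parch = 0, Embarked = "S")
predict(model_logistic, newdata = new_passenger, type = "response")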

  • Feature Selection

Choose only the significant variables for the model using backward stepwise elimination.

model_step <- step(model_logistic, direction = 'backward', trace = F)
summary(model_step)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9667  -0.7100   0.2112   0.6222   2.2415  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.240306   0.630638   9.895  < 2e-16 ***
## Pclass      -1.270671   0.159013  -7.991 1.34e-15 ***
## Sexmale     -2.646674   0.247856 -10.678  < 2e-16 ***
## Age         -0.049091   0.009126  -5.379 7.48e-08 ***
## SibSp       -0.483300   0.137687  -3.510 0.000448 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 763.81  on 551  degrees of freedom
## Residual deviance: 497.75  on 547  degrees of freedom
## AIC: 507.75
## 
## Number of Fisher Scoring iterations: 5
  • Predict on the test data
titanic_test$predict_value <- predict(object = model_step,
                      newdata = titanic_test,
                      type = "response")

The prediction results are probabilities; transform them into class labels matching the target variable (Survived).

unique(titanic_test$Survived)
## [1] 0 1
## Levels: 0 1
titanic_test$pred_label <- ifelse(test = titanic_test$predict_value > 0.5,
                                yes = "1",
                                no = "0")

titanic_test$pred_label <- as.factor(titanic_test$pred_label)
head(titanic_test)
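The 0.5 cutoff is only a default. When recall matters more than precision, the threshold can be tuned, for instance from the ROC curve with the pROC package (assuming it is installed); a minimal sketch:

# sketch: threshold that balances sensitivity and specificity
library(pROC)
roc_obj <- roc(response = titanic_test$Survived,
               predictor = titanic_test$predict_value)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))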
  • Model evaluation with confusionMatrix from library caret
library(caret)
confusionMatrix(data = titanic_test$pred_label,
                reference = titanic_test$Survived,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35 14
##          1 10 52
##                                           
##                Accuracy : 0.7838          
##                  95% CI : (0.6956, 0.8563)
##     No Information Rate : 0.5946          
##     P-Value [Acc > NIR] : 1.946e-05       
##                                           
##                   Kappa : 0.5578          
##                                           
##  Mcnemar's Test P-Value : 0.5403          
##                                           
##             Sensitivity : 0.7879          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.5946          
##          Detection Rate : 0.4685          
##    Detection Prevalence : 0.5586          
##       Balanced Accuracy : 0.7828          
##                                           
##        'Positive' Class : 1               
## 

The logistic model has about 78% accuracy, 79% sensitivity, 78% specificity, and 84% precision. Next, we compare it with another model (K-Nearest Neighbor) to see which predicts better.

K-Nearest Neighbor

  • Data cleaning: KNN needs numeric predictors, so drop the non-numeric columns (Fare is also dropped here)
titanic_clean <- titanic %>% select(-c(Name, Sex, Ticket, Fare, Cabin, Embarked))
  • Check class proportion
prop.table(table(titanic_clean$Survived))
## 
##         0         1 
## 0.4746377 0.5253623
  • Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(257)

# data train 80%
index <- sample(x = nrow(titanic_clean), 
                size = nrow(titanic_clean)*0.8)

# splitting
titanic_train_for_knn <- titanic_clean[index, ]
titanic_test_for_knn <-  titanic_clean[-index, ]

prop.table(table(titanic_train_for_knn$Survived)) # recheck class proportion
## 
##         0         1 
## 0.4829932 0.5170068

The class proportion is balanced enough.

  • Before building the KNN model, separate the target and the predictors in both the train and test data.
# predictor train
titanic_train_predictor <- titanic_train_for_knn %>% 
  select(-Survived)

# target train
titanic_train_target <- titanic_train_for_knn %>% 
  pull(Survived) 


# predictor test
titanic_test_predictor <- titanic_test_for_knn %>% 
  select(-Survived)

# target test
titanic_test_target <- titanic_test_for_knn %>% 
  pull(Survived)
  • Scale the training predictors to standardize their ranges, then apply the same center and scale to the test predictors
titanic_train_predictor_scale <- titanic_train_predictor %>% 
  scale()
titanic_test_predictor_scale <- titanic_test_predictor %>% 
  scale(center = attr(titanic_train_predictor_scale,"scaled:center"),
        scale = attr(titanic_train_predictor_scale,"scaled:scale"))
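A quick sanity check: because the test set is scaled with the training statistics, the scaled training columns should have mean ≈ 0 and standard deviation ≈ 1, while the test columns will be close but not exact:

# sanity check: training predictors centered at 0 with unit variance
round(colMeans(titanic_train_predictor_scale), 3)
round(apply(titanic_train_predictor_scale, 2, sd), 3)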
  • Predict with knn() from library class

Finding the optimal k

sqrt(nrow(titanic_train_predictor))
## [1] 21

The square root is about 21; with two classes an odd k would avoid ties, but here we use the nearby value k = 20.
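Since sqrt(n) is only a rule of thumb, one can also scan nearby odd k values and compare test accuracy; a minimal sketch:

# sketch: scan odd k values near sqrt(n) and compare test accuracy
library(class)
ks <- seq(15, 27, by = 2)
accs <- sapply(ks, function(k) {
  pred <- knn(train = titanic_train_predictor_scale,
              test  = titanic_test_predictor_scale,
              cl    = titanic_train_target,
              k     = k)
  mean(pred == titanic_test_target)
})
setNames(accs, ks)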

library(class)
titanic_pred <- knn(train = titanic_train_predictor_scale,
                   test = titanic_test_predictor_scale,
                   cl = titanic_train_target,
                   k = 20)
titanic_pred
##   [1] 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1
##  [38] 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1
##  [75] 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1
## Levels: 0 1
  • Model evaluation with confusionMatrix from library caret
library(caret)
confusionMatrix(data = titanic_pred, reference = titanic_test_target, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36 17
##          1 13 45
##                                           
##                Accuracy : 0.7297          
##                  95% CI : (0.6372, 0.8096)
##     No Information Rate : 0.5586          
##     P-Value [Acc > NIR] : 0.0001498       
##                                           
##                   Kappa : 0.4566          
##                                           
##  Mcnemar's Test P-Value : 0.5838824       
##                                           
##             Sensitivity : 0.7258          
##             Specificity : 0.7347          
##          Pos Pred Value : 0.7759          
##          Neg Pred Value : 0.6792          
##              Prevalence : 0.5586          
##          Detection Rate : 0.4054          
##    Detection Prevalence : 0.5225          
##       Balanced Accuracy : 0.7303          
##                                           
##        'Positive' Class : 1               
## 

The KNN model has about 73% accuracy, 73% sensitivity, 73% specificity, and 78% precision.

Conclusion

The logistic model has:

  • 78% accuracy
  • 79% sensitivity
  • 78% specificity
  • 84% precision

The KNN model has:

  • 73% accuracy
  • 73% sensitivity
  • 73% specificity
  • 78% precision
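Collecting both sets of test metrics side by side (values copied from the confusion matrices above):

# side-by-side comparison of the two models' test metrics
data.frame(model       = c("Logistic", "KNN"),
           accuracy    = c(0.784, 0.730),
           sensitivity = c(0.788, 0.726),
           specificity = c(0.778, 0.735),
           precision   = c(0.839, 0.776))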

In this case we focus on recall (sensitivity). By that metric, the logistic regression model is better than the KNN model at predicting which passengers survived the Titanic disaster.