Case: Using the Titanic passenger data, we want to predict whether a passenger survived. The data can be downloaded here: https://www.kaggle.com/c/titanic

Import Data

titanic_train <- read.csv('train.csv')
titanic_test <- read.csv('test.csv')
head(titanic_train)
#head(titanic_test)

Data Wrangling

To make wrangling easier, combine the train and test data into one dataframe.

library(dplyr)
# the test set has no Survived labels, so add the column as NA before combining
titanic_test$Survived <- NA

titanic <- rbind(titanic_train, titanic_test) %>% 
    mutate(Survived = as.factor(Survived))
  • Check structure of data
str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Here is some information about the columns in the Titanic data:

  • PassengerId: ID number of the passenger
  • Survived: whether the passenger survived or not
  • Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
  • Name: name of the passenger
  • Sex: sex of the passenger
  • Age: age of the passenger
  • SibSp: number of siblings/spouses aboard; Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch: number of parents/children aboard; Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so Parch = 0 for them
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation

  • Check proportion
prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The class proportion is somewhat imbalanced (about 62% did not survive vs. 38% survived), so we balance it with downsampling.

  • Balancing class proportion with downsampling
# downsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Loading required package: lattice
titanic <- downSample(x = titanic %>% select(-Survived),
                      y = titanic$Survived,
                      yname = "Survived") # name of the target column

head(titanic)
# check the class proportion again
prop.table(table(titanic$Survived))
## 
##   0   1 
## 0.5 0.5

Our data now has a balanced class proportion.
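An alternative is caret's upSample(), which balances the classes by resampling the minority class instead of discarding majority rows; it would be run in place of downSample() above. A minimal sketch:

# sketch: upsampling alternative (run instead of downSample() above)
# titanic <- upSample(x = titanic %>% select(-Survived),
#                     y = titanic$Survived,
#                     yname = "Survived")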

  • Check missing value
anyNA(titanic)
## [1] TRUE
# remove missing value
titanic <- titanic %>% na.omit()
anyNA(titanic)
## [1] FALSE
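Dropping rows is the simplest fix but discards data. A common alternative (not used here) is imputation, for example filling missing Age values with the median; a minimal sketch:

# sketch: impute missing Age with the median instead of dropping rows
# (illustrative only; the analysis above keeps na.omit())
titanic_imputed <- titanic %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))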

Cross Validation

Split the data into 80% train and 20% test. The training proportion is larger than the test proportion because we want the model to learn from more examples.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# index sampling
index <- sample(x = nrow(titanic), 
                size = nrow(titanic)*0.8)

# splitting
titanic_train <- titanic[index,] 
titanic_test <- titanic[-index,]
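Note that sample() ignores the class labels. caret's createDataPartition() offers a stratified alternative that preserves the Survived proportions in both subsets; a hedged sketch (the *_strat names are illustrative):

# sketch: stratified split preserving class proportions
idx_strat <- createDataPartition(titanic$Survived, p = 0.8, list = FALSE)
train_strat <- titanic[idx_strat, ]
test_strat  <- titanic[-idx_strat, ]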
  • Check the training data class proportion
prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.4920635 0.5079365

The class proportion in the training data is balanced enough.

Data Pre-processing

Logistic Regression

  • Build model with predictors Pclass, Sex, Age, SibSp, Parch, and Embarked
model_logistic <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
                   data = titanic,
                   family = "binomial")
  • Model summary
summary(model_logistic)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Embarked, family = "binomial", data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8565  -0.7227   0.1903   0.6342   2.2393  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  18.276310 604.089038   0.030  0.97586    
## Pclass       -1.173481   0.162927  -7.202 5.91e-13 ***
## Sexmale      -2.729552   0.263541 -10.357  < 2e-16 ***
## Age          -0.048796   0.009219  -5.293 1.20e-07 ***
## SibSp        -0.436793   0.146866  -2.974  0.00294 ** 
## Parch        -0.132034   0.137642  -0.959  0.33743    
## EmbarkedC   -11.636185 604.088855  -0.019  0.98463    
## EmbarkedQ   -13.033583 604.089100  -0.022  0.98279    
## EmbarkedS   -12.241519 604.088813  -0.020  0.98383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 763.81  on 551  degrees of freedom
## Residual deviance: 491.01  on 543  degrees of freedom
## AIC: 509.01
## 
## Number of Fisher Scoring iterations: 13

The log-of-odds values from the logistic model are hard to interpret directly, so we convert them to odds with exp().

exp(model_logistic$coefficients)
##  (Intercept)       Pclass      Sexmale          Age        SibSp        Parch 
## 8.655670e+07 3.092885e-01 6.524851e-02 9.523755e-01 6.461050e-01 8.763108e-01 
##    EmbarkedC    EmbarkedQ    EmbarkedS 
## 8.840344e-06 2.185681e-06 4.825871e-06

Each coefficient's odds ratio tells us how the odds of survival change per unit increase in that variable. According to the output above, each one-step increase in Pclass (toward a lower class) multiplies the odds of survival by about 0.31; being male multiplies them by about 0.07 relative to female; each additional year of age multiplies them by about 0.95; and each additional sibling/spouse aboard multiplies them by about 0.65. Parch and Embarked are not statistically significant in this model.
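Odds convert back to probability via p = odds / (1 + odds); equivalently, predict() with type = "response" returns the probability directly. A sketch for a hypothetical passenger (the feature values below are made up for illustration):

# sketch: survival probability for a hypothetical passenger
new_passenger <- data.frame(Pclass = 1, Sex = "female", Age = 30,
                            SibSp = 0, Parch = 0, Embarked = "S")
predict(model_logistic, newdata = new_passenger, type = "response")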

  • Feature Selection

Choose only the significant variables for the model using backward stepwise elimination.

model_step <- step(model_logistic, direction = 'backward', trace = F)
summary(model_step)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9667  -0.7100   0.2112   0.6222   2.2415  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.240306   0.630638   9.895  < 2e-16 ***
## Pclass      -1.270671   0.159013  -7.991 1.34e-15 ***
## Sexmale     -2.646674   0.247856 -10.678  < 2e-16 ***
## Age         -0.049091   0.009126  -5.379 7.48e-08 ***
## SibSp       -0.483300   0.137687  -3.510 0.000448 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 763.81  on 551  degrees of freedom
## Residual deviance: 497.75  on 547  degrees of freedom
## AIC: 507.75
## 
## Number of Fisher Scoring iterations: 5
  • Predict on the test data
titanic_test$predict_value <- predict(object = model_step,
                      newdata = titanic_test,
                      type = "response")

The prediction results are probabilities; transform them into class labels matching the target variable (Survived).

unique(titanic_test$Survived)
## [1] 0 1
## Levels: 0 1
titanic_test$pred_label <- ifelse(test = titanic_test$predict_value > 0.5,
                                yes = "1",
                                no = "0")

titanic_test$pred_label <- as.factor(titanic_test$pred_label)
head(titanic_test)
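The 0.5 cutoff is only a default. When recall matters more than precision, the threshold can be tuned, for instance from the ROC curve with the pROC package (assuming it is installed); a minimal sketch:

# sketch: threshold that balances sensitivity and specificity
library(pROC)
roc_obj <- roc(response = titanic_test$Survived,
               predictor = titanic_test$predict_value)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))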
  • Model evaluation with confusionMatrix from library caret
library(caret)
confusionMatrix(data = titanic_test$pred_label,
                reference = titanic_test$Survived,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35 14
##          1 10 52
##                                           
##                Accuracy : 0.7838          
##                  95% CI : (0.6956, 0.8563)
##     No Information Rate : 0.5946          
##     P-Value [Acc > NIR] : 1.946e-05       
##                                           
##                   Kappa : 0.5578          
##                                           
##  Mcnemar's Test P-Value : 0.5403          
##                                           
##             Sensitivity : 0.7879          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.5946          
##          Detection Rate : 0.4685          
##    Detection Prevalence : 0.5586          
##       Balanced Accuracy : 0.7828          
##                                           
##        'Positive' Class : 1               
## 

The logistic model has about 78% accuracy, 79% sensitivity, 78% specificity, and 84% precision. Next, we compare it with another model (K-Nearest Neighbor) to see which predicts better.

K-Nearest Neighbor

  • Data cleaning: KNN needs numeric predictors, so drop the non-numeric columns (Fare is also dropped here)
titanic_clean <- titanic %>% select(-c(Name, Sex, Ticket, Fare, Cabin, Embarked))
  • Check class proportion
prop.table(table(titanic_clean$Survived))
## 
##         0         1 
## 0.4746377 0.5253623
  • Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(257)

# data train 80%
index <- sample(x = nrow(titanic_clean), 
                size = nrow(titanic_clean)*0.8)

# splitting
titanic_train_for_knn <- titanic_clean[index, ]
titanic_test_for_knn <-  titanic_clean[-index, ]

prop.table(table(titanic_train_for_knn$Survived)) # recheck class proportion
## 
##         0         1 
## 0.4829932 0.5170068

The class proportion is balanced enough.

  • Before building the KNN model, separate the target and the predictors in both the train and test data.
# predictor train
titanic_train_predictor <- titanic_train_for_knn %>% 
  select(-Survived)

# target train
titanic_train_target <- titanic_train_for_knn %>% 
  pull(Survived) 


# predictor test
titanic_test_predictor <- titanic_test_for_knn %>% 
  select(-Survived)

# target test
titanic_test_target <- titanic_test_for_knn %>% 
  pull(Survived)
  • Scale the training predictors to standardize their ranges, then apply the same center and scale to the test predictors
titanic_train_predictor_scale <- titanic_train_predictor %>% 
  scale()
titanic_test_predictor_scale <- titanic_test_predictor %>% 
  scale(center = attr(titanic_train_predictor_scale,"scaled:center"),
        scale = attr(titanic_train_predictor_scale,"scaled:scale"))
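A quick sanity check: because the test set is scaled with the training statistics, the scaled training columns should have mean ≈ 0 and standard deviation ≈ 1, while the test columns will be close but not exact:

# sanity check: training predictors centered at 0 with unit variance
round(colMeans(titanic_train_predictor_scale), 3)
round(apply(titanic_train_predictor_scale, 2, sd), 3)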
  • Predict with knn() from library class

Finding the optimal k

sqrt(nrow(titanic_train_predictor))
## [1] 21

The square root is about 21; with two classes an odd k would avoid ties, but here we use the nearby value k = 20.
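Since sqrt(n) is only a rule of thumb, one can also scan nearby odd k values and compare test accuracy; a minimal sketch:

# sketch: scan odd k values near sqrt(n) and compare test accuracy
library(class)
ks <- seq(15, 27, by = 2)
accs <- sapply(ks, function(k) {
  pred <- knn(train = titanic_train_predictor_scale,
              test  = titanic_test_predictor_scale,
              cl    = titanic_train_target,
              k     = k)
  mean(pred == titanic_test_target)
})
setNames(accs, ks)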

library(class)
titanic_pred <- knn(train = titanic_train_predictor_scale,
                   test = titanic_test_predictor_scale,
                   cl = titanic_train_target,
                   k = 20)
titanic_pred
##   [1] 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1
##  [38] 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1
##  [75] 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1
## Levels: 0 1
  • Model evaluation with confusionMatrix from library caret
library(caret)
confusionMatrix(data = titanic_pred, reference = titanic_test_target, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36 17
##          1 13 45
##                                           
##                Accuracy : 0.7297          
##                  95% CI : (0.6372, 0.8096)
##     No Information Rate : 0.5586          
##     P-Value [Acc > NIR] : 0.0001498       
##                                           
##                   Kappa : 0.4566          
##                                           
##  Mcnemar's Test P-Value : 0.5838824       
##                                           
##             Sensitivity : 0.7258          
##             Specificity : 0.7347          
##          Pos Pred Value : 0.7759          
##          Neg Pred Value : 0.6792          
##              Prevalence : 0.5586          
##          Detection Rate : 0.4054          
##    Detection Prevalence : 0.5225          
##       Balanced Accuracy : 0.7303          
##                                           
##        'Positive' Class : 1               
## 

The KNN model has about 73% accuracy, 73% sensitivity, 73% specificity, and 78% precision.

Conclusion

The logistic model has:

  • 78% accuracy
  • 79% sensitivity
  • 78% specificity
  • 84% precision

The KNN model has:

  • 73% accuracy
  • 73% sensitivity
  • 73% specificity
  • 78% precision
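Collecting both sets of test metrics side by side (values copied from the confusion matrices above):

# side-by-side comparison of the two models' test metrics
data.frame(model       = c("Logistic", "KNN"),
           accuracy    = c(0.784, 0.730),
           sensitivity = c(0.788, 0.726),
           specificity = c(0.778, 0.735),
           precision   = c(0.839, 0.776))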

In this case we focus on recall (sensitivity). By that metric, the logistic regression model is better than the KNN model at predicting which passengers survived the Titanic disaster.