Titanic Classification

Data Preparation

Load package

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
## Warning: package 'gtools' was built under R version 4.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
library(class)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Load Dataset

Load the data and inspect its structure.

titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Data description:

  • PassengerId = passenger identification number
  • Survived = survival (0 = No, 1 = Yes)
  • Pclass = ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name = passenger's full name
  • Sex = sex
  • Age = age in years
  • SibSp = number of siblings/spouses aboard the Titanic
  • Parch = number of parents/children aboard the Titanic
  • Ticket = ticket number
  • Fare = passenger fare
  • Cabin = cabin number
  • Embarked = port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data Wrangling

Convert the categorical columns to factors.

titanic <- titanic %>% 
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         SibSp = as.factor(SibSp),
         Parch = as.factor(Parch),
         Embarked = as.factor(Embarked))
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

Check whether the dataset contains any missing values.

anyNA(titanic)
## [1] TRUE

The result is TRUE, which means the dataset contains missing values.

Count the missing values in each column.

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
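Note that `is.na()` only counts true `NA` values. In the `str()` output above, Cabin contains empty strings (`""`), which are not flagged as missing. The following is a minimal sketch, using a small made-up data frame rather than the real dataset, of how blank strings could be counted alongside `NA`:

```r
# Toy data frame (made up for illustration, not the Titanic data)
toy <- data.frame(
  Age   = c(22, NA, 26, 35),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# is.na() sees only real NA values, so Cabin looks complete
na_counts <- colSums(is.na(toy))

# Treat empty strings as missing as well
blank_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))

print(na_counts)
print(blank_counts)
```

Applied to the real data, a check like this would show how sparse Cabin is even though `colSums(is.na(titanic))` reports zero for it.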

Because the Age column has many missing values (177), we fill them with the column mean.

titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Before splitting the data into a training set and a test set, we check the class proportions of the target variable.

prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

Judging by the proportions of the two classes (about 62% versus 38%), the data is reasonably balanced, so no additional pre-processing is needed.

Splitting Train and Test Data

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100) 

index <- sample(nrow(titanic), nrow(titanic)*0.7)

titanic_train <- titanic[index,]
titanic_test <- titanic[-index,]
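Since 70% of 891 is not a whole number, it can help to make the resulting set sizes explicit. The sketch below is a standalone check that does not touch the `titanic` objects; it uses `floor()` to state the truncation openly. The sizes it yields are consistent with the 622 null degrees of freedom and the 268 test rows that appear in the outputs further down.

```r
n <- 891                      # number of rows reported by str() above
train_size <- floor(n * 0.7)  # 623.7 rounded down to a whole number of rows

set.seed(100)
idx <- sample(n, train_size)  # row indices for the training set

train_rows <- length(idx)
test_rows  <- n - train_rows
print(c(train = train_rows, test = test_rows))
```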

Modelling

model1 <- glm(formula = Survived~Age+Sex+Pclass+Cabin+Embarked, family = "binomial", data = titanic_train)

summary(model1)
## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass + Cabin + Embarked, 
##     family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7493  -0.5801  -0.3941   0.5423   2.4457  
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.67751    0.62935   4.254  2.1e-05 ***
## Age                -0.03846    0.01072  -3.586 0.000336 ***
## Sexmale            -2.26008    0.23544  -9.600  < 2e-16 ***
## Pclass2            -0.61454    0.50687  -1.212 0.225357    
## Pclass3            -1.62616    0.48796  -3.333 0.000860 ***
## CabinA10          -18.11660 6522.63862  -0.003 0.997784    
## CabinA16           17.21695 6522.63862   0.003 0.997894    
## CabinA19          -17.84135 6522.63862  -0.003 0.997818    
## CabinA20           19.51549 6522.63861   0.003 0.997613    
## CabinA23           21.22523 6522.63864   0.003 0.997404    
## CabinA24          -17.79132 6522.63862  -0.003 0.997824    
## CabinA31           19.16937 6522.63863   0.003 0.997655    
## CabinA32          -17.84135 6522.63862  -0.003 0.997818    
## CabinA34           18.30246 6522.63862   0.003 0.997761    
## CabinA6            19.22544 6522.63863   0.003 0.997648    
## CabinB101          18.97708 6522.63862   0.003 0.997679    
## CabinB102         -17.84135 6522.63862  -0.003 0.997818    
## CabinB19          -16.63760 6522.63862  -0.003 0.997965    
## CabinB20           18.31233 3786.64085   0.005 0.996141    
## CabinB22           17.27302 6522.63863   0.003 0.997887    
## CabinB28           18.27292 6522.63863   0.003 0.997765    
## CabinB3            17.54222 6522.63861   0.003 0.997854    
## CabinB30          -17.00133 6522.63863  -0.003 0.997920    
## CabinB35           16.29397 6522.63862   0.002 0.998007    
## CabinB38          -17.25292 6522.63862  -0.003 0.997890    
## CabinB42           16.61925 6522.63862   0.003 0.997967    
## CabinB49           16.10169 6522.63862   0.002 0.998030    
## CabinB5            16.75538 4577.63558   0.004 0.997080    
## CabinB50           18.86171 6522.63861   0.003 0.997693    
## CabinB51 B53 B55    0.65057    1.48765   0.437 0.661885    
## CabinB58 B60       -0.64211    1.55908  -0.412 0.680448    
## CabinB73           17.04228 6522.63861   0.003 0.997915    
## CabinB77           17.15765 6522.63863   0.003 0.997901    
## CabinB78           16.51315 6522.63861   0.003 0.997980    
## CabinB80           17.60153 6522.63862   0.003 0.997847    
## CabinB82 B84      -17.73202 6522.63862  -0.003 0.997831    
## CabinB94          -17.44521 6522.63862  -0.003 0.997866    
## CabinB96 B98       18.25259 2778.50026   0.007 0.994759    
## CabinC101          17.92680 6522.63863   0.003 0.997807    
## CabinC103          18.11909 6522.63863   0.003 0.997784    
## CabinC104          20.14842 6522.63863   0.003 0.997535    
## CabinC106          19.29078 6522.63861   0.003 0.997640    
## CabinC123         -17.56058 6522.63862  -0.003 0.997852    
## CabinC124         -17.84135 6522.63862  -0.003 0.997818    
## CabinC125          18.11909 6522.63862   0.003 0.997784    
## CabinC126          17.03071 6522.63862   0.003 0.997917    
## CabinC128         -17.84135 6522.63862  -0.003 0.997818    
## CabinC148          18.63097 6522.63863   0.003 0.997721    
## CabinC22 C26       -2.40471    1.43129  -1.680 0.092938 .  
## CabinC23 C25 C27   -0.40764    1.35773  -0.300 0.763999    
## CabinC45           16.83238 6522.63863   0.003 0.997941    
## CabinC47           18.77323 6522.63861   0.003 0.997704    
## CabinC52           19.22544 6522.63863   0.003 0.997648    
## CabinC65           -1.39203    1.73175  -0.804 0.421496    
## CabinC68          -17.61665 6522.63862  -0.003 0.997845    
## CabinC7            17.08073 6522.63861   0.003 0.997911    
## CabinC70           18.28485 6522.63863   0.003 0.997763    
## CabinC78           -0.99371    1.83806  -0.541 0.588761    
## CabinC82          -18.46271 6522.63862  -0.003 0.997742    
## CabinC83           17.23456 6522.63862   0.003 0.997892    
## CabinC85           16.83238 6522.63861   0.003 0.997941    
## CabinC86          -17.57819 6522.63862  -0.003 0.997850    
## CabinC90           16.29397 6522.63863   0.002 0.998007    
## CabinC91          -17.52212 6522.63862  -0.003 0.997857    
## CabinC92           18.44242 3722.90778   0.005 0.996047    
## CabinC95          -17.84135 6522.63862  -0.003 0.997818    
## CabinD             17.44906 4489.48234   0.004 0.996899    
## CabinD10 D12       18.51560 6522.63863   0.003 0.997735    
## CabinD11           17.84988 6522.63863   0.003 0.997817    
## CabinD15           16.60163 6522.63862   0.003 0.997969    
## CabinD17           17.75385 4612.02267   0.004 0.996929    
## CabinD19           19.76384 6522.63861   0.003 0.997582    
## CabinD21           17.03071 6522.63863   0.003 0.997917    
## CabinD26          -17.65146 4433.57119  -0.004 0.996823    
## CabinD28           16.50387 6522.63862   0.003 0.997981    
## CabinD30          -18.25281 6522.63862  -0.003 0.997767    
## CabinD35           19.57156 6522.63863   0.003 0.997606    
## CabinD36           16.25552 6522.63863   0.002 0.998012    
## CabinD37           17.67844 6522.63862   0.003 0.997837    
## CabinD46          -17.17601 6522.63862  -0.003 0.997899    
## CabinD47           16.61925 6522.63862   0.003 0.997967    
## CabinD48          -17.27053 6522.63862  -0.003 0.997887    
## CabinD56           20.07072 6522.63861   0.003 0.997545    
## CabinD6           -17.86824 6522.63862  -0.003 0.997814    
## CabinD7            18.31137 6522.63862   0.003 0.997760    
## CabinE101          17.65043 4606.79057   0.004 0.996943    
## CabinE12           19.99459 6522.63861   0.003 0.997554    
## CabinE121          20.00553 6522.63862   0.003 0.997553    
## CabinE17           20.10996 6522.63863   0.003 0.997540    
## CabinE24           19.63448 4603.45063   0.004 0.996597    
## CabinE25           19.53310 4612.20202   0.004 0.996621    
## CabinE31          -17.21446 6522.63862  -0.003 0.997894    
## CabinE33           16.88899 4601.62468   0.004 0.997072    
## CabinE34           16.90929 6522.63863   0.003 0.997932    
## CabinE36           16.52472 6522.63863   0.003 0.997979    
## CabinE40           16.94775 6522.63863   0.003 0.997927    
## CabinE44           17.38839 6522.63862   0.003 0.997873    
## CabinE58          -17.17601 6522.63862  -0.003 0.997899    
## CabinE67            0.20234    1.82460   0.111 0.911700    
## CabinE68           16.58079 6522.63862   0.003 0.997972    
## CabinE77          -18.43697 6522.63861  -0.003 0.997745    
## CabinF E69         18.13930 6522.63860   0.003 0.997781    
## CabinF G63        -15.74214 6522.63861  -0.002 0.998074    
## CabinF G73        -16.62666 6522.63861  -0.003 0.997966    
## CabinF2             0.95663    1.51230   0.633 0.527015    
## CabinF33           17.52488 4607.72770   0.004 0.996965    
## CabinF38          -17.14205 6522.63861  -0.003 0.997903    
## CabinF4            17.93506 4001.34204   0.004 0.996424    
## CabinG6            -0.48454    1.05501  -0.459 0.646034    
## EmbarkedC           0.51756    0.32392   1.598 0.110092    
## EmbarkedQ           0.92685    0.38795   2.389 0.016890 *  
## EmbarkedS                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 834.17  on 622  degrees of freedom
## Residual deviance: 476.18  on 512  degrees of freedom
## AIC: 698.18
## 
## Number of Fisher Scoring iterations: 17
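The very large standard errors (around 6522) on most Cabin levels are a typical symptom of factor levels that occur in only one or two rows, so that the outcome for such a level is predicted perfectly. A rough sketch, on a made-up vector rather than the real column, of how such rare levels could be counted:

```r
# Made-up cabin labels for illustration (not the real column)
cabin <- c("", "", "", "C85", "C123", "E46", "G6", "G6", "", "B20")

# Frequency of each non-blank cabin label
level_counts <- table(cabin[cabin != ""])

# Labels that appear exactly once
singletons <- names(level_counts)[level_counts == 1]

print(level_counts)
print(singletons)
```

A predictor dominated by singleton levels adds many coefficients without adding information that generalizes, which is consistent with the large AIC of this model.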

Although several predictors (Age, Sex, and Pclass3) are significant for the target variable, most Cabin levels are not and have very large standard errors, so we refine the model with the stepwise method.

model2 <- stepAIC(model1, direction = "backward")
## Start:  AIC=698.18
## Survived ~ Age + Sex + Pclass + Cabin + Embarked
## 
##             Df Deviance    AIC
## - Cabin    103   584.64 600.64
## <none>           476.18 698.18
## - Embarked   2   483.29 701.29
## - Age        1   489.98 709.98
## - Pclass     2   496.19 714.19
## - Sex        1   578.89 798.89
## 
## Step:  AIC=600.64
## Survived ~ Age + Sex + Pclass + Embarked
## 
##            Df Deviance    AIC
## <none>          584.64 600.64
## - Embarked  3   591.61 601.61
## - Age       1   599.80 613.80
## - Pclass    2   652.54 664.54
## - Sex       1   722.70 736.70
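The AIC values in this trace can be checked by hand. For a binomial GLM fitted to 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below assumes 623 training rows (inferred from the 622 null degrees of freedom) and derives the coefficient count from the residual degrees of freedom; the helper name `aic_from_deviance` is made up for this illustration.

```r
n_obs <- 623  # training rows (null deviance has 622 degrees of freedom)

aic_from_deviance <- function(deviance, resid_df) {
  k <- n_obs - resid_df   # number of estimated coefficients
  deviance + 2 * k
}

aic_full    <- aic_from_deviance(476.18, 512)  # model with Cabin
aic_reduced <- aic_from_deviance(584.64, 615)  # model after dropping Cabin

print(c(full = aic_full, reduced = aic_reduced))
```

Dropping Cabin raises the deviance by about 108 but removes 103 coefficients, so the penalty term falls by 206 and the AIC improves.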

Using backward stepwise selection, we obtain the following model.

summary(model2)
## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass + Embarked, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5201  -0.6694  -0.4310   0.7097   2.4059  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  15.622859 535.411444   0.029 0.976722    
## Age          -0.033174   0.008758  -3.788 0.000152 ***
## Sexmale      -2.347520   0.215845 -10.876  < 2e-16 ***
## Pclass2      -1.105401   0.311185  -3.552 0.000382 ***
## Pclass3      -2.196170   0.286563  -7.664  1.8e-14 ***
## EmbarkedC   -11.935408 535.411312  -0.022 0.982215    
## EmbarkedQ   -11.645347 535.411380  -0.022 0.982647    
## EmbarkedS   -12.423658 535.411285  -0.023 0.981488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 834.17  on 622  degrees of freedom
## Residual deviance: 584.64  on 615  degrees of freedom
## AIC: 600.64
## 
## Number of Fisher Scoring iterations: 12
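The coefficients above are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read. The sketch below simply copies the four significant estimates from the summary output; with the fitted object available, `exp(coef(model2))` would give the same quantities.

```r
# Estimates copied from the summary(model2) output above
coefs <- c(Age = -0.033174, Sexmale = -2.347520,
           Pclass2 = -1.105401, Pclass3 = -2.196170)

# Odds ratio = exp(log-odds coefficient)
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

All four odds ratios are below 1: holding the other predictors fixed, the odds of survival fall with each additional year of age, and are lower for males than for females and for 2nd and 3rd class than for 1st class.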

Prediction

Using model2, we predict survival probabilities for the test data.

titanic_test$prob_surv <- predict(model2, type = "response", newdata = titanic_test)

Plot the distribution of the predicted probabilities.

ggplot(titanic_test, aes(x = prob_surv)) +
  geom_density(linewidth = 1) +
  labs(title = "Distribution of Predicted Probabilities") +
  theme_minimal()

In the plot above, the predicted probabilities lean toward 0. Next, we compare the predicted classes with the actual data.

titanic_test$pred_surv <- factor(ifelse(titanic_test$prob_surv > 0.5, yes = 1, no = 0))
titanic_test[1:10, c("pred_surv", "Survived")]
##    pred_surv Survived
## 3          1        1
## 4          1        1
## 5          0        0
## 6          0        0
## 7          0        0
## 8          0        0
## 16         1        1
## 17         0        0
## 19         0        0
## 21         0        0

Model Evaluation

Evaluate the model with confusionMatrix.

conf <- confusionMatrix(titanic_test$pred_surv, titanic_test$Survived, positive = "1")
conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 146  28
##          1  24  70
##                                           
##                Accuracy : 0.806           
##                  95% CI : (0.7535, 0.8516)
##     No Information Rate : 0.6343          
##     P-Value [Acc > NIR] : 7.312e-10       
##                                           
##                   Kappa : 0.5781          
##                                           
##  Mcnemar's Test P-Value : 0.6774          
##                                           
##             Sensitivity : 0.7143          
##             Specificity : 0.8588          
##          Pos Pred Value : 0.7447          
##          Neg Pred Value : 0.8391          
##              Prevalence : 0.3657          
##          Detection Rate : 0.2612          
##    Detection Prevalence : 0.3507          
##       Balanced Accuracy : 0.7866          
##                                           
##        'Positive' Class : 1               
## 
  • Recall/Sensitivity = of all actual positives, the proportion the model predicts correctly.
  • Specificity = of all actual negatives, the proportion the model predicts correctly.
  • Accuracy = the proportion of all observations for which the model predicts the target Y correctly.
  • Precision = of all observations predicted as positive, the proportion that are actually positive.

Based on the confusionMatrix results above, the model predicts the target Y (Survived and Not Survived) correctly in 80.6% of cases (accuracy). Of all passengers who actually did not survive, the model identifies 85.9% correctly (specificity). Of all passengers who actually survived, the model identifies 71.4% correctly (sensitivity). Of all passengers the model predicts as survivors, 74.5% actually survived (precision).
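These metrics can be reproduced by hand from the four cells of the confusion matrix. The sketch below copies the counts from the output above, with class 1 (survived) as the positive class; the variable names are chosen for this illustration only.

```r
# Counts copied from the confusion matrix above (positive class = 1)
TP <- 70   # predicted 1, actual 1
FN <- 28   # predicted 0, actual 1
FP <- 24   # predicted 1, actual 0
TN <- 146  # predicted 0, actual 0

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
accuracy    <- (TP + TN) / (TP + FN + FP + TN)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, accuracy = accuracy), 4))
```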

K-Nearest Neighbour

Pre-Processing Data

titanic_knn <- read.csv("train.csv")
titanic_knn <- titanic_knn[, -1]
titanic_knn$Survived <- as.factor(titanic_knn$Survived)
titanic_knn$Age[is.na(titanic_knn$Age)] <- mean(titanic_knn$Age, na.rm = TRUE)
head(titanic_knn)
##   Survived Pclass                                                Name    Sex
## 1        0      3                             Braund, Mr. Owen Harris   male
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
## 3        1      3                              Heikkinen, Miss. Laina female
## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female
## 5        0      3                            Allen, Mr. William Henry   male
## 6        0      3                                    Moran, Mr. James   male
##        Age SibSp Parch           Ticket    Fare Cabin Embarked
## 1 22.00000     1     0        A/5 21171  7.2500              S
## 2 38.00000     1     0         PC 17599 71.2833   C85        C
## 3 26.00000     0     0 STON/O2. 3101282  7.9250              S
## 4 35.00000     1     0           113803 53.1000  C123        S
## 5 35.00000     0     0           373450  8.0500              S
## 6 29.69912     0     0           330877  8.4583              Q
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

index <- sample(nrow(titanic_knn), nrow(titanic_knn)*0.7)

knn_train <- titanic_knn[index,]
knn_test <- titanic_knn[-index,]
prop.table(table(knn_train$Survived))
## 
##         0         1 
## 0.6083467 0.3916533

For k-NN, the predictors are separated from the label (the target variable).

library(dplyr)

knn_train_real <- knn_train %>% 
  dplyr::select(c(Survived,Pclass,Age,SibSp,Parch,Fare))

knn_test_real <- knn_test %>% 
  dplyr::select(c(Survived,Pclass,Age,SibSp,Parch,Fare))
train_x <- knn_train_real[,-1]

test_x <- knn_test_real[,-1]

train_y <- knn_train_real$Survived

test_y <- knn_test_real$Survived

The predictors are scaled using z-score standardization. The test data must be scaled with the parameters (mean and standard deviation) of the training data, because the test data is treated as unseen data.

train_xs <- scale(train_x)
test_xs <- scale(test_x,
                      center = attr(train_xs,"scaled:center") ,
                      scale = attr(train_xs,"scaled:scale"))
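To make the role of the `scaled:center` and `scaled:scale` attributes concrete, here is a small sketch on made-up numbers (not the Titanic predictors). It shows that the test rows are standardized with the training mean and standard deviation, not their own.

```r
# Made-up numeric predictors for illustration
train_toy <- data.frame(Age = c(20, 30, 40), Fare = c(10, 20, 60))
test_toy  <- data.frame(Age = c(25, 50),     Fare = c(15, 30))

# Standardize the training data; scale() stores the means and sds as attributes
train_scaled <- scale(train_toy)

# Reuse the training parameters for the test data
test_scaled <- scale(test_toy,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

# Equivalent manual computation for the Age value of the first test row
manual_age <- (25 - mean(train_toy$Age)) / sd(train_toy$Age)

print(test_scaled)
print(manual_age)
```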

Finding the optimal K

sqrt(nrow(train_xs))
## [1] 24.95997
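The source does not state how the final k is derived from this square root. One common heuristic, shown here purely as an assumption, is to round down and then step to the nearest odd number so that a two-class vote cannot tie:

```r
n_train <- 623                 # rows in the training set

k <- floor(sqrt(n_train))      # round the square root down
if (k %% 2 == 0) k <- k - 1    # prefer an odd k for a two-class problem

print(k)
```

Under this assumed rule the result is 23, which matches the value of k used in the `knn()` call.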

k-NN does not build a model, so we can proceed directly to prediction. Here k = 23 is used, an odd value close to the square root of the number of training rows.

pred_knn <- knn(train = train_xs,
                 test = test_xs,
                 cl = train_y,
                 k = 23)

Check the prediction results.

pred_knn
##   [1] 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0
##  [38] 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1
##  [75] 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0
## [112] 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 1 1
## [149] 1 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1
## [186] 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0
## [223] 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0 0
## [260] 0 0 0 1 0 0 0 0 0
## Levels: 0 1

Model Evaluation

confusionMatrix(data = pred_knn, reference = test_y, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 136  42
##          1  34  56
##                                           
##                Accuracy : 0.7164          
##                  95% CI : (0.6584, 0.7696)
##     No Information Rate : 0.6343          
##     P-Value [Acc > NIR] : 0.002792        
##                                           
##                   Kappa : 0.378           
##                                           
##  Mcnemar's Test P-Value : 0.422001        
##                                           
##             Sensitivity : 0.5714          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.6222          
##          Neg Pred Value : 0.7640          
##              Prevalence : 0.3657          
##          Detection Rate : 0.2090          
##    Detection Prevalence : 0.3358          
##       Balanced Accuracy : 0.6857          
##                                           
##        'Positive' Class : 1               
## 

Based on the confusionMatrix results above, the model predicts the target Y (Survived and Not Survived) correctly in 71.6% of cases (accuracy). Of all passengers who actually did not survive, the model identifies 80.0% correctly (specificity). Of all passengers who actually survived, the model identifies 57.1% correctly (sensitivity). Of all passengers the model predicts as survivors, 62.2% actually survived (precision).

Conclusion

Of the metrics above, I pay the most attention to sensitivity, because I do not want the model to predict that passengers who actually survived did not survive. On this metric, logistic regression (sensitivity 71.4%) outperforms k-NN (57.1%) on the test data.
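When sensitivity is the priority, the 0.5 cut-off used for the logistic regression predictions is one setting that could be revisited. The sketch below uses made-up probabilities and labels (not the model's output) to illustrate how lowering the threshold can raise sensitivity; the function name `sensitivity_at` is hypothetical.

```r
# Made-up predicted probabilities and true labels (1 = survived)
prob   <- c(0.92, 0.65, 0.45, 0.38, 0.30, 0.80, 0.20, 0.55, 0.10, 0.42)
actual <- c(1,    1,    1,    1,    0,    1,    0,    0,    0,    1)

# Share of actual positives that are predicted positive at a given cut-off
sensitivity_at <- function(threshold) {
  pred <- ifelse(prob > threshold, 1, 0)
  sum(pred == 1 & actual == 1) / sum(actual == 1)
}

sens_050 <- sensitivity_at(0.50)
sens_035 <- sensitivity_at(0.35)
print(c(at_0.50 = sens_050, at_0.35 = sens_035))
```

A lower threshold labels more passengers as survivors, so the gain in sensitivity usually comes at the cost of specificity and precision; any such change should be judged on held-out data.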